Table of Contents
- 1. Creation of the 'constant.properties' file
- 2. Creation of 'data.txt' file defines data sets
- 3. Creation of matrix files in the 'data' folder
- 4. Creation of Subject and Trait annotation files
- 5. Creation of other meta data files
How to describe an investigation in XGAP format
In a typical genotype-to-phenotype study, there is information about:
- genotyping (markers measured on individuals),
- phenotyping (traits measured on individuals),
- derived data such as QTL profiles, and
- procedural metadata, for example explaining the protocols used.
The XGAP tab-delimited text file format can capture all of this information.
Below we use the MetaNetwork investigation as an example to explain the use of the XGAP format. In MetaNetwork, the individuals belong to a certain Strain, which is of a certain inbreeding type. The Traits are in this case Metabolites having certain mass/charge annotations. Next to the genotype and metabolite data matrices, each of the Markers and Metabolite traits has additional annotations attached. This data is recorded as follows.
An XGAP data fileset is typically created in five steps:
- constant.properties (optional): this file allows the central definition of values that are static within the whole data set. Here we use it to define 'investigation_name' and 'species_name' centrally.
- data.txt: this file lists the data matrix files in this set.
- Data matrix files: these files contain the observed/calculated data values on Subjects and/or Traits.
- Subject and Trait annotation files: these files list information about what was measured (Traits: Marker, Metabolite) and on whom it was measured (Subjects: Individual, Strain).
- Metadata files: these files contain general investigation information, in this case on Investigation, Species (OntologyTerm?) and Bibliographicalreferences.
- All files are normally in a tabular format requiring particular column headers. One exception is the data matrices, which are two-dimensional and have both column headers and row headers. Another exception is constant.properties, which has one 'key=value' pair per row.
- In practice an XGAP file set contains only one investigation, which is what makes constant.properties practical. However, the format allows multiple investigations in one file set.
Below each of these files is created for the MetaNetwork example.
1. Creation of the 'constant.properties' file
This is an optional step. The constant properties file allows central definition of constant values, so that one doesn't need to provide them in each file. For example, each annotation file normally needs a column 'investigation_name' to denote the investigation in which a particular piece of information was defined (matrix files are the exception). However, this would be the same value over the whole data set. Therefore a mechanism has been implemented to define such values centrally.
In the example of MetaNetwork this file looks as follows:
#values that are constant in this file set
#for all entities holds that
investigation_name = MetaNetwork
species_name = Arabidopsis thaliana
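As a rough illustration of the 'key=value' mechanism, the sketch below reads such a file and fills the constants into an annotation row that lacks them. The function names are made up for this example and are not part of any XGAP tooling, and the rule that a value given in a file overrides the constant is an assumption.

```python
def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    constants = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        constants[key.strip()] = value.strip()
    return constants


def apply_constants(row, constants):
    """Return a copy of row with the constants added for missing columns."""
    merged = dict(constants)
    merged.update(row)  # assumption: a value given in the file itself wins
    return merged


EXAMPLE = """\
#values that are constant in this file set
#for all entities holds that
investigation_name = MetaNetwork
species_name = Arabidopsis thaliana
"""

constants = parse_properties(EXAMPLE)
print(apply_constants({"name": "X1", "strain_name": "Ler x Cvi"}, constants))
```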
2. Creation of 'data.txt' file defines data sets
All XGAP data sets have a data.txt that lists the data matrices in the set. To ensure suitable annotations, the column and row headers of each matrix are always coupled to specific annotations while the matrix cells contain the observed values (see examples below). The file data.txt describes these relationships, as well as the matrix dimensions and the type of data in the cells (decimal or textual).
To describe data matrices, the data.txt has the following columns:
|name||name of the data set. In this case 'data_genotypes' and 'data_metaboliteexpression'|
|investigation_name||name of the investigation this data set is part of. Omitted here because it is provided in the constant.properties file|
|rowType||type of the Subjects or Traits that the row headers refer to (here Marker or Metabolite)|
|colType||type of the Subjects or Traits that the column headers refer to (here Individual)|
|valueType||specification of what type of data is in this matrix, either Decimal for numeric data or Textual for non-numeric data|
|totalRows||total number of rows of this matrix|
|totalCols||total number of columns of this matrix|
For the MetaNetwork study the data.txt looks as follows:
name                       rowType     colType     valueType  totalRows  totalCols
data_genotypes             Marker      Individual  Decimal    117        162
data_metaboliteexpression  Metabolite  Individual  Decimal    24         162
As you can see, the genotypes have rows with Markers, and columns with Individuals.
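To make the column rules concrete, here is a minimal Python sketch that reads the content of a data.txt and checks the columns described above. It assumes the file is tab-delimited and that 'investigation_name' may be absent (because it comes from constant.properties); `read_data_index` is a hypothetical helper name, not an XGAP function.

```python
import csv
import io

REQUIRED = ["name", "rowType", "colType", "valueType", "totalRows", "totalCols"]
VALUE_TYPES = {"Decimal", "Textual"}


def read_data_index(text):
    """Parse tab-delimited data.txt content into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = [col for col in REQUIRED if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"data.txt lacks columns: {missing}")
    entries = []
    for row in reader:
        if row["valueType"] not in VALUE_TYPES:
            raise ValueError(f"{row['name']}: unknown valueType {row['valueType']!r}")
        # Dimensions are stored as text in the file; convert them to integers.
        row["totalRows"] = int(row["totalRows"])
        row["totalCols"] = int(row["totalCols"])
        entries.append(row)
    return entries


DATA_TXT = (
    "name\trowType\tcolType\tvalueType\ttotalRows\ttotalCols\n"
    "data_genotypes\tMarker\tIndividual\tDecimal\t117\t162\n"
    "data_metaboliteexpression\tMetabolite\tIndividual\tDecimal\t24\t162\n"
)

entries = read_data_index(DATA_TXT)
for entry in entries:
    print(entry["name"], ":", entry["rowType"], "x", entry["colType"])
```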
3. Creation of matrix files in the 'data' folder
Each of the data sets described in the data.txt file should be available in a subfolder called 'data'. For the creation of these files the following rules hold:
- The names of these files should match the names in data.txt with the suffix of '.txt'. In the MetaNetwork example there should be 'data_genotypes.txt' and 'data_metaboliteexpression.txt'.
- The column and row headers should match appropriate names in the referred annotation files. For example, 'data_genotypes' is a matrix of Marker x Individual (rows x columns), so its headers should refer to values in 'marker.txt' and 'individual.txt'.
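The file naming rule can be checked mechanically. The following sketch, with a hypothetical helper `missing_matrix_files`, lists the 'data/<name>.txt' files that data.txt names but that are not present; it only illustrates the rule and runs against a temporary folder.

```python
import tempfile
from pathlib import Path


def missing_matrix_files(names, root):
    """Return the expected 'data/<name>.txt' paths that do not exist under root."""
    expected = [Path(root) / "data" / f"{name}.txt" for name in names]
    return [path for path in expected if not path.is_file()]


# Demo in a throwaway directory: only the genotype matrix is present.
with tempfile.TemporaryDirectory() as root:
    data_dir = Path(root) / "data"
    data_dir.mkdir()
    (data_dir / "data_genotypes.txt").write_text('"X1"\t"X3"\n"PVV4"\t1\t1\n')
    missing = missing_matrix_files(
        ["data_genotypes", "data_metaboliteexpression"], root
    )
    missing_names = [path.name for path in missing]

print(missing_names)  # ['data_metaboliteexpression.txt']
```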
The genotype data reports genotypic observations on markers (rows) and individuals (columns); the two alleles are denoted by '1' and '2'. A snapshot of this data matrix:
"X1" "X3" "X4" "X5" "X6"
"PVV4" 1 1 2 1 2
"AXR-1" 1 1 2 1 2
"HH.335C-Col" 1 1 1 1 2
"DF.162L/164C-Col" 1 1 1 1 2
"EC.480C" 1 1 1 1 2
Note that the column headers (X1, ...) should refer to 'name' values in 'individual.txt' and that the row headers (PVV4, ...) should refer to 'name' values in 'marker.txt'. See below.
The matrix with traits has information about one or more traits, in this case metabolites (rows), measured on the same individuals (columns) that were also genotyped. A snapshot of this data matrix:
"X1" "X3" "X4" "X5" "X6"
"3-Hydroxypropyl" NA 942 2402 602 213
"4-Hydroxybutyl" NA 4 10 183 198
"4-Methylsulfinylbutyl" NA 55 62 13386 1671
"3-Butenyl" NA 84 32 18 4339
"3-Methylthiopropyl" NA 3108 569 4 7
Note that the column headers (X1, ...) should refer to 'name' values in 'individual.txt' and that the row headers (3-Hydroxypropyl, ...) should refer to 'name' values in 'metabolite.txt'. See below.
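The header rule amounts to a simple membership check. A minimal sketch, assuming the 'name' values of the annotation files have already been read into lists; `unknown_headers` is an illustrative name and "X7" is a made-up header used only to show a failing case.

```python
def unknown_headers(headers, annotated_names):
    """Return the headers that have no matching 'name' value, keeping their order."""
    known = set(annotated_names)
    return [header for header in headers if header not in known]


# 'name' values as they would be read from individual.txt and metabolite.txt.
individual_names = ["X1", "X3", "X4", "X5", "X6"]
metabolite_names = [
    "3-Hydroxypropyl",
    "4-Hydroxybutyl",
    "4-Methylsulfinylbutyl",
    "3-Butenyl",
    "3-Methylthiopropyl",
]

# Headers of a matrix; "X7" is a made-up column without an annotation.
column_headers = ["X1", "X3", "X7"]
row_headers = ["3-Hydroxypropyl", "3-Butenyl"]

bad_columns = unknown_headers(column_headers, individual_names)
bad_rows = unknown_headers(row_headers, metabolite_names)
print(bad_columns)  # ['X7']
print(bad_rows)     # []
```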
Notes about the matrix file format
The double quotes are not necessary, but can prevent confusion during parsing. The importing process determines the value separator (tab in this case), and names containing a lot of whitespace can (in rare cases) cause the parser to think that whitespace is the separator.
Notice that the column header is not exactly on top of the data columns but shifted one to the left. This is because the row headers also form a column but contain no data, so the 'first' column header is omitted. Inserting only a separator character as the first value is allowed as well.
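These notes can be turned into a small reader. The sketch below is one possible way to parse such a matrix in Python, not the actual XGAP importer: it assumes tab as separator, accepts a header row with or without a leading empty field, and treats 'NA' as a missing value (an assumption based on the metabolite snapshot).

```python
import csv
import io


def read_matrix(text, decimal=True):
    """Parse a tab-delimited matrix into (column_names, {row_name: values})."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t", quotechar='"') if r]
    header, body = rows[0], rows[1:]
    # The header may start with an empty field (a lone separator) above the row names.
    if header and header[0] == "":
        header = header[1:]
    matrix = {}
    for fields in body:
        name, cells = fields[0], fields[1:]
        if len(cells) != len(header):
            raise ValueError(f"row {name!r}: {len(cells)} values, {len(header)} columns")
        if decimal:
            # Assumption: 'NA' marks a missing observation.
            cells = [None if cell == "NA" else float(cell) for cell in cells]
        matrix[name] = cells
    return header, matrix


MATRIX = (
    '"X1"\t"X3"\t"X4"\t"X5"\t"X6"\n'
    '"3-Hydroxypropyl"\tNA\t942\t2402\t602\t213\n'
    '"4-Hydroxybutyl"\tNA\t4\t10\t183\t198\n'
)

columns, matrix = read_matrix(MATRIX)
print(columns)
print(matrix["3-Hydroxypropyl"])
```

Returning the row names as dictionary keys keeps the row headers separate from the data cells, which mirrors the shifted header described above.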
4. Creation of Subject and Trait annotation files
From the data sets we referred to annotations on Individuals, Markers and Metabolite traits. Below it is shown how to add annotations for each of these. Again, the annotations go into a file with the same name and a '.txt' suffix. So the annotations of Individuals go into 'individual.txt', Strains go into 'strain.txt', Markers go into 'marker.txt', and Metabolites go into 'metabolite.txt'.
For the Individuals there is not much information in this case, only their name and their strain of origin. The data model also allows for optional pedigree information. A snapshot of the 'individual.txt' annotation file:
name  strain_name
X1    Ler x Cvi
X3    Ler x Cvi
X4    Ler x Cvi
X5    Ler x Cvi
X6    Ler x Cvi
The 'strain_name' column is a reference to a different type of Subject in the database, Strain. Notice that we do not refer to this Strain by a numeric database id (which will be assigned by the database and cannot be known at this point) but by a special syntax: the "_name" suffix. This means the parser will automatically make the reference to the correct Strain by identifying it by its 'name' attribute. However, no such Strain is present yet. We add it by creating 'strain.txt', below.
'strain.txt' annotation file
In this case only the strain type is known: recombinant inbred by selfing (riself).
name       straintype
Ler x Cvi  riself
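The "_name" reference mechanism can be pictured as a lookup by name. The following is a minimal sketch of that idea, assuming that a column '<entity>_name' points at the 'name' column of the file '<entity>.txt'; `unresolved_references` is a hypothetical helper, and the real parser may work differently.

```python
def unresolved_references(rows, tables):
    """List (row name, column, value) for '<entity>_name' values without a match."""
    known = {entity: {rec["name"] for rec in recs} for entity, recs in tables.items()}
    problems = []
    for row in rows:
        for column, value in row.items():
            if not column.endswith("_name"):
                continue
            entity = column[: -len("_name")]  # 'strain_name' -> 'strain'
            if value not in known.get(entity, set()):
                problems.append((row["name"], column, value))
    return problems


individuals = [
    {"name": name, "strain_name": "Ler x Cvi"}
    for name in ("X1", "X3", "X4", "X5", "X6")
]
strains = [{"name": "Ler x Cvi", "straintype": "riself"}]

before = unresolved_references(individuals, {})                 # strain.txt not yet created
after = unresolved_references(individuals, {"strain": strains})  # strain.txt added
print(len(before), len(after))  # 5 0
```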
'marker.txt' annotation file
The marker annotations go in 'marker.txt'. Here we add vital information for further analysis: the chromosome on which each marker is located, and its centiMorgan position on that chromosome. It may look like this:
"name","chr","cm"
"PVV4",1,0
"AXR-1",1,6.398
"HH.335C-Col",1,10.786
"DF.162L/164C-Col",1,12.913
"EC.480C",1,15.059
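As an illustration of why chromosome and position matter for later analysis, the sketch below reads the marker annotations and orders them along the genome. It assumes the comma-separated layout of the snapshot and numeric chromosome labels; `read_markers` is an illustrative name.

```python
import csv
import io


def read_markers(text):
    """Read marker annotations and sort them by chromosome, then by cM position."""
    reader = csv.DictReader(io.StringIO(text))
    markers = [
        # Assumption: chromosomes are numbered, so they can be sorted as integers.
        {"name": row["name"], "chr": int(row["chr"]), "cm": float(row["cm"])}
        for row in reader
    ]
    return sorted(markers, key=lambda marker: (marker["chr"], marker["cm"]))


MARKER_TXT = '''"name","chr","cm"
"PVV4",1,0
"AXR-1",1,6.398
"HH.335C-Col",1,10.786
"DF.162L/164C-Col",1,12.913
"EC.480C",1,15.059
'''

markers = read_markers(MARKER_TXT)
for marker in markers:
    print(marker["chr"], marker["cm"], marker["name"])
```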
'metabolite.txt' annotation file
We also add annotations for the metabolites, though with no additional information at this point. It is still valuable to add them to the database, as more annotations may become available later. This also ensures consistency if multiple observations on the same metabolites are included, such as QTL profiles or correlation data.
"name"
"3-Hydroxypropyl"
"4-Hydroxybutyl"
"4-Methylsulfinylbutyl"
"3-Butenyl"
"3-Methylthiopropyl"
5. Creation of other meta data files
XGAP allows for many more annotations; see XgapDataModel for a listing. In this case we only describe the investigation under which all information is stored, in 'investigation.txt', together with the species studied and the related publication.
The 'investigation.txt' file can hold a name and, optionally, a start date and an end date. In this case we only provide a name:
name
MetaNetwork
Minimal information on the species studied has also been added, as well as a short name to be used in this study.
name
Arabidopsis thaliana
We also add information concerning the publication for this investigation in 'bibliographicalreference.txt'.
name            authors                                     publication  publisher  editor  year  volume  issue  pages   title
PMID: 17406631  Fu J, Swertz MA, Keurentjes JJ, Jansen RC.  Nat Protoc.  -          -       2007  -       -      685-94  MetaNetwork: a computational protocol for the genetic study of metabolic networks.
This example set can be downloaded from: