The final aim of a typical genomic experiment is to find a molecular explanation for a given macroscopic observation: knowing, for instance, which pathways are affected by the deprivation of glucose in a cell, or which biological processes differentiate a healthy control from a diseased case. This functional interpretation of the data is usually performed in two steps:
Several tools are available, such as FatiGO (Al-Shahrour, et al., 2004) and others (Zeeberg, et al., 2003; Khatri and Draghici, 2005), that use different functionally relevant annotations, such as GO terms (Ashburner, et al., 2000), KEGG pathways (Kanehisa, et al., 2004), etc.
Simple enrichment approaches are known to be less sensitive than set enrichment analyses. Whenever possible, set enrichment analysis is preferred over its simple enrichment counterpart.
FatiGO takes two lists of genes (ideally a group of interest and the rest of the genes in the experiment, although any two groups, formed in any way, can be tested against each other) and converts them into two lists of GO annotations using the corresponding gene-term or protein-term annotation table. A Fisher's exact test for 2×2 contingency tables is then used to check for significant over-representation of GO annotations in one of the sets with respect to the other. Multiple test correction is applied to account for the multiple hypotheses tested (one for each functional term).
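The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FatiGO's actual code: the function names, the toy gene identifiers and the annotation table are hypothetical, and genes without any annotation are simply counted as "without the term".

```python
from math import comb

def contingency_tables(list1, list2, annotations):
    """Build one 2x2 table (a, b, c, d) per term.

    a / b: genes in list 1 with / without the term
    c / d: genes in list 2 with / without the term
    """
    terms = set()
    for gene in list(list1) + list(list2):
        terms |= annotations.get(gene, set())
    tables = {}
    for term in terms:
        a = sum(term in annotations.get(g, set()) for g in list1)
        c = sum(term in annotations.get(g, set()) for g in list2)
        tables[term] = (a, len(list1) - a, c, len(list2) - c)
    return tables

def fisher_over_representation(a, b, c, d):
    """One-sided Fisher's exact test: P(X >= a) under the hypergeometric null."""
    n1 = a + b              # size of list 1
    k = a + c               # genes carrying the term in both lists
    total = a + b + c + d
    upper = min(n1, k)
    tail = sum(comb(k, x) * comb(total - k, n1 - x) for x in range(a, upper + 1))
    return tail / comb(total, n1)

# Hypothetical toy data: 5 genes of interest, 10 background genes.
list1 = ["g1", "g2", "g3", "g4", "g5"]
list2 = ["g%d" % i for i in range(6, 16)]
annotations = {g: {"apoptosis"} for g in ["g1", "g2", "g3", "g4", "g6"]}
annotations["g7"] = {"cell cycle"}

for term, table in sorted(contingency_tables(list1, list2, annotations).items()):
    print(term, table, fisher_over_representation(*table))
```

Each term yields one p-value, so the resulting p-values still need the multiple test correction mentioned above before they are interpreted.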
In addition to Gene Ontology (Ashburner et al., 2000) annotations, FatiGO can simultaneously test other functional and regulatory annotations, including KEGG pathways (Kanehisa et al., 2004), InterPro motifs (Mulder et al., 2003), microRNA (Griffiths-Jones et al., 2006), TFBSs (Wingender et al., 2000), cisRED motifs (Robertson et al., 2006), BioCarta pathways, etc. The distribution of any combination (or all) of the annotations between two groups of genes can be tested simultaneously by means of a Fisher exact test. All the p-values are adjusted using the Benjamini and Hochberg (B&H) FDR procedure.
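For readers unfamiliar with the B&H adjustment, a minimal sketch follows. It assumes the standard step-up formulation (each sorted p-value is multiplied by m/rank and monotonicity is enforced from the largest p-value downwards); the function name is hypothetical and this is not taken from the FatiGO code.

```python
def benjamini_hochberg(pvalues):
    """Return B&H adjusted p-values in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# One raw p-value per functional term tested (made-up numbers).
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

A term is then reported as significant when its adjusted p-value falls below the chosen threshold.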
The structure of the functional labels has an important impact on the strategy for performing the test. For example, KEGG pathways have a “flat” organization, with one or more pathways per gene. Terms in GO, on the other hand, are organized in a hierarchy called a DAG (directed acyclic graph), in which each term can have one or more child terms as well as one or more parent terms. Terms at higher levels of the hierarchy (closer to the root) describe more general functions or processes, while terms at lower levels are more specific. The level at which a gene is annotated in the GO hierarchy depends on how much detail the annotator had about its biological behaviour. Testing terms organised in this way poses an additional difficulty because in some cases they are not mutually exclusive but merely describe the same behaviour at different levels of detail (e.g. what is the point of testing apoptosis versus regulation of apoptosis?). The test is therefore performed at a chosen level of the hierarchy: genes annotated with terms that are descendants of a term at the chosen level take the annotation of that ancestor. If, for example, the level corresponding to apoptosis is selected, any gene annotated either as apoptosis or with any of its child terms is counted in the same category (apoptosis) for the test. This increases the power of the test, because there are fewer terms, each with more genes, to be tested (Al-Shahrour et al., 2004, 2005).
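The idea of collapsing annotations to a chosen level can be illustrated with a small sketch. The hierarchy below is a toy one mirroring the apoptosis example in the text, not the real GO graph, and defining a term's level as its shortest distance to the root is an assumption made only for this illustration; all names are hypothetical.

```python
def ancestors(term, parents):
    """Return the term itself plus all of its ancestors in the DAG."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def level(term, parents):
    """Shortest distance to the root (assumed definition of 'level')."""
    direct = parents.get(term, ())
    if not direct:
        return 0
    return 1 + min(level(p, parents) for p in direct)

def collapse_to_level(gene_terms, parents, target_level):
    """Replace each annotation by its ancestor(s) at the target level."""
    collapsed = {}
    for gene, terms in gene_terms.items():
        lifted = set()
        for term in terms:
            lifted |= {t for t in ancestors(term, parents)
                       if level(t, parents) == target_level}
        collapsed[gene] = lifted
    return collapsed

# Toy hierarchy (child -> list of parents), for illustration only.
parents = {
    "biological_process": [],
    "cell death": ["biological_process"],
    "apoptosis": ["cell death"],
    "regulation of apoptosis": ["apoptosis"],
}
gene_terms = {
    "geneA": {"apoptosis"},
    "geneB": {"regulation of apoptosis"},
    "geneC": {"cell death"},
}
print(collapse_to_level(gene_terms, parents, target_level=2))
```

In this sketch geneA and geneB end up in the same category (apoptosis), while geneC, annotated only above the chosen level, contributes nothing to the test at that level.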
FatiGO supports many gene identifiers for each organism (HGNC symbol, UniProt/Swiss-Prot, UniProtKB/TrEMBL, Ensembl IDs, RefSeq, EntrezGene, Affymetrix, Agilent, PDB, Protein Id, IPI…); the supported identifiers can be checked in the ID converter. These identifiers must be annotated in Ensembl: any gene not annotated in Ensembl will be lost in the analysis (please see the Ensembl documentation).
The input data format is a list with one gene or protein identifier per line. For example, a list of Saccharomyces cerevisiae identifiers:
YAL011W
GAL83
YDR116C
YGL104C
KNS1
ECM2
YHL018W
CDC45
YHL010C
YHR199C
SNO2
YJR141W
YOR059C
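If you prepare such lists programmatically, a short helper like the one below can check the one-identifier-per-line format before submission. This is a hypothetical convenience sketch, not part of FatiGO; skipping blank lines and dropping repeated identifiers are assumptions about what is sensible, not documented behaviour of the tool.

```python
def read_identifiers(text):
    """Parse a one-identifier-per-line list, keeping the first occurrence of each ID."""
    seen = set()
    identifiers = []
    for line in text.splitlines():
        identifier = line.strip()
        if identifier and identifier not in seen:
            seen.add(identifier)
            identifiers.append(identifier)
    return identifiers

example = "YAL011W\nGAL83\nYDR116C\n\nGAL83\n"
print(read_identifiers(example))
```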
An explanation of all parameters and output results is available on the FatiGO tool parameters page.
How the functional profiling should never be done
It is not uncommon to find the following assertion in papers and talks: “then we examined our set of genes selected in this way (whatever) and we discovered that 65% of them were related to metabolism, so we can conclude that our experiment activates metabolism genes”. This may or may not be true, depending on the relative abundance of this term. If you look at the rest of the genes, those not activated in the experiment, and the proportion of them related to metabolism is, let's say, 10%, then you are right. Conversely, if the proportion is, let's say, 61%, then the experiment probably has nothing to do with metabolism. A statistical comparison is required to support such assertions.
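To make the point concrete, the two scenarios can be compared with a one-sided Fisher's exact test. The list sizes used here (100 selected genes, 1000 remaining genes) are hypothetical, chosen only so that the percentages quoted above become counts; the function name is also an assumption of this sketch.

```python
from math import comb

def upper_tail(selected_hits, selected_size, rest_hits, rest_size):
    """P(at least `selected_hits` annotated genes in the selected list) under the null."""
    total = selected_size + rest_size
    hits = selected_hits + rest_hits
    top = min(selected_size, hits)
    tail = sum(comb(hits, x) * comb(total - hits, selected_size - x)
               for x in range(selected_hits, top + 1))
    return tail / comb(total, selected_size)

# 65% of 100 selected genes are "metabolism" in both scenarios.
p_when_rest_is_10_percent = upper_tail(65, 100, 100, 1000)
p_when_rest_is_61_percent = upper_tail(65, 100, 610, 1000)

print("rest at 10%:", p_when_rest_is_10_percent)
print("rest at 61%:", p_when_rest_is_61_percent)
```

With these made-up sizes the first p-value is vanishingly small while the second is far from any usual significance threshold, which is exactly the distinction that the raw "65%" figure hides.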
Comparing two lists of genes
There are many situations in which the comparison of two lists of genes answers a relevant biological question. In fact, a large number of problems can be addressed in this way. For example, one might be interested in knowing whether a group of genes that co-express are functionally related. Typically this implies comparing a set of genes that clustered together (by any clustering method) to the rest of the genes. Another commonly addressed question is whether genes differentially expressed between two experimental conditions are functionally related. Many other similar questions are commonly asked when analysing microarray data or, in general, genomic data. FatiGO has been specifically designed to answer these kinds of questions.
The simplest use of the tool is to have a quick look at the functional processes in which a set of genes takes part. The submitted list of genes is analysed against the rest of the genome to assess the significance of the abundance of GO terms or other annotation sets.
The number of significant functional terms is summarized in a table. If you look at the significant results, you can sort them by the adjusted p-value.
You will get a summary table with the number of significant GO terms associated with the genes, and then a table for each database with information about the test for each of the significant functional terms. The table can be sorted by the difference in the percentage of genes annotated with the GO term in each list, by the p-value or by the adjusted p-value, and is accompanied by a graphical distribution of the frequencies. As you can see, the red bars are darker than the blue ones; this means that the terms found are enriched only in List 1, the one we submitted, because we chose the Over-represented terms of list 1 option. In this example the significant terms are quite general, as they belong to levels 3 to 6. You can also see a graphical representation of the Gene Ontology terms coloured by their adjusted p-value.
Submit other jobs, experimenting with other parameters of the Gene Ontology database (ontology, maximum and minimum level, and the direct annotation option, which uses the parents of the terms where the genes are directly annotated), other databases and other p-value thresholds.
As in the previous worked example, FatiGO can be used to check further functional information such as pathways, motifs, transcription factors…
The number of significant functional terms for each database is summarized in a table. If you look at the significant results, you can sort them by the adjusted p-value.
You will get a summary table with the number of significant GO terms associated with the genes, and then a table for each database with information about the test for each of the significant functional terms. The table can be sorted by the difference in the percentage of genes annotated with the GO term in each list, by the p-value or by the adjusted p-value, and is accompanied by a graphical distribution of the percentages. As you can see, the red bars are darker than the blue ones; this means that the terms found are enriched only in List 1, the one we submitted, because we chose the Over-represented terms of List 1 option.
The GO terms are related to apoptosis, and so are the significant KEGG pathways. The significant BioCarta pathways are also related to the apoptotic process (not surprising, given that the list was selected to contain genes related to apoptosis).
Afterwards, launch more jobs choosing other databases, or several at a time, and changing the option parameters.
Let us illustrate the application of FatiGO with a classical example. We use the data from Chu et al. (1998), The Transcriptional Program of Sporulation in Budding Yeast, Science, 282, 699-705, and cluster the genes according to their expression patterns. We choose a cluster of co-expressing genes and check the hypothesis that “genes of similar function will tend to co-express”.
If we compare it to the rest of the genes in the experiment, we can see that several terms related to meiosis and chromosome component are significantly over-represented in the cluster of co-expressing genes. Keep in mind that this test assumes that you do not have any a priori hypothesis about which biological process is operating in this particular cluster of genes.
Similarly, you can explore functional differences using other biologically relevant terms, such as pathway membership or reactions in Reactome. We can use FatiGO for this purpose.
The significant results are enriched only in the apoptosis-related list. The terms are associated with programmed cell death, as can be seen in the GO term descriptions or the BioCarta pathways. The clearest result is the single significant Reactome reaction, which, not surprisingly, is apoptosis.
We will proceed in several steps. First we will cluster the genes, then we will extract a cluster, and finally we will compare that cluster to the rest of the genes in the experiment in order to see whether one or more biologically relevant terms are over-represented in it.
The data set used corresponds to the previously mentioned experiment on the diauxic shift in S. cerevisiae, carried out by a group at Stanford University (DeRisi et al., 1997, Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale. Science, 278, 680-686). Diauxie describes the growth phases of a microbial culture as it metabolizes a mixture of sugars. During the first phase, cells preferentially metabolize the sugar whose catabolism is most efficient (often glucose). Only after the first sugar has been exhausted do the cells switch to the second. At the time of the diauxic shift there is often a lag period during which the cell produces the enzymes needed to metabolize the second sugar. The diauxic shift frequently represents a change in metabolism from glucose fermentation to aerobic respiration as the glucose is depleted.
1st part: clustering the gene expression patterns
2nd part: extract the genes of the cluster
Click on the profile of the 1447 gene cluster. You will get a pop-up window containing a list of the genes belonging to the cluster. You can download the cluster by copying and pasting it into a text file, or you can send it directly to FatiGO for the functional analysis. The aim is to test our cluster against the rest of the genes in the experiment.
3rd part: analyse the cluster in FatiGO
The clustering tool will send the extracted cluster as List 1 and the remaining genes as List 2. Don't forget to choose Over-represented terms in List1 as the Fisher exact test option, the complementary list, the species and the functional databases to test in your cluster. Keep in mind that we want to functionally characterize the 1447 gene cluster with respect to the rest of the genes in the experiment.
Terms related to phosphorylation in biological process, terms related to the mitochondria and the respiratory chain in cellular component, and the oxidative phosphorylation KEGG pathway are directly involved in the diauxic shift process studied in this experiment.
Try other clusters and other functional databases.
Marmite stands for My Accurate Resource for MIning TExt and implements single enrichment analysis with annotations derived from text mining. Text-mining methods make it possible to extract informative annotations (bioentities) with functional, chemical, clinical and other meanings that can be associated with genes. In this case, the association of an annotation with a gene has a strength derived from the number of times that the gene and the annotation are co-cited in a PubMed abstract. A Kolmogorov-Smirnov test is used instead of the conventional Fisher's exact test. Multiple test correction is applied to account for the multiple hypotheses tested (one for each annotation).
Data are provided by BioAlma, which generated the associations using almaKnowledgeServer.
Starting with a set of documents (e.g. the documents where a certain gene or a disease appears), we can define keywords as those words that are significantly over-represented compared to a standard set or background. These words, which appear with much higher frequencies than one would expect by chance, can be considered the content words that capture the main features of this set of documents. In addition to single words, bi-grams (two adjacent words) were taken into account, because in many cases these terms contain more information than single words (e.g. “cell cycle” vs. “cell”, “cycle”). We refer to words and bi-grams as terms in the following. All words were stemmed before further treatment to increase the statistical significance of words. For each term i, the number of documents where i appears is calculated both in the whole collection of documents (Xi in Ndoc, our background) and in a specific document set a (Xia in Na). Then, based on the hypergeometric distribution, the likelihood of finding Xia documents in a set of size Na is computed for each term. The more unlikely this event is, the more specific the term i is for the document set.
Definitions:
Na ... number of documents in the set a
Ndoc ... number of documents in the entire collection
Xi ... number of documents in Ndoc where term i appears
Xia ... number of documents in Na where term i appears
Formula for calculating keyword relevance:
Mean value for term i in collection Na : Mia = Na * (Xi /Ndoc)
The standard deviation of the distribution : σia = sqrt(Mia * (1 - Xi/Ndoc) * (1 - Na/Ndoc))
The Z-score for each term i in a; the higher the score the more relevant is a term for the document set : Zia = (Xia - Mia)/σia
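The three formulas translate directly into code. The sketch below follows them exactly as written in this document (including the variance factor (1 - Na/Ndoc)); the function name and the example counts are hypothetical.

```python
from math import sqrt

def keyword_z_score(x_ia, x_i, n_a, n_doc):
    """Z-score of term i for document set a.

    x_ia  : documents in set a containing the term (Xia)
    x_i   : documents in the whole collection containing the term (Xi)
    n_a   : size of document set a (Na)
    n_doc : size of the whole collection (Ndoc)
    """
    mean = n_a * (x_i / n_doc)                                  # Mia
    sd = sqrt(mean * (1 - x_i / n_doc) * (1 - n_a / n_doc))     # sigma_ia
    if sd == 0:
        raise ValueError("degenerate case: the standard deviation is zero")
    return (x_ia - mean) / sd                                   # Zia

# Made-up counts: the term occurs in 200 of 10000 documents overall,
# but in 12 of the 100 documents associated with the gene of interest.
print(keyword_z_score(x_ia=12, x_i=200, n_a=100, n_doc=10000))
```

In this made-up case only 2 documents would be expected by chance, so observing 12 gives a high Z-score and the term would be considered relevant for the set.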
Marmite evaluates the differences between the gene-bioentity co-occurrence values (scores) for two lists of genes. We apply a Kolmogorov-Smirnov test to each pair of distributions (one per list) formed by the scores of the co-occurrences between a bioentity and the genes within the list. No null values are included in the distributions to be evaluated; that is, only genes with a score indicating co-occurrence with the bioentity are included.
Marmite only evaluates bioentities associated with a minimum number of genes within both lists (the default and minimum is 5, although the user can set a different value). This gives the user a way to control how well the bioentities presented in the results represent each list as a whole. Each test is applied on both sides; that is, for each bioentity we apply two tests, to see whether the distribution of list 1 is greater or smaller than the distribution of list 2. Marmite takes the multiple testing problem into account and adjusts the p-values using FDR.
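The comparison for a single bioentity can be sketched as follows. This is an illustration only, not Marmite's implementation: the function name is hypothetical, null scores are represented as None, and the p-values use the classical large-sample approximation exp(-2·D²·nm/(n+m)) for the one-sided two-sample statistic, which is rough for lists as small as the minimum of 5 genes.

```python
from bisect import bisect_right
from math import exp

def compare_bioentity(scores1, scores2, min_genes=5):
    """One-sided two-sample Kolmogorov-Smirnov tests in both directions."""
    # Keep only genes that actually co-occur with the bioentity.
    s1 = sorted(s for s in scores1 if s is not None)
    s2 = sorted(s for s in scores2 if s is not None)
    n, m = len(s1), len(s2)
    if n < min_genes or m < min_genes:
        return None  # bioentity not evaluated
    d_greater = 0.0  # evidence that list 1 scores tend to be larger
    d_smaller = 0.0  # evidence that list 1 scores tend to be smaller
    for x in s1 + s2:
        f1 = bisect_right(s1, x) / n  # empirical CDF of list 1 at x
        f2 = bisect_right(s2, x) / m  # empirical CDF of list 2 at x
        d_greater = max(d_greater, f2 - f1)
        d_smaller = max(d_smaller, f1 - f2)
    scale = n * m / (n + m)
    return {
        "d_list1_greater": d_greater,
        "p_list1_greater": min(1.0, exp(-2.0 * scale * d_greater ** 2)),
        "d_list1_smaller": d_smaller,
        "p_list1_smaller": min(1.0, exp(-2.0 * scale * d_smaller ** 2)),
    }

# Made-up co-occurrence scores for one bioentity in two gene lists.
list1_scores = [0.9, 0.8, 0.85, 0.95, 0.7, None]
list2_scores = [0.1, 0.2, 0.15, 0.3, 0.25]
print(compare_bioentity(list1_scores, list2_scores))
```

The two p-values obtained for every bioentity would then go through the FDR adjustment before being reported.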
We only provide annotations for Homo sapiens.
This is a simple example of how Marmite can be applied to the functional annotation of experiments.
We used data from a microarray experiment [ West et al. (2005) PLoS Biol 3:e187 ] that studies the differences in the transcriptome between two types of soft tissue tumours (muscles, tendons, fibrous tissues, nerves, etc.): SFT (Solitary Fibrous Tumor) and DTF (Desmoid-type fibromatosis).
These tumours are very different in clinical behaviour but quite similar histologically. This makes microarrays a very useful experimental approach for learning more about these kinds of cancers and their differences in gene expression, and a valuable technique for inferring new diagnostic markers.
Briefly, the experiment includes classes for DTF, SFT and other types of cancer; see West et al. for more details.
According to the authors the gene expression patterns are quite different for these two types of tumours and their separation in clusters is very clear.
With these premises, we extracted the samples (columns) for these two classes and then applied an unsupervised clustering algorithm (the SOTA method) to the preprocessed gene expression matrix for these classes, to obtain groups of genes with similar expression patterns. From the visualization of the clustering we took the two main clusters of genes which, if the premise is true, should give two groups of genes with different expression profiles in the two classes of tumours (SFT and DTF). Important differences in expression patterns can be appreciated between the two clusters in classes SFT and DTF. The grouping has therefore succeeded in separating genes with important roles in the two classes.
We extracted these two clusters and obtained two lists of genes that we used as input for Marmite (list 1 and list 2).
Marmite can extract the differences in the distribution of the co-occurrence measures of these two lists of genes against three lists of bioentities: Disease associated words, Chemical products and Word roots. The bioentities are associated with genes by a score, which measures the weight of the association found between each pair (gene-word) in the scientific literature.
Marmite returned a few bioentities (words) with significant differences between the two lists (list 1 > list 2).
These data are valuable for annotating our experiment: we can see that list 1 contains genes with very specific co-occurrences with words of great importance in the characterization of this kind of tumour.