Annotation with Blast2GO

Introduction

Annotation is the process of assigning functional categories to genes or gene products. In Blast2GO this assignment is done for each sequence based on the information available for the homologous sequences retrieved by Blast. Blast2GO annotation proceeds through a two-step strategy:

  1. All GO terms for the Blast hit sequences are collected
  2. A selection of terms is made from this original pool to extract the most reliable annotation

For the first step, Blast results are parsed and the identifiers of the Blast hits are used to query the Gene Ontology database and recover the associated functional terms. The evidence code of each annotation is recovered as well. Evidence codes indicate how a functional assignment in the Gene Ontology database was obtained. For example, the evidence code “inferred from direct assay” (IDA) indicates that the function was assigned to the gene on the basis of an experimental assay. Such an annotation is therefore of high value. The evidence code “inferred from electronic annotation” (IEA) means that the annotation was generated by automatic methods without human intervention, and is therefore more likely to be erroneous.
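
The first step can be pictured as pooling, for every GO term, the similarity and evidence code of each hit that carries it. The following is only a minimal sketch under stated assumptions: the `hits` records, their field names and the `collect_go_pool` helper are hypothetical illustrations, not Blast2GO's actual data structures or API.

```python
from collections import defaultdict

# Hypothetical, already-parsed Blast hits (made-up values for illustration).
# Each hit has a similarity (%) and the GO annotations recovered for it,
# given as (GO term, evidence code) pairs.
hits = [
    {"id": "hit1", "similarity": 60, "annotations": [("GO1", "IDA")]},
    {"id": "hit2", "similarity": 65, "annotations": [("GO2", "ISS")]},
    {"id": "hit3", "similarity": 67, "annotations": [("GO3", "IEA")]},
]


def collect_go_pool(hits):
    """Group (similarity, evidence code) pairs by GO term."""
    pool = defaultdict(list)
    for hit in hits:
        for go_id, evidence in hit["annotations"]:
            pool[go_id].append((hit["similarity"], evidence))
    return dict(pool)


pool = collect_go_pool(hits)
for go_id, support in sorted(pool.items()):
    print(go_id, support)
```

Keeping the evidence code next to each similarity value is what later allows a weight to be applied per annotation.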

Once all this information is gathered, an annotation score is computed for each {GO, Query Sequence} pair. The GO term is assigned to the query sequence if its annotation score is above a threshold provided by the user and no child term has a sufficient annotation score. The annotation score is computed as:

                       Annotation score{GO, Seq} = (max.sim * ECw) + ((#GO - 1) * GOw)

where:

  • max.sim: the maximal similarity between the query and the hit sequences that have the given GO annotation
  • ECw: the weight given to the Evidence Code of the original annotation. Blast2GO has defined values for these weights, which can also be modified by the user. In general, ECw = 1 for experimental evidence codes and ECw < 1 for non-experimental evidence codes.
  • #GO: the number of annotated child terms (in the worked example below, a term with no annotated children is counted with #GO = 1, so its child contribution is zero)
  • GOw: the weight given to the contribution of annotated child terms to a given term
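
As a minimal sketch, the formula above can be written as a small function. The function and parameter names are chosen here for illustration only and are not part of Blast2GO.

```python
def annotation_score(max_sim, ec_weight, n_go, go_weight):
    """Annotation score = (max.sim * ECw) + ((#GO - 1) * GOw)."""
    return max_sim * ec_weight + (n_go - 1) * go_weight


# A term supported by a 60% similar hit with an experimental evidence code
# (ECw = 1) and no annotated children (#GO = 1): the child term adds nothing.
print(annotation_score(max_sim=60, ec_weight=1.0, n_go=1, go_weight=5))

# A parent term with two annotated children (#GO = 2) gains one GOw.
print(round(annotation_score(max_sim=67, ec_weight=0.7, n_go=2, go_weight=5), 1))
```

Note that the subtraction is applied before the multiplication by GOw, which is why the term is written as (#GO - 1) * GOw.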

EXAMPLE

Consider a given query sequence with three hit sequences, with the following GO terms:

      Hit sequence 1: 60% similarity; one GO term: GO1 with Evidence Code = IDA
      Hit sequence 2: 65% similarity; one GO term: GO2 with Evidence Code = ISS
      Hit sequence 3: 67% similarity; one GO term: GO3 with Evidence Code = IEA
                  GO2 and GO3 are sibling terms with parent term GO4

Let us compute the Annotation Score (AS) and the annotation output in a number of scenarios:

  • Scenario 1:
    1. ECw(IDA) = 1; ECw(ISS) = 0.8; ECw(IEA) = 0.7 (Evidence Code control)
    2. Annotation threshold is set to 55
    3. GOw = 0 (no contribution from child terms)
           AS(GO1) = (60 * 1) + ((1-1) * 0) = 60 > 55 --> GO1 is transferred to the query sequence
           AS(GO2) = (65 * 0.8) + ((1-1) * 0) = 52 < 55 --> GO2 is NOT transferred
           AS(GO3) = (67 * 0.7) + ((1-1) * 0) = 46.9 < 55 --> GO3 is NOT transferred
           AS(GO4) = (67 * 0.7) + ((2-1) * 0) = 46.9 < 55 --> GO4 is NOT transferred
  • Scenario 2:
    1. ECw(IDA) = 1; ECw(ISS) = 0.8; ECw(IEA) = 0.7 (Evidence Code control)
    2. Annotation threshold is set to 55
    3. GOw = 5 (the children contribution is enabled)
         AS(GO1) = (60 * 1) + ((1-1) * 5) = 60 > 55 --> GO1 is transferred to the query sequence
         AS(GO2) = (65 * 0.8) + ((1-1) * 5) = 52 < 55 --> GO2 is NOT transferred
         AS(GO3) = (67 * 0.7) + ((1-1) * 5) = 46.9 < 55 --> GO3 is NOT transferred
         AS(GO4) = (67 * 0.7) + ((2-1) * 5) = 51.9 < 55 --> GO4 is NOT transferred (the children contribution raises the score, but not above the threshold)
  • Scenario 3:
    1. ECw(IDA) = 1; ECw(ISS) = 0.8; ECw(IEA) = 0.7 (Evidence Code control)
    2. Annotation threshold is set to 50
    3. GOw = 5 (the children contribution is enabled)
         AS(GO1) = (60 * 1) + ((1-1) * 5) = 60 > 50 --> GO1 is transferred to the query sequence
         AS(GO2) = (65 * 0.8) + ((1-1) * 5) = 52 > 50 --> GO2 is transferred to the query sequence
         AS(GO3) = (67 * 0.7) + ((1-1) * 5) = 46.9 < 50 --> GO3 is NOT transferred
         AS(GO4) = (67 * 0.7) + ((2-1) * 5) = 51.9 > 50 --> GO4 is NOT transferred (a child term, GO2, is already transferred)
  • Scenario 4:
    1. ECw(IDA) = 1; ECw(ISS) = 1; ECw(IEA) = 1 (no Evidence Code control)
    2. Annotation threshold is set to 55
    3. GOw = 5 (the children contribution is enabled)
         AS(GO1) = (60 * 1) + ((1-1) * 5) = 60 > 55 --> GO1 is transferred to the query sequence
         AS(GO2) = (65 * 1) + ((1-1) * 5) = 65 > 55 --> GO2 is transferred
         AS(GO3) = (67 * 1) + ((1-1) * 5) = 67 > 55 --> GO3 is transferred
         AS(GO4) = (67 * 1) + ((2-1) * 5) = 72 > 55 --> GO4 is NOT transferred (child terms are already transferred)
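
The selection rule used in these scenarios (score above the threshold, and no child term that also passes) can be sketched in a few lines. This is an illustrative reconstruction from the example only: the `CANDIDATES` table and the `score` and `annotate` helpers are hypothetical, only direct children are checked, and nothing here is taken from Blast2GO's own code.

```python
# Candidate GO terms from the example:
# term -> (max.sim, evidence code, #GO, direct child terms)
CANDIDATES = {
    "GO1": (60, "IDA", 1, []),
    "GO2": (65, "ISS", 1, []),
    "GO3": (67, "IEA", 1, []),
    "GO4": (67, "IEA", 2, ["GO2", "GO3"]),
}


def score(term, ec_weights, go_weight):
    """Annotation score = (max.sim * ECw) + ((#GO - 1) * GOw)."""
    max_sim, code, n_go, _children = CANDIDATES[term]
    return max_sim * ec_weights[code] + (n_go - 1) * go_weight


def annotate(threshold, ec_weights, go_weight):
    """Return the terms transferred to the query sequence."""
    passing = {t for t in CANDIDATES if score(t, ec_weights, go_weight) > threshold}
    # A passing term is dropped when one of its children also passes,
    # so that only the most specific term is kept.
    return sorted(
        t for t in passing
        if not any(child in passing for child in CANDIDATES[t][3])
    )


controlled = {"IDA": 1.0, "ISS": 0.8, "IEA": 0.7}
uncontrolled = {"IDA": 1.0, "ISS": 1.0, "IEA": 1.0}

print(annotate(50, controlled, 5))    # Scenario 3: ['GO1', 'GO2']
print(annotate(55, uncontrolled, 5))  # Scenario 4: ['GO1', 'GO2', 'GO3']
```

Changing the threshold, the evidence code weights or GOw in the calls to `annotate` is a quick way to explore how each parameter shifts the final annotation.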

OTHER FINE-TUNING PARAMETERS

Additionally, a number of filters can be used to fine-tune the annotation:

  • Number of blast hits: number of hit sequences to extract functional information from
  • Blast min hsp length: the minimal length (in amino acids) of the matching region for a blast hit to be considered
  • Blast Hit Description Filter: an exclusion term for sequence descriptions: you can use this to exclude certain organisms from your hit sequences
  • Blast Hit Description Position:
  • Add ID to Blast Definition?: if you want to add the original sequence ID to the sequence definition
  • Blast descriptor annotator?: a text-mining method that avoids assigning uninformative sequence descriptions. By default, Blast2GO gives the query sequence the name of the first Blast hit. However, sometimes this first hit is of the type “unnamed sequence” or “XVRST967q56”. In that case, Blast2GO looks for a more informative description among the remaining hits and returns it when available, or NA otherwise. If you want to keep the description of the first hit regardless of its semantic value, turn this box off.
  • E-Value-Hit-Filter: although the annotation formula in Blast2GO uses the similarity value, this option also lets you set an e-value cutoff that a hit sequence must meet to provide candidate annotations
  • Hsp-Hit Coverage CutOff: this filter excludes hit sequences where the Hsp spans less than the specified percentage of the hit sequence. If you set a high value (e.g. 80% or 90%), far fewer hit sequences will be recovered, but you reduce the probability of transferring functions that are located in non-matching fragments of the hit sequence.
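
To make the effect of these filters concrete, here is a minimal sketch that applies several of them to a list of hits. Everything in it is an assumption made for illustration: the record fields, the `filter_hits` helper, the order in which the filters are applied and the cutoff values are hypothetical and are not Blast2GO's defaults or implementation.

```python
def filter_hits(hits, max_hits, min_hsp_length, max_evalue, min_coverage,
                exclude_term=None):
    """Keep the hits that pass every filter (hypothetical field names)."""
    kept = []
    for hit in hits[:max_hits]:                    # Number of blast hits
        if hit["hsp_length"] < min_hsp_length:     # Blast min hsp length
            continue
        if hit["evalue"] > max_evalue:             # E-Value-Hit-Filter
            continue
        if hit["hsp_hit_coverage"] < min_coverage: # Hsp-Hit Coverage CutOff (%)
            continue
        if exclude_term and exclude_term.lower() in hit["description"].lower():
            continue                               # Blast Hit Description Filter
        kept.append(hit)
    return kept


# Made-up hits for illustration only.
hits = [
    {"id": "hit1", "description": "protein kinase [Citrus sinensis]",
     "evalue": 1e-40, "hsp_length": 210, "hsp_hit_coverage": 85.0},
    {"id": "hit2", "description": "unnamed protein product [Vitis vinifera]",
     "evalue": 1e-30, "hsp_length": 25, "hsp_hit_coverage": 90.0},
    {"id": "hit3", "description": "protein kinase [Homo sapiens]",
     "evalue": 1e-3, "hsp_length": 150, "hsp_hit_coverage": 40.0},
]

kept = filter_hits(hits, max_hits=20, min_hsp_length=33,
                   max_evalue=1e-6, min_coverage=80.0)
print([hit["id"] for hit in kept])
```

Here hit2 is rejected for its short Hsp and hit3 for its e-value and coverage, so only hit1 would go on to contribute candidate annotations.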

Exercise

Generate the GO functional annotation for 500 citrus genes and analyse the annotation results (takes some 3 to 5 minutes).

  • Please save this file to your PC and unzip it (use the right mouse button): zipped XML file. This file contains 500 nucleotide sequences blasted against the NCBI non-redundant protein database using the blastx program. Results are provided in XML format.
  • Upload the XML file into Babelomics with the “upload data” menu, selecting the data type “blast”
  • Once the data is uploaded you can run Blast2GO. Change the annotation parameters if desired and press “RUN” to start the annotation process.
  • Interpret the generated charts.

How to cite Blast2GO?

blast2go.txt · Last modified: 2017/05/24 10:36 (external edit)