Abbreviation | Species |
Ath | Arabidopsis thaliana |
Dme | Drosophila melanogaster |
Eco | Escherichia coli (strain K12) |
Hsa | Homo sapiens |
Mmu | Mus musculus |
Sce | Saccharomyces cerevisiae |
With the aim of achieving wide coverage, we integrated publicly available repositories of experimental data collected either through manual curation or direct deposit by the authors: IntAct (2011-01-19 version), MINT (2011-01-19 version) and BioGRID (version 3.1.72).
PPI data were downloaded in PSI-MI format version 2.5. These three databases were chosen because they are comparable: all of them provide interaction data in the standard PSI-MI 2.5 format, which uses a controlled vocabulary. The PSI-MI format was defined by the Proteomics Standards Initiative (PSI) of the Human Proteome Organization (HUPO) as the standard interchange format for molecular interaction data. PSI-MI contains the minimum information required for reporting a molecular interaction experiment (MIMIx) in an XML schema and is annotated using the detailed controlled vocabulary organized in the Molecular Interactions (MI) ontology.
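As an illustration of what reading such a file involves, the following minimal sketch extracts the participants and the interaction type from a PSI-MI 2.5 XML document using only the Python standard library. The embedded sample is hand-written and heavily simplified (it has not been validated against the schema, and its namespace and accessions are illustrative), and the helper names `local`, `find_all` and `parse_interactions` are hypothetical. It is not the parser used in this work.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample; real files carry many more elements.
SAMPLE = """<entrySet xmlns="net:sf:psidev:mi" level="2" version="5">
  <entry>
    <interactorList>
      <interactor id="1">
        <xref><primaryRef db="uniprotkb" id="P04637"/></xref>
      </interactor>
      <interactor id="2">
        <xref><primaryRef db="uniprotkb" id="Q00987"/></xref>
      </interactor>
    </interactorList>
    <interactionList>
      <interaction id="10">
        <participantList>
          <participant id="100"><interactorRef>1</interactorRef></participant>
          <participant id="101"><interactorRef>2</interactorRef></participant>
        </participantList>
        <interactionType>
          <names><shortLabel>physical association</shortLabel></names>
        </interactionType>
      </interaction>
    </interactionList>
  </entry>
</entrySet>"""


def local(tag):
    """Drop a '{namespace}' prefix so the code does not depend on it."""
    return tag.rsplit("}", 1)[-1]


def find_all(elem, name):
    """Return elem and all its descendants whose local tag name equals `name`."""
    return [e for e in elem.iter() if local(e.tag) == name]


def parse_interactions(xml_text):
    root = ET.fromstring(xml_text)

    # Internal interactor id -> (database, identifier) of its primary reference.
    accession = {}
    for interactor in find_all(root, "interactor"):
        refs = find_all(interactor, "primaryRef")
        if refs:
            accession[interactor.get("id")] = (refs[0].get("db"), refs[0].get("id"))

    records = []
    for interaction in find_all(root, "interaction"):
        members = [accession.get(ref.text.strip())
                   for ref in find_all(interaction, "interactorRef")]
        itype = find_all(interaction, "interactionType")
        labels = find_all(itype[0], "shortLabel") if itype else []
        records.append({
            "participants": members,
            "type": labels[0].text if labels else None,
        })
    return records


records = parse_interactions(SAMPLE)
```

The sketch resolves only `interactorRef` participants; files that embed the full interactor inside each participant would need an additional branch.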
In order to integrate and unify PPIs stored in PSI-MI 2.5 files coming from different databases, several issues must be taken into consideration:
A protein may be annotated with different identifiers.
Not all information is presented at the same level of detail.
The quality of the data regarding the significance of the interaction in vivo is problematic. The experimental conditions used to determine interactions can be expected to have produced a substantial number of false positives.
To address the drawbacks mentioned above, we established a methodology for PPI curation consisting of three essential steps:
The first step consists of parsing the PSI-MI 2.5 files and determining which identifiers in one set correspond to which identifiers in the other sets. To address this issue, all interactors were mapped to a reference protein defined by UniProt Swiss-Prot. Interactors that could not be mapped were discarded, and the corresponding interactions were not considered.
Next, the interactions were filtered to retain only those PPIs whose interaction type was “physical association”. This filtering step excludes genetic interactions, which do not necessarily imply physical contact between gene products.
Finally, potentially artefactual PPIs were filtered out. As stated above, false positives are common in interaction data, especially those derived from high-throughput technologies. To avoid selecting PPIs determined only through experiments with a similar basis (e.g. “two hybrid array” and “two hybrid gal4 vp16 complementation”), the six lower levels of depth of the “interaction detection method” branch of the MI ontology were used. Using the taxon-specific field, an interactome was generated for each of the following species: Arabidopsis thaliana, Drosophila melanogaster, Escherichia coli (strain K12), Homo sapiens, Mus musculus and Saccharomyces cerevisiae.
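The identifier mapping, the interaction-type filter and the per-species split described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the function `curate`, and the cross-reference table `TO_REFERENCE` (with its made-up identifiers) are hypothetical, and the type filter is assumed to be an exact match on the label “physical association”. The taxon identifiers 9606 and 10090 are the NCBI Taxonomy identifiers for Homo sapiens and Mus musculus.

```python
from collections import defaultdict

# Hypothetical cross-reference table: (database, identifier) -> reference
# accession. All identifiers below are made up for illustration.
TO_REFERENCE = {
    ("uniprotkb", "REF_A"): "REF_A",
    ("refseq", "ALIAS_OF_B"): "REF_B",
    ("uniprotkb", "REF_B"): "REF_B",
    ("uniprotkb", "REF_C"): "REF_C",
}


def curate(records, to_reference):
    """Map interactors to reference proteins, keep physical associations,
    and group the resulting undirected pairs by taxon."""
    by_taxon = defaultdict(set)
    for rec in records:
        # Drop anything that is not a physical association
        # (for example, genetic interactions).
        if rec["type"] != "physical association":
            continue
        a = to_reference.get(rec["a"])
        b = to_reference.get(rec["b"])
        # Discard the interaction if either interactor cannot be mapped.
        if a is None or b is None:
            continue
        # Store the pair in a canonical order so A-B and B-A coincide.
        by_taxon[rec["taxon"]].add(tuple(sorted((a, b))))
    return dict(by_taxon)


records = [
    {"a": ("uniprotkb", "REF_A"), "b": ("refseq", "ALIAS_OF_B"),
     "type": "physical association", "taxon": "9606"},
    {"a": ("uniprotkb", "REF_B"), "b": ("uniprotkb", "REF_A"),
     "type": "physical association", "taxon": "9606"},
    {"a": ("uniprotkb", "REF_A"), "b": ("uniprotkb", "REF_C"),
     "type": "genetic interaction", "taxon": "9606"},
    {"a": ("uniprotkb", "REF_C"), "b": ("ensembl", "UNKNOWN"),
     "type": "physical association", "taxon": "10090"},
]

interactomes = curate(records, TO_REFERENCE)
```

In this toy input the first two records collapse into a single pair because both identifiers of the second protein resolve to the same reference accession. The third record is removed by the type filter and the fourth by the mapping step.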
For each species, two scaffold interactomes were generated:
Curated interactome, containing high-confidence physical interactions detected with two different techniques.
Non-curated interactome, containing all physical interactions.
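One way to read the distinction between the two scaffolds is sketched below: each detection method is collapsed to an ancestor at a fixed depth of the method hierarchy, so that closely related methods count as a single technique, and a pair enters the curated set only if at least two distinct collapsed techniques support it. The `PARENT` table is a hand-written toy hierarchy, not the actual MI ontology, and the depth value, the function names and the collapsing rule itself are assumptions made for illustration.

```python
from collections import defaultdict

# Toy child -> parent hierarchy (hypothetical; not the real MI ontology).
PARENT = {
    "two hybrid array": "two hybrid",
    "two hybrid gal4 vp16 complementation": "two hybrid",
    "two hybrid": "complementation assay",
    "pull down": "affinity technology",
    "complementation assay": "interaction detection method",
    "affinity technology": "interaction detection method",
}


def path_from_root(term, parent):
    """Return the list of terms from the root of the hierarchy down to `term`."""
    path = [term]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]


def collapse(term, parent, depth):
    """Replace `term` by its ancestor at `depth` (root is depth 0);
    terms shallower than `depth` are returned unchanged."""
    path = path_from_root(term, parent)
    return path[min(depth, len(path) - 1)]


def build_interactomes(evidence, parent, depth=2, min_methods=2):
    """Return (curated, non_curated) sets of undirected protein pairs."""
    methods = defaultdict(set)
    for a, b, method in evidence:
        methods[tuple(sorted((a, b)))].add(collapse(method, parent, depth))
    non_curated = set(methods)
    curated = {pair for pair, used in methods.items() if len(used) >= min_methods}
    return curated, non_curated


evidence = [
    ("A", "B", "two hybrid array"),
    ("B", "A", "two hybrid gal4 vp16 complementation"),
    ("A", "C", "two hybrid array"),
    ("C", "A", "pull down"),
]
curated, non_curated = build_interactomes(evidence, PARENT)
```

Here the pair A-B is supported by two method labels that collapse to the same technique, so it appears only in the non-curated set, whereas A-C is supported by two distinct techniques and also enters the curated set.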
Because many genomic experiments generate only gene-level data (but not protein or transcript data), we also built gene-level versions of the curated and non-curated interactomes by mapping proteins onto genes.
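The projection from proteins onto genes can be sketched as a simple relabelling of the pairs. The mapping table, the identifiers and the function `project_to_genes` are hypothetical, and the handling of edge cases (proteins without a gene are dropped, a protein mapped to several genes yields one pair per gene, and self-pairs are kept) is an assumption rather than a description of the procedure actually applied.

```python
def project_to_genes(protein_pairs, protein_to_gene):
    """Translate undirected protein pairs into undirected gene pairs."""
    gene_pairs = set()
    for a, b in protein_pairs:
        # A protein missing from the table contributes no gene pair.
        for gene_a in protein_to_gene.get(a, ()):
            for gene_b in protein_to_gene.get(b, ()):
                gene_pairs.add(tuple(sorted((gene_a, gene_b))))
    return gene_pairs


# Hypothetical protein -> genes table with made-up identifiers.
protein_to_gene = {
    "P1": ["geneX"],
    "P2": ["geneY"],
    "P3": ["geneY"],   # two proteins encoded by the same gene
}

protein_pairs = {("P1", "P2"), ("P1", "P3"), ("P2", "P4")}
gene_pairs = project_to_genes(protein_pairs, protein_to_gene)
```

Because two proteins may map to the same gene, the gene-level interactome can contain fewer pairs than the protein-level one, as in this toy input.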