Genome-scale process for identifying signal transduction proteins
Signal transduction proteins are identified based on the presence and/or absence of specific signaling domain profiles that directly or indirectly implicate a protein in signal transduction. This process entails retrieving the full genomic dataset, predicting the full domain architecture of every associated protein, and finally classifying as signaling proteins those that possess specific signaling domains.
We have developed a computational pipeline that performs this identification automatically as new genomes are integrated into MiST2 (see adjacent figure). First, all complete and draft genomes with RefSeq annotations are downloaded from NCBI, and the entire genomic record is saved to a relational database. Next, the full domain architecture of each protein sequence is predicted using several tools and domain libraries. The principal component of this step uses the HMMER software to detect well-defined domains by scanning against three domain libraries: 1) Pfam; 2) Agfam, an internal collection of signal transduction profiles; and 3) ECF domain models. The remaining architectural components are predicted transmembrane regions (DAS), coiled coils (Coils), and low-complexity regions (Seg). Finally, any protein whose domain architecture matches one of several signal transduction family "signatures" (domain-based rules) is tagged as a signal transduction protein. Specifics about the tools used in MiST2 are given in the bioinformatic tools section.
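The final classification step can be pictured as matching each protein's predicted domain set against a list of presence/absence rules. The following is a minimal sketch only, not the MiST2 implementation: the `Signature` class, the `classify` and `tag_proteins` functions, and the four example rules are all hypothetical and chosen for illustration. The domain names are Pfam-style identifiers, but the actual MiST2 signatures are not given here and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """A hypothetical domain-based rule: required and forbidden domains."""
    family: str
    required: frozenset
    forbidden: frozenset = frozenset()

    def matches(self, domains):
        # All required domains are present and no forbidden domain is present.
        return self.required <= domains and not (self.forbidden & domains)


# Illustrative rules only (assumed, not taken from MiST2).
SIGNATURES = [
    Signature("histidine kinase",
              frozenset({"HisKA", "HATPase_c"}),
              frozenset({"Response_reg"})),
    Signature("hybrid histidine kinase",
              frozenset({"HisKA", "HATPase_c", "Response_reg"})),
    Signature("response regulator",
              frozenset({"Response_reg"}),
              frozenset({"HisKA"})),
    Signature("chemoreceptor", frozenset({"MCPsignal"})),
]


def classify(domains):
    """Return the families whose signature matches the given domain names."""
    domain_set = frozenset(domains)
    return [sig.family for sig in SIGNATURES if sig.matches(domain_set)]


def tag_proteins(architectures):
    """Map protein ID -> matched families, keeping only signaling proteins."""
    tagged = {}
    for protein_id, domains in architectures.items():
        families = classify(domains)
        if families:
            tagged[protein_id] = families
    return tagged
```

In this sketch the "absence" half of a rule is expressed through the `forbidden` set, which is what keeps a hybrid kinase from also being tagged as a plain histidine kinase or a response regulator. Domain order and copy number are ignored for simplicity; a real rule set might need both.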