BASys Annotations Breakdown

Listed here are the algorithms used by the BASys pipeline to generate genome annotations. Where the purpose of an annotation may not be immediately obvious, a note describing its relevance is provided. These annotation strategies may change over time, so BASys annotations are designated with a version number that identifies the annotation strategy used for a particular annotation report. A copy of this file is also included with every annotation report.

Version: 1.0
Date: Feb 25, 2005

Annotation Label | Description | Relevance
Creation Date The time and date that these annotations were created.
Entry ID The default is to use the Accession Number with the revision number "1" appended (e.g. BASYS00001.1). Tracking revisions to the annotation set.
Accession No. Supplied to BASys via a gene identification file, or can be generated automatically. BASys uses "BASYSxxxxx" as a default, where "xxxxx" is a sequential five-digit number (Ex: BASYS00001, BASYS00002, etc.). A unique identifier for the gene card within a set of gene cards.
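The default identifiers described in the two rows above can be sketched as follows. This is only an illustration of the naming scheme; the function names are hypothetical and are not part of BASys.

```python
def default_accession(index: int) -> str:
    """Return a default accession such as BASYS00001 for a 1-based gene index."""
    if not 1 <= index <= 99999:
        raise ValueError("index must fit in five digits (1..99999)")
    return f"BASYS{index:05d}"


def default_entry_id(accession: str, revision: int = 1) -> str:
    """Append the revision number to the accession, e.g. BASYS00001.1."""
    return f"{accession}.{revision}"


print(default_accession(1))                     # BASYS00001
print(default_entry_id(default_accession(2)))   # BASYS00002.1
```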
SWISS PROT (AC and ID) If a similarity search of the query sequence exactly matches a sequence in the SwissProt database of expertly curated protein sequences, then BASys supplies the SwissProt Accession number here. Cross-reference
Other Databases Transitively applied from a similarity (BLASTP) search of the translated query sequence against SwissProt.
Minimum E-value: 1e-10
Cross-reference
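The E-value threshold of 1e-10 recurs throughout this table. As a minimal sketch of how such a filter might be applied, the code below assumes that the search results are available as tab-separated lines with the twelve standard columns of BLAST tabular output (subject ID in column 2, E-value in column 11). That format is an assumption; this page does not say how BASys stores its hits. The threshold is read here as an upper bound: a hit qualifies when its E-value is at most 1e-10.

```python
E_VALUE_CUTOFF = 1e-10  # threshold quoted in this table


def best_hit(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Return (subject_id, evalue) for the lowest-E-value hit passing the cutoff, else None."""
    best = None
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        subject_id, evalue = fields[1], float(fields[10])
        if evalue <= cutoff and (best is None or evalue < best[1]):
            best = (subject_id, evalue)
    return best


# Hypothetical hits; the subject IDs are placeholders, not real database entries.
hits = [
    "\t".join(["query1", "subjA", "99.1", "350", "3", "0", "1", "350", "1", "350", "1e-180", "640"]),
    "\t".join(["query1", "subjB", "31.0", "120", "80", "3", "5", "124", "9", "128", "2e-05", "48.1"]),
]
print(best_hit(hits))
```

With these two example hits only `subjA` passes the cutoff; a query whose hits all resemble `subjB` yields `None`, which corresponds to the "no hit" branches described elsewhere in this table.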
Gene Position The gene position (start, stop, and location) is supplied to BASys in the form of a gene identification file, or a Glimmer prediction.
Centisome Position This is calculated as the distance of the midpoint of the gene's location along the chromosome, in percent. Commonly used location reference
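A minimal sketch of the centisome calculation, assuming 1-based inclusive coordinates with start less than or equal to end and a gene that does not wrap around the origin (the function name is hypothetical):

```python
def centisome_position(start: int, end: int, genome_length: int) -> float:
    """Midpoint of the gene expressed as a percentage of the chromosome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    midpoint = (start + end) / 2
    return 100.0 * midpoint / genome_length


# A gene spanning 1000..3000 on a 100,000-base chromosome has its midpoint at 2000.
print(centisome_position(1000, 3000, 100000))   # 2.0
```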
Gene Name
  1. Transitively applied from a SwissProt similarity search.
    Minimum E-value: 1e-10
  2. If there is no hit to SwissProt, then the gene name is transitively applied from a CCDB similarity search.
    Minimum E-value: 1e-10
  3. If there is no hit to SwissProt or CCDB, then the gene name is inherited from the Accession Number.
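The three-step fallback for Gene Name can be sketched as below. The sketch assumes the two similarity searches have already been run and filtered at the 1e-10 threshold, so each argument is either a name or `None`; the function name is hypothetical.

```python
def assign_gene_name(swissprot_name, ccdb_name, accession):
    """Return the first available name: SwissProt hit, then CCDB hit, then the accession."""
    for candidate in (swissprot_name, ccdb_name):
        if candidate:
            return candidate
    return accession


print(assign_gene_name(None, None, "BASYS00042"))   # BASYS00042
```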
Alternate Gene Names Any additional gene names compiled from a Gene Name search are listed here.
Upstream 100 Bases Computed directly from the gene location and genomic sequence information. If the genome is linear and the gene location is less than 100 bases from the beginning of the genome sequence, BASys reports the number of upstream bases to the beginning of the genome sequence. Likely location of promoter sequences and 5' UTR
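The upstream extraction might look like the sketch below. It covers forward-strand genes only and uses 1-based gene coordinates. The wrap-around for circular genomes is an assumption, since this page describes only the linear case.

```python
def upstream_bases(genome: str, gene_start: int, circular: bool = False, length: int = 100) -> str:
    """Return up to `length` bases immediately upstream of a forward-strand gene.

    gene_start is 1-based, so the gene's first base is genome[gene_start - 1].
    """
    first = gene_start - 1              # 0-based index of the gene's first base
    if first >= length:
        return genome[first - length:first]
    if circular:
        # Assumption: on a circular genome the window wraps around the origin.
        missing = length - first
        return genome[-missing:] + genome[:first]
    # Linear genome: report only the bases that exist before the gene.
    return genome[:first]


genome = "ACGT" * 50                    # toy 200-base sequence
print(len(upstream_bases(genome, 151)))  # 100
print(upstream_bases(genome, 11))        # ACGTACGTAC
```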
Gene Sequence The gene sequence is supplied to BASys in the form of a gene identification (".ffn") file, a Glimmer prediction, or from the chromosome data.
GC Content [Percent] Calculated directly from the Gene Sequence. Primer /Oligo design, other
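A minimal sketch of the GC calculation. It assumes that any ambiguous bases (such as N) count towards the sequence length but not towards the G+C total; how BASys treats them is not stated here.

```python
def gc_content_percent(sequence: str) -> float:
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


print(gc_content_percent("ATGC"))   # 50.0
```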
Preceding Gene Determined from the coding sequence information supplied to BASys, or from a Glimmer prediction. If there is no preceding gene (i.e. the genome is linear), then BASys annotates this field with a dash (-).
Following Gene Determined from the coding sequence information supplied to BASys, or from a Glimmer prediction. If there is no following gene (i.e. the genome is linear), then BASys annotates this field with a dash (-).
Operon Status Determined from an Operon Components analysis. BASys annotates this field as Yes or No depending on whether or not it is included in an operon.
Operon Components BASys assigns as an operon all contiguous coding regions along the same strand (i.e. uninterrupted by a coding sequence on the reverse strand) where the end position of one gene is within -15 to +30 bases (inclusive) of the start position of the following gene (from [1]). The genes share a related function.
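The grouping rule can be sketched as below. Two details are assumptions rather than statements from this page: the spacing is measured as the number of intergenic bases (next left coordinate minus previous right coordinate minus one, negative when genes overlap), and genes are compared in positional order using their left and right coordinates regardless of strand. The `Gene` type and function name are hypothetical.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    left: int     # lowest coordinate, 1-based
    right: int    # highest coordinate, 1-based inclusive
    strand: str   # "+" or "-"


def group_operons(genes: List[Gene], min_gap: int = -15, max_gap: int = 30) -> List[List[str]]:
    """Group positionally adjacent same-strand genes whose spacing lies in [min_gap, max_gap]."""
    operons: List[List[str]] = []
    previous = None
    for gene in sorted(genes, key=lambda g: g.left):
        joins_previous = (
            previous is not None
            and gene.strand == previous.strand
            and min_gap <= gene.left - previous.right - 1 <= max_gap
        )
        if joins_previous:
            operons[-1].append(gene.name)
        else:
            operons.append([gene.name])
        previous = gene
    return operons


genes = [
    Gene("a", 1, 300, "+"),
    Gene("b", 310, 900, "+"),      # 9 intergenic bases after a: same operon
    Gene("c", 1000, 1500, "+"),    # 99 intergenic bases after b: new group
    Gene("d", 1490, 2000, "-"),    # strand change: new group
    Gene("e", 2010, 2500, "-"),    # 9 intergenic bases after d: same operon
]
print(group_operons(genes))   # [['a', 'b'], ['c'], ['d', 'e']]
```

Under this sketch, Operon Status would be Yes for any gene whose group contains more than one member (a, b, d and e in the example) and No otherwise.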
Protein Name
  1. Transitively applied from a similarity (BLASTP) search of the translated query sequence against SwissProt.
    Minimum E-value: 1e-10
  2. If there is no hit to SwissProt, or the hit is to a hypothetical protein, then BASys attempts to transitively assign a protein name from a similarity (BLASTP) search of the translated query sequence against a non-redundant database of bacterial proteins.
    Minimum E-value: 1e-10
  3. If there is no hit to SwissProt or the nr database of bacterial proteins, or the hits are only to hypothetical proteins, then BASys annotates the protein name as hypothetical, but appends the Accession Number to the protein name.
Describes the gene product's function.
Alternate Protein Names
  1. If a Protein Name search returns a hit from a similarity search against SwissProt, then any additional, non-hypothetical protein names are listed here.
    Minimum E-value: 1e-10
  2. If a Protein Name search returns no hit to SwissProt, but does return a hit to the non-redundant database of bacterial proteins, then any additional, non-hypothetical protein names from the nr-bacterial db are listed here.
    Minimum E-value: 1e-10
Sequence Translated: BASys translates the Gene Sequence using the codon table specified with the chromosome submission: Bacterial (default), Universal, or Mycoplasma.
Mature: The translated sequence minus any predicted signal peptides. BASys combines machine learning and heuristics to predict signal peptides:
  1. Signal peptides are predicted with the program PredictSPTM (J. Cruz, unpublished). PredictSPTM uses profile HMMs based on multiple alignments of representative sequences from SignalP [2]. PredictSPTM is part of the Proteus web server for protein structure prediction. Prokaryotic signal peptides are predicted with an accuracy of Q2=95%.
  2. If no signal peptides are predicted, then the N-terminal methionine is removed if the neighbouring residue is a proline, serine, alanine, glycine, or threonine residue. Otherwise the mature sequence remains the same as the translated sequence.
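The translation step and the N-terminal methionine rule (step 2 of the Mature sequence) can be sketched as follows. The sketch assumes that "Bacterial" and "Universal" share the amino acid assignments of the standard genetic code and that "Mycoplasma" differs only in reading TGA as tryptophan, as in the NCBI translation tables. It does not handle alternative start codons (for example GTG or TTG read as methionine) or signal peptide prediction. Function names are hypothetical.

```python
BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (standard genetic code).
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_table(name: str = "bacterial") -> dict:
    """Build a codon -> amino acid map for one of the three named tables."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    table = dict(zip(codons, STANDARD_AAS))
    if name == "mycoplasma":
        table["TGA"] = "W"    # TGA codes for tryptophan instead of stop
    elif name not in ("bacterial", "universal"):
        raise ValueError(f"unknown codon table: {name}")
    return table


def translate(gene_sequence: str, table_name: str = "bacterial") -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    table = codon_table(table_name)
    seq = gene_sequence.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = table.get(seq[i:i + 3], "X")   # X marks codons with ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def mature_without_signal_peptide(translated: str) -> str:
    """Drop the N-terminal Met when the next residue is P, S, A, G or T."""
    if len(translated) > 1 and translated[0] == "M" and translated[1] in "PSAGT":
        return translated[1:]
    return translated


print(translate("ATGGCTTGATAA"))                  # MA
print(translate("ATGGCTTGATAA", "mycoplasma"))    # MAW
print(mature_without_signal_peptide("MA"))        # A
```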
No. of Amino Acids Calculated directly from the Sequence annotation.
Cys/Met Content Calculated directly from the Sequence annotation. Useful for 35S labeling.
Molecular Weight [Daltons] Calculated directly from the Sequence annotation using average amino acid residue masses. 2D Gel Electrophoresis
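A minimal sketch of a molecular weight calculation from average residue masses. The mass values below are commonly tabulated averages, and the mass of one water molecule is added for the free termini. The exact table BASys uses is not given on this page, so treat the numbers as an assumption.

```python
# Average residue masses in Daltons (commonly tabulated values; assumed, not taken from BASys).
AVERAGE_RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886, "C": 103.1388,
    "E": 129.1155, "Q": 128.1307, "G": 57.0519, "H": 137.1411, "I": 113.1594,
    "L": 113.1594, "K": 128.1741, "M": 131.1926, "F": 147.1766, "P": 97.1167,
    "S": 87.0782, "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER_MASS = 18.01524   # one water for the N- and C-terminal groups


def molecular_weight(sequence: str) -> float:
    """Sum the residue masses and add one water; raises KeyError on unknown residues."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER_MASS


print(round(molecular_weight("G"), 2))   # 75.07, the mass of free glycine
```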
Theoretical pI Calculated directly from the Sequence annotation using an algorithm by David L. Tabb. Pk values are taken from EMBOSS. 2D Gel Electrophoresis
Important Sites This field includes binding sites and active sites. The Important Sites annotation is transitively applied from a similarity (BLASTP) search of the translated query sequence against SwissProt.
Minimum Identity: 100%
Pfam Domain/Function
  1. Transitively applied from a similarity (BLASTP) search of the translated query sequence against SwissProt.
    Minimum E-value: 1e-10
  2. If there is no hit to SwissProt, BASys searches the Pfam database locally using HMMER.
    Minimum Pfam E-value: 1e-10
Protein domain and family identification.
Signals
  1. Transitively applied from a similarity (BLASTP) search of the translated query sequence against SwissProt.
    Minimum Identity: 100%
  2. Signal peptides are predicted with the program PredictSPTM (J. Cruz, unpublished). PredictSPTM uses profile HMMs based on multiple alignments of representative sequences from SignalP [2]. PredictSPTM is part of the Proteus web server for protein structure prediction. Prokaryotic signal peptides are predicted with an accuracy of Q2=95%.
Identifies the protein for translocation
Transmembrane Regions
  1. Transitively applied from a similarity (BLASTP) search of the translated query sequence against SwissProt.
    Minimum Identity: 100%
  2. If there is no hit to SwissProt, or no transmembrane information in the SwissProt record, BASys attempts to predict the transmembrane regions using the PredictSPTM program (J. Cruz, unpublished). Transmembrane regions are predicted using a Naive Bayes classifier trained on a non-redundant set of experimentally determined transmembrane protein sequences from high-resolution structures. PredictSPTM is part of the Proteus web server for protein structure prediction. Transmembrane protein regions are predicted with an accuracy of Q2=92%
Identifies the protein as an integral membrane protein
Secondary Structure
  1. Transitively applied from a similarity (BLASTP) search of the translated query sequence against the PDB database of 3-D biological macromolecular structure data. The secondary structure data is extracted from the 3-D model using VADAR [3].
    Minimum Identity: 95%
  2. If the hit to PDB is under 95% identity but above 35% identity, then BASys performs homology modelling on the query sequence using the PDB structure as a template. The secondary structure data is extracted from the homology model using VADAR.
    Minimum Identity: 35%
  3. If there is no hit above 35% to the PDB database, then BASys uses PSIPRED [4] to predict the secondary structure. PSIPRED achieves an average Q3 score of 80.6% for secondary structure prediction.
Structure elucidation
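The choice between the three secondary structure strategies depends only on the identity of the best PDB hit, which can be sketched as below. The text says "above 35%" while the listed minimum is 35%; this sketch treats exactly 35% as qualifying for homology modelling, which is an assumption. The function name and return strings are hypothetical labels.

```python
def secondary_structure_method(pdb_identity):
    """Pick a strategy from the percent identity of the best PDB hit (None if no hit)."""
    if pdb_identity is not None and pdb_identity >= 95.0:
        return "VADAR on PDB structure"
    if pdb_identity is not None and pdb_identity >= 35.0:
        return "VADAR on homology model"
    return "PSIPRED prediction"


for identity in (98.0, 60.0, 20.0, None):
    print(identity, "->", secondary_structure_method(identity))
```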
PROSITE Motif
  1. Transitively applied from a similarity (BLASTP) search of the translated query sequence against SwissProt.
    Minimum Identity: 100%
  2. If there is no hit to SwissProt, then the translated query sequence is scanned for motifs against the PROSITE [5] database.
Function identification
Specific Function
  1. Transitively applied from a similarity (BLASTP) search of the translated query sequence against SwissProt.
    Minimum Identity: 100%
  2. If there is no hit to SwissProt, BASys attempts to transitively apply the specific function annotation from a similarity (BLASTP) search of the translated query sequence against CCDB.
    Minimum E-value: 1e-10
Metabolic Importance Transitively applied from a similarity (BLASTP) search of the translated query sequence against CCDB.
Minimum E-value: 1e-10
COG Function From a similarity (BLAST) search against the COG database of orthologous groups of proteins [6].
Minimum E-value: 1e-10
Evolutionary classification
COG ID Transitively applied from a similarity (BLAST) search against the COG database of orthologous groups of proteins [6].
Minimum E-value: 1e-10
Cross-reference
Gene Ontology
  1. Transitively applied from a similarity (BLASTP) search of the translated query sequence against SwissProt, if the SwissProt hit contains an InterPro number. BASys maps the InterPro accession to gene ontology number.
    Minimum E-value: 1e-10
  2. If there is no hit to a SwissProt record containing InterPro accessions then the Gene Ontology is transitively applied from a similarity (BLASTP) search of the translated query sequence against CCDB, if the CCDB record contains Gene Ontology information.
    Minimum E-value: 1e-10
A controlled vocabulary describing molecular function, biological process and cellular component.
Cell Location
  1. Transitively applied from a similarity (BLASTP) search of the translated query sequence against SwissProt, if the SwissProt record contains subcellular location information.
    Minimum E-value: 1e-10
  2. If there is no hit to SwissProt, or the SwissProt record does not contain subcellular location information, then BASys performs a keyword search on the Gene Ontology field. If the GO process, function, or component shows hydrolase activity, ribonuclease activity, nucleic acid binding, or RNA binding activity, then BASys assigns the subcellular location as Cytoplasmic.
  3. If (1) and (2) fail to generate an assignment, then BASys performs a keyword search on the Protein Name field. If the protein name is associated with transcriptional activity then the cell location is assigned Cytoplasmic.
  4. If (1-3) fail to generate a subcellular location assignment, the Transmembrane Regions annotation is checked. If transmembrane regions exist for this sequence then the subcellular location is assigned Membrane.
  5. If (1-4) fail to generate a subcellular location assignment, then the subcellular location field is predicted using PSORT-B v.2.0 [7].
  6. If (1-5) fail to generate a subcellular location assignment, then the subcellular location field is assigned Cytoplasmic, based on the observation that approximately 80% of all prokaryotic proteins are cytoplasmic.
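The six steps above form a fallback cascade, sketched below under stated assumptions. The inputs are taken to be already computed by the earlier annotation steps. The keyword used for "transcriptional activity" in step 3 is a placeholder, since this page does not list the actual keywords. All names are hypothetical.

```python
CYTOPLASMIC_GO_KEYWORDS = (
    "hydrolase activity",
    "ribonuclease activity",
    "nucleic acid binding",
    "rna binding",
)


def assign_cell_location(swissprot_location=None, go_terms=(), protein_name="",
                         has_transmembrane_regions=False, psortb_prediction=None,
                         transcription_keywords=("transcription",)):
    """Return the first location produced by steps 1 to 6, in order."""
    if swissprot_location:                                    # step 1
        return swissprot_location
    go_text = " ".join(go_terms).lower()
    if any(keyword in go_text for keyword in CYTOPLASMIC_GO_KEYWORDS):   # step 2
        return "Cytoplasmic"
    name = protein_name.lower()
    if any(keyword in name for keyword in transcription_keywords):       # step 3 (placeholder keywords)
        return "Cytoplasmic"
    if has_transmembrane_regions:                             # step 4
        return "Membrane"
    if psortb_prediction:                                     # step 5
        return psortb_prediction
    return "Cytoplasmic"                                      # step 6: default


print(assign_cell_location(go_terms=["RNA binding"]))              # Cytoplasmic
print(assign_cell_location(has_transmembrane_regions=True))        # Membrane
```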
Similarity Transitively applied from a similarity (BLASTP) search of the translated query sequence against SwissProt, if the SwissProt record contains a SIMILARITY annotation.
Minimum E-value: 1e-10
Homologues Homologues are determined by similarity (BLASTP) search against the H. sapiens, E. coli, S. cerevisiae, Drosophila, and C. elegans model organism databases.
Minimum E-value: 1e-10
Paralogues Paralogues are determined by similarity (BLASTP) search against the proteins identified from the user-supplied chromosome sequence.
Minimum E-value: 1e-10
PDB Accession Transitively applied from a similarity (BLASTP) search of the translated query sequence against the PDB database of 3-D biological macromolecular structure data.
Minimum Identity: 95%
Cross-reference
Resolution Transitively applied from a similarity (BLASTP) search of the translated query sequence against the PDB database of 3-D biological macromolecular structure data.
Minimum Identity: 95%
A measure of 3-D structure quality
Structure CLASS Calculated from the Secondary Structure composition [8]:
All Alpha: Alpha > 40%, Beta < 5%
All Beta: Alpha < 5%, Beta > 40%
Mixed Class: Alpha > 15%, Beta > 15%
Unstructured: Alpha + Beta < 30%
High-level classification of protein structure
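The four thresholds translate directly into a small classifier, sketched below. Some compositions (for example 30% alpha with 10% beta) satisfy none of the four rules; this sketch returns `None` for them, which is an assumption because the page does not say how such cases are labelled.

```python
def structure_class(alpha_percent: float, beta_percent: float):
    """Classify a protein from its percentage of alpha and beta secondary structure."""
    if alpha_percent > 40 and beta_percent < 5:
        return "All Alpha"
    if alpha_percent < 5 and beta_percent > 40:
        return "All Beta"
    if alpha_percent > 15 and beta_percent > 15:
        return "Mixed Class"
    if alpha_percent + beta_percent < 30:
        return "Unstructured"
    return None   # no rule matched


print(structure_class(50, 2))    # All Alpha
print(structure_class(30, 30))   # Mixed Class
```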
Cofactors Transitively applied from a similarity (BLASTP) search of the translated query sequence against CCDB.
Minimum E-value: 1e-10
Protein-associated molecules required for proper protein function.
Metal Ions Transitively applied from a similarity (BLASTP) search of the translated query sequence against CCDB.
Minimum E-value: 1e-10
EC Number Transitively applied from a similarity (BLASTP) search of the translated query sequence against the translated KEGG GENES database of bacterial genomes.
Minimum E-value: 1e-10
Enzyme Classification/Nomenclature
Kcat Value [1/min] Transitively applied from a similarity (BLASTP) search of the translated query sequence against CCDB.
Minimum E-value: 1e-10
The maximal catalytic rate for an enzyme when the substrate is saturating
Specific Activity [micromol/min/mg] Transitively applied from a similarity (BLASTP) search of the translated query sequence against CCDB.
Minimum E-value: 1e-10
The amount of product formed by an enzyme per minute per milligram of enzyme.
Km Value [mM] Transitively applied from a similarity (BLASTP) search of the translated query sequence against CCDB.
Minimum E-value: 1e-10
The Michaelis constant: the substrate concentration at which the reaction rate is half of its maximum under steady-state conditions.
Substrates
  1. Transitively applied from a similarity (BLASTP) search of the translated query sequence against CCDB.
    Minimum E-value: 1e-10
  2. If (1) fails, then BASys maps the EC Number (if one exists for this sequence) to the substrates using an EC:Substrate mapping table from KEGG.
The reactants on which an enzyme acts.
Products
  1. Transitively applied from a similarity (BLASTP) search of the translated query sequence against CCDB.
    Minimum E-value: 1e-10
  2. If (1) fails, then BASys maps the EC Number (if one exists for this sequence) to the products using an EC:Products mapping table from KEGG.
The substance(s) resulting from an enzymatic reaction.
Specific Reaction
  1. Transitively applied from a similarity (BLASTP) search of the translated query sequence against CCDB.
    Minimum E-value: 1e-10
  2. If (1) fails, then BASys maps the EC Number (if one exists for this sequence) to the Specific Reaction using a mapping table from KEGG.
The enzymatic reaction catalyzed by this protein
General Reaction Transitively applied from a similarity (BLASTP) search of the translated query sequence against CCDB.
Minimum E-value: 1e-10
A high-level classification of enzyme function.
Inhibitors Transitively applied from a similarity (BLASTP) search of the translated query sequence against CCDB.
Minimum E-value: 1e-10
Molecules that interact with the enzyme and prevent its normal functioning.
Copy Number Transitively applied from a similarity (BLASTP) search of the translated query sequence against CCDB.
Minimum E-value: 1e-10
From expression studies in E. coli. Many uses, including correlation with cellular requirements and cell state.
Structure Determination Priority Calculated locally based on whether there is a solved structure in PDB, a structure in progress in TargetDB, the sequence molecular weight, and the presence of transmembrane helices. The priority ranges from 1 (lowest) to 10 (highest). A priority index for solving the protein structure
TargetDB Status Transitively applied from a similarity (BLASTP) search of the translated query sequence against the TargetDB registration database of protein sequences undergoing structure determination.
Minimum E-value: 1e-10
TargetDB is a target registration database for monitoring the progress of the solution of protein structures.
Availability Transitively applied from a similarity (BLASTP) search of the translated query sequence against the TargetDB registration database of protein sequences undergoing structure determination.
Minimum Identity: 100%
The location of the structural genomics center possessing the plasmid, clone, or protein.
References Transitively applied from a similarity (BLASTP) search of the translated query sequence against SwissProt.
Minimum E-value: 1e-10


References
  1. Salgado H, Moreno-Hagelsieb G, Smith TF, Collado-Vides J. (2000) Operons in Escherichia coli: genomic analyses and predictions. Proc Natl Acad Sci U S A., 97:6652-6657.
  2. Bendtsen JD, Nielsen H, von Heijne G, Brunak S. (2004) Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol., 340:783-795.
  3. Willard L, Ranjan A, Zhang H, Monzavi H, Boyko RF, Sykes BD, Wishart DS. (2003) VADAR: a web server for quantitative evaluation of protein structure quality. Nucleic Acids Res. 31:3316-3319.
  4. McGuffin LJ, Bryson K, Jones DT. (2000) The PSIPRED protein structure prediction server. Bioinformatics 16:404-405.
  5. Hulo N, Sigrist CJ, Le Saux V, Langendijk-Genevaux PS, Bordoli L, Gattiker A, De Castro E, Bucher P, Bairoch A. (2004) Recent improvements to the PROSITE database. Nucleic Acids Res. 32:D134-137.
  6. Tatusov RL, Koonin EV, Lipman DJ. (1997) A genomic perspective on protein families. Science 278:631-637.
  7. Gardy JL, Laird MR, Chen F, Rey S, Walsh CJ, Ester M, Brinkman FS. (2005) PSORTb v.2.0: Expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics 21:617-23.
  8. Gromiha MM, Selvaraj S. (1998) Protein secondary structure prediction in different structural classes. Protein Engineering 11:249-251.