
Venkata Yedida, Chien-Chung Chan, Zhong-Hui Duan, Department of Computer Science, University of Akron
The human genome project and numerous other genome projects have produced a large and ever-increasing amount of sequence data. One of the main research challenges in the post-genomic era is to understand the relationship between the nucleotide sequences of genes and the functions of the proteins they encode. In this study, we develop an automated protein function prediction system that is based on a set of homologous proteins and gene ontology categories. A novel measure based on a set of best local alignments is used to identify the homologues. The biological functions of the homologous proteins are characterized with gene ontology annotations. The protein function prediction is performed based on a data mining model using decision trees. The model was trained and tested using the complete proteome of the model organism yeast. We show that the decision tree model is fairly easy to implement and analyze and can be used as an effective tool for protein function prediction. We present the accuracy and stability of the decision tree model for yeast protein function prediction.
Mukta Phatak1, Baoqiang Cao2, Michael Wagner3, Jarosław Meller4,5
1Department of Biomedical Engineering, University of Cincinnati, 2University of Nebraska-Lincoln, 3Division of Biomedical Informatics, Cincinnati Children’s Hospital Research Foundation, 4 Department of Environmental Health, University of Cincinnati College of Medicine, 5Department of Informatics, Nicholas Copernicus University, Poland
Predicting the 3D structure of a protein from its amino acid sequence is a multi-step process and remains one of the main challenges in computational biology. One important intermediate step towards that bigger goal is the prediction of structural attributes such as secondary structure, relative solvent accessibility (RSA) and residue contact number. These attributes can be used to facilitate protein structure prediction and subsequent functional annotations. Previously, we developed novel methods for RSA prediction using both neural network based and support vector (SVR) based regression approaches.
Here, we propose to extend these efforts to membrane proteins. In particular, we develop novel methods to predict the relative lipid accessibility (RLA) of an amino acid residue in a membrane domain, which represents the lipid exposed surface area of that residue in relative terms. In analogy to RSA prediction in soluble proteins, the problem of predicting RLA from the amino acid sequence can be cast as a regression problem and solved using machine learning techniques. The critical difference between soluble and membrane proteins, which makes the latter significantly more challenging, is the relatively small number of high resolution structures, from which to learn. It is thus essential to carefully design and evaluate compact representations and simple models for RLA prediction. In this work, we use low-complexity Support Vector Regression (SVR) approaches that are suitable for training on the limited number of structurally resolved membrane proteins. Moreover, we develop flexible SVR-based models to represent the uncertainty of RLA assignments for residues at the membrane-water interfaces. Using cross-validation on a non-redundant set of alpha-helical membrane domains, we estimate our methods yield correlation coefficients between the observed and predicted RLAs of about 0.5. We conclude that RLA prediction methods are already showing promise towards further applications to structure prediction and identification of membrane domain interactions.
Gregory A. C. Singer, Jiejun Wu, Pearlly Yan, Christoph Plass, Tim H.-M. Huang, and Ramana V. Davuluri
Although examples of multiple promoter genes have been known for over a decade, the one-gene-one-promoter model still dominates, despite the fact that many independent lines of evidence show that alternative promoter usage is very common in the human genome. For example, projects like ECgene and Acembly that have undertaken the massive task of aligning all sequenced ESTs to the human genome suggest that more than half of human genes have more than one transcription start site (TSS). Corroborating evidence was recently provided by the Riken group, who performed cap analysis of gene expression (CAGE), generating millions of ~20mer tags from the 5' ends of mRNAs. When mapped back to the genome, these tags indicated the presence of thousands of previously unknown TSSs. Even among the UCSC Known Genes (a set of high quality gene annotations), 28% of human genes have more than one TSS, and over a thousand genes have more than three annotated TSSs.
Each promoter, especially those separated by hundreds of bases, possesses its own core promoter elements, transcription factor binding sites, and epigenetic environment (including histone modifications and CpG methylation). Therefore, the promoters can act quite independently of each other, and understanding which promoter is employed in which cellular condition is key to unraveling gene regulatory networks within the cell. To this end, we have annotated all putative promoters in the human genome by integrating ab initio promoter predictions using the program FirstEF, UCSC Known Gene annotations, and CAGE tag evidence. We then designed a custom genome tiling microarray platform that uses 244,000 probes to cover roughly 35,000 putative promoters from a subset of 7,000 genes in the human genome. To demonstrate the utility of this platform, we have analyzed the pattern of promoter usage in the heavily studied MCF7 breast cancer cell line in both control and estradiol-treated conditions. Many promoters that were previously considered putative were found to be active, suggesting that a large number of promoters in the human genome remain undiscovered. These novel promoters were found to occur throughout the length of the gene, from more upstream than the current most 5' annotated TSS all the way to the 3'-UTR. Clearly, many of these isoforms encode truncated proteins or non-coding RNAs. The role these strange isoforms play within the cell is still unknown, but most intriguingly we found a strong tendency for the downstream promoter in E2-sensitive multiple promoter genes to be close to the 3'-terminus of the gene sequence. We hypothesize that these 3'-located promoters may encode small interfering RNAs, or may simply act to block progression of the RNA polymerase II complex initiated from a more upstream promoter.
Hatice Gulcin Ozer, Biophysics Graduate Program, The Ohio State University; William C. Ray, Children's Research Institute and The Department of Pediatrics, The Ohio State University
Predicting physical distances between amino acids in protein alignments provides invaluable information towards anticipation of their complete 3-dimensional structure. Extracting constraints using only sequence information is an indispensable direction, since the number of known protein or nucleic acid sequences grows much faster than the number of known 3-dimensional structures. Detecting interpositional dependencies within the multiple sequence alignments of protein families and understanding their physical consequences will be a big step in this direction.
In studying positional dependencies, we observed that dependencies are often the result of physical proximity. Since physicochemical interactions between many identities in the biomolecule are involved in proper folding and functioning, dependencies are expected amongst some positions. Therefore, identification of statistically significant interpositional dependencies within a family alignment will further assist researchers in determining constraints on family structure.
In this study, we examined the critical parameters of interpositional dependencies, also called pairwise correlations, to estimate structurally important residues for family alignments.
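One common way to score an interpositional dependency is the mutual information between two alignment columns. The sketch below is illustrative only: the abstract does not say which statistic the authors use, and the function names and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log2

def column(alignment, i):
    """Return the residues found at position i across all aligned sequences."""
    return [seq[i] for seq in alignment]

def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(alignment)
    col_i = column(alignment, i)
    col_j = column(alignment, j)
    pair_counts = Counter(zip(col_i, col_j))
    counts_i = Counter(col_i)
    counts_j = Counter(col_j)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((counts_i[a] / n) * (counts_j[b] / n)))
    return mi

# Hypothetical toy alignment: columns 0 and 1 co-vary perfectly,
# while column 2 varies independently of column 0.
toy_alignment = ["AKG", "AKT", "DRG", "DRT"]
mi_coupled = mutual_information(toy_alignment, 0, 1)
mi_uncoupled = mutual_information(toy_alignment, 0, 2)
```

A high score for a column pair is only a candidate dependency; with few sequences, a significance estimate (for example by shuffling one column) would still be needed before treating it as a structural constraint.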
Lonnie Welch1,2,3, Eric Petri1, Dazhang Gu1, Klaus Ecker1
1School of Electrical Engineering and Computer Science, 2Biomedical Engineering Program, 3Molecular and Cellular Biology Program, Ohio University
The purposes of most genomic information are unknown. This limits our ability to understand and address problems that have genetic causes. Does the ‘junk’ portion of genomes have biological meaning? If so, what is the meaning? What are the biological words, phrases, grammar, etc.? The answers will lead to a more complete understanding of the purpose of the genome and the functions of undiscovered genomic elements. This knowledge will help to cure problems that are due to genetic causes. We have implemented a Word Seeker tool as illustrated in two data flow diagrams below. Using suffix tree and Teiresias algorithms, the tool discovered elements in Arabidopsis (a model plant genome) that occur with unexpected frequencies in the ‘junk’ portion of the genome, and they are found to be statistically overrepresented. Such elements may form biological words, phrases, and grammar which have biological functions. As one biologist put it, there is no junk DNA.
Sudhindra R. Gadagkar Department of Biology, University of Dayton, Dayton, OH 45469-2320
The non-coding component (approximately 97 percent) of the human genome has been receiving much attention of late for discovering elements that play an important role in transcriptional regulation and other non-coding functions such as DNA replication, chromosome condensation and pairing of chromosomes. Most studies have attempted to identify functionally important non-coding DNA sequences by assessing the conservation of the sequences across genomes and equating functional significance with evolutionary conservation. In this paper we focus on predicting cis-regulatory elements or parts thereof. The process of transcriptional regulation is very complex, involving the precise interaction among dozens of proteins (transcription factors) and between the proteins and non-coding DNA in the vicinity of the protein-coding gene. The production and binding of the initial transcription factors to the DNA near the gene and the subsequent cascade of reactions that takes place in the vicinity of the gene constitute the trans and cis parts, respectively, of gene regulation in eukaryotes. While still inferring functional significance from sequence conservation, however, our approach differs from those of other studies in that our main focus is the extent of conservation within a single genome. We determine the number of times a given DNA motif is repeated in the genome and compare this with the expected number of repeats based on statistical expectations from a non-functional random DNA sequence. This comparison allows us to determine if a given motif is over- or under-represented in the genome or if it is found at a frequency dictated by simple statistical expectations. 
Relevance to cis-regulation is then more strongly suggested if a given motif is over-represented, under-represented, or at statistical expectation in sequences in the vicinity of protein-coding genes (primarily upstream of the transcription start site, in the so-called promoter region that typically harbors most cis-regulatory elements, but also in intron sequences and downstream of the 3' UTR). This approach was validated in two ways: by analyzing known transcription factor binding sites (cis-regulatory elements) and by comparing the promoter region of the human and mouse orthologs of a gene (1) known to be differentially expressed between the two genomes, and (2) known to be expressed similarly between the genomes. The approach was then used to analyze all 1024 possible variants of a 5-mer sequence within the human genome and make a list of the motifs that are over- or under-represented at various levels. Interestingly, an unexpectedly large percentage of the motifs was found to share the same level of conservation across genomes, thus strengthening the already suggestive results from single-genome analyses. For example, most of the motifs (85 percent) found to be very highly under-represented in the human genome were also found to be in the very highly under-represented category in Fugu. The approach used in this study allows us to make reasonable inferences about the functional significance of very small non-coding DNA motifs (≤ 5 bases) in a genome without the need for alignments across genomes. Rather, the analysis can be largely confined to within genomes, a simpler and easier approach.
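The observed-versus-expected comparison described above can be sketched as follows. This is a minimal illustration under a stated assumption: the random model draws bases independently with the base frequencies of the sequence itself, which the abstract does not specify. All function names and the toy sequence are hypothetical.

```python
from collections import Counter
from itertools import product

def observed_counts(sequence, k):
    """Count every overlapping k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def expected_count(motif, sequence):
    """Expected occurrences of motif if bases were drawn independently
    with the base frequencies of the sequence (an assumed null model)."""
    base_counts = Counter(sequence)
    total = len(sequence)
    prob = 1.0
    for base in motif:
        prob *= base_counts[base] / total
    return prob * (total - len(motif) + 1)

def representation_ratio(motif, sequence):
    """Observed/expected ratio: above 1 suggests over-representation,
    below 1 suggests under-representation."""
    observed = observed_counts(sequence, len(motif))[motif]
    expected = expected_count(motif, sequence)
    return observed / expected if expected > 0 else float("nan")

# Every possible 5-mer over the DNA alphabet: 4**5 = 1024 motifs.
all_5mers = ["".join(p) for p in product("ACGT", repeat=5)]

# Hypothetical toy sequence in which ACGT is deliberately repeated.
toy_sequence = "ACGTACGTACGTTTTTACGT"
acgt_ratio = representation_ratio("ACGT", toy_sequence)
```

On a real genome the ratio alone is not enough; a significance threshold for the different representation levels would have to be chosen separately.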
Vivek Kaimal1,3, Anil G Jegga2,3, Bruce J Aronow1,2,3
Departments of Biomedical Engineering1 and Pediatrics2, University of Cincinnati and Division of Biomedical Informatics3, Cincinnati Children’s Hospital Medical Center, Cincinnati OH
Major insights into the understanding of gene regulation at the post-transcriptional level have been provided by the discovery of short ~22nt RNA sequences called microRNAs (miRNAs). Hundreds of miRNAs have since been identified in various species, and it is estimated that about 30% of total human genes are subject to miRNA-mediated regulation. Additionally, these miRNAs have been shown to exhibit tissue-specific, cell-type specific, and developmental stage specific expression, and their roles in disease and cancer are also documented. miRNAs typically act through their target genes; each miRNA is capable of regulating several genes, and a single gene can be regulated by several different miRNAs. Thus, identifying “true” miRNA targets continues to be a major challenge. In the current study we focus on the role of miRNAs in development and the putative target genes associated with development (based on Gene Ontology annotations). The predicted miRNA regulators for the development-associated genes were compiled using the five most widely utilized miRNA-gene target sources. The resulting binary data (miRNA vs. target gene) was subjected to biclique analysis. Additionally, a method to assign statistical significance to the identified clusters based on correlation between expression vectors is explored. Preliminary results using kidney-specific miRNA (mir-194, mir-192, mir-215 and mir-204) targets showed differential expression during kidney developmental stages. Interestingly, all of these renal miRNAs are reported to be deregulated in renal cell carcinoma. We anticipate that our integrative bioinformatics analysis of miRNAs and gene targets along with their expression data during development and disease will further elucidate the complex regulatory mechanisms executed by these micromanagers of gene expression and open new avenues for diagnostic and therapeutic opportunities.
Amit U Sinha1, Raj Bhatnagar1, Anil G Jegga2,3
1Department of Computer Science, 2Department of Pediatrics, University of Cincinnati, 3Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center
The master regulator p53 tumor-suppressor protein and its downstream genes comprise an intricate gene network. Through coordination of several downstream target genes and upstream transcription factors, p53 is pivotal to a variety of biological functions including apoptosis, cell cycle, DNA damage response, differentiation, angiogenesis and cellular senescence. Additionally, p53 mutations are associated with nearly half of human cancers. Likewise, microRNAs (miRNAs), the recently discovered large family of regulatory RNAs that repress target genes at the post-transcriptional level, have been implicated as having regulatory involvement in a multitude of biological pathways. It is estimated that at least 30% of all human genes are regulated by miRNAs. In addition to roles in normal development, miRNAs are also implicated in a range of human diseases, including cancer. Hypothesizing that p53 mediates miRNA expression and that the p53 transcriptional regulatory network is modulated by miRNAs, we used computational approaches to investigate the direct relationship between miRNAs and the p53 regulatory network. Specifically, using a customized bioinformatics approach we (a) scanned the miRNA putative promoter regions for conserved p53-binding sites and (b) identified the common putative miRNA regulators of p53 upstream and downstream candidate genes. We strongly believe that elucidation of the miRNA-p53 axis will not only aid in better understanding of the p53 master regulatory network but will further unravel the clinical relevance of miRNAs, especially in cancer.
Hao Sun, Francisco Agosto, George Calin, Ramana V. Davuluri, Human Cancer Genetics Program, Department of Molecular Virology, Immunology, and Medical Genetics, Comprehensive Cancer Center, The Ohio State University, Columbus, OH
MicroRNAs (miRNAs) are a large family of short 20-25nt single-stranded non-coding RNAs recently identified in many eukaryotes, which play an important role in gene regulation. However, the regulation and expression of miRNA genes themselves are not well understood because of the lack of precise annotation of miRNA gene promoters. A crucial component in the analysis of a miRNA promoter region is the accurate identification of the Transcription Start Site (TSS). In animals, the miRNA primary transcript is rapidly cleaved in the nucleus by the enzyme Drosha, and this presents a technical barrier for the large-scale experimental identification of TSSs. It is also unreliable to infer the promoter region of miRNA genes by directly mapping the precursor miRNA to the genome because the sequence data of known primary transcripts indicate that the TSS may be as close as 50 nt and as far as 2.5 kb upstream of the first precursor miRNA contained within the miRNA primary transcript. In order to identify the promoter regions of miRNAs, we have developed an approach that integrates Cap Analysis Gene Expression (CAGE) sequence tags, an in silico promoter prediction program (FirstEF), and a comparative genomics method to computationally predict miRNA gene TSSs for 105 pairs of orthologous miRNA genes (gene clusters) between human and mouse. We predicted 980 TSSs that could be paired as orthologous genes between human and mouse for comprehensive comparative studies. The sequence identity around the upstream predicted TSS (-2K ~ 250 bp) between human and mouse orthologous genes is ~50% on average, and the core promoter regions of orthologous genes are also very well conserved. We also identify many putative transcription factor binding sites that potentially regulate the transcription of microRNA genes and are conserved between human and mouse orthologous miRNA genes.
The data resource created in this work and the results of promoter sequence analysis should lay the firm foundation for deciphering the transcriptional modulations of human and mouse microRNA genes. All the data are deposited and made available through MPromDb for comparative studies.
Rupak Mukhopadhyay, Partho Sarothi Ray, Abul Arif, and Paul L. Fox Department of Cell Biology, Lerner Research Institute, Cleveland Clinic, Cleveland, OH
Interferon (IFN)-γ induces rapid transcription of multiple inflammatory genes in monocytic cells. Ceruloplasmin (Cp), an IFN-γ-induced pro-inflammatory gene, is subject to delayed translational silencing which limits its expression. Translational silencing is directed by the heterotetrameric IFN-Gamma Activated Inhibitor of Translation (GAIT) complex consisting of ribosomal protein L13a, Glu-Pro tRNA synthetase, NSAP1, and GAPDH [1,2]. The GAIT complex forms about 16 h after IFN treatment, binds the bipartite stem-loop GAIT element in the 3'-UTR of Cp mRNA, and blocks initiation of translation [3]. Phosphorylation of L13a at Ser77, and its subsequent release from the 60S ribosomal subunit, is rate-limiting for GAIT pathway activation since it coincides temporally with GAIT complex formation and translation inhibition. We used complementary bioinformatic and microarray-based approaches to find whether the GAIT complex co-ordinately regulates translational repression of a family of genes, thereby constituting a post-transcriptional operon. RNA structural pattern-matching of 3'-UTR databases and riboimmunoprecipitation-microarray (RIP-CHIP) analysis suggest multiple targets for GAIT-mediated translational silencing. We have verified that VEGF, an angiogenic factor induced by inflammation, is a member of this post-transcriptional operon [3]. The same analyses identified two related Ser/Thr kinases, zipper-interacting protein kinase (ZIPK) and death-associated protein kinase (DAPK), as putative members of the GAIT-mediated operon; remarkably, these kinases were identified as candidate L13a kinases. Phosphorylation of L13a by ZIPK has been confirmed using biochemical and genetic analyses. ZIPK is a downstream target of DAPK, and together they form a kinase cascade. We have shown that both kinases have functional GAIT elements and are translationally repressed by the GAIT complex.
Thus, the GAIT system defines a unique autoregulatory post-transcriptional operon that initiates a delayed negative feedback circuit limiting its own activity. This negative auto-regulatory network may restore the cell to the basal state to permit reactivation of inflammatory gene expression.
X. Liu1,2, W. J. Jessen2, S. Sivaganesan3, B. J. Aronow2, and M. Medvedovic1,2*
1 Department of Environmental Health, University of Cincinnati, Cincinnati, Ohio; 2 Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio; 3 Mathematical Sciences Department, University of Cincinnati, Cincinnati, OH
Background
Transcriptional modules (TM) consist of groups of co-regulated genes and transcription factors (TF) regulating their expression[1]. Two high-throughput (HT) experimental technologies, gene expression microarrays[2] and Chromatin Immuno-Precipitation[3] on Chip (ChIP-chip), are capable of producing data informative about expression regulatory mechanism on a genome scale. The optimal approach to joint modeling of data generated by these two complementary biological assays, with the goal of identifying and characterizing TMs, is an important open problem in computational biomedicine.
Results
We developed and validated a novel probabilistic model and related computational procedures for identifying TMs by jointly modeling gene expression and ChIP-chip binding data. We demonstrate an improved functional coherence of the TMs produced by the new method when compared to either analyzing expression or ChIP-chip data separately or to alternative approaches for joint analysis[4-6]. We also demonstrate the ability of the new algorithm to identify novel regulatory relationships not revealed by ChIP-chip data alone. The new computational procedure can be used in more or less the same way as one would use simple hierarchical clustering without performing any special transformation of data prior to the analysis. The R and C source code for implementing our algorithm is incorporated within the R package gimmR which is freely available at http://eh3.uc.edu/gimm.
Conclusions
Our results indicate that, whenever available, ChIP-chip and expression data should be analyzed within the unified probabilistic modeling framework, which will likely result in improved clusters of co-regulated genes and improved ability to detect meaningful regulatory relationships. Given the good statistical properties and the ease of use, the new computational procedure offers a worthy new tool for reconstructing transcriptional regulatory networks.
1. Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N: Revealing modular organization in the yeast transcriptional network. Nat Genet 2002, 31(4):370-377.
2. Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270(5235):467-470.
3. Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E et al: Genome-wide location and function of DNA binding proteins. Science 2000, 290(5500):2306-2309.
4. Tanay A, Sharan R, Kupiec M, Shamir R: Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc Natl Acad Sci U S A 2004, 101(9):2981-2986.
5. Bar-Joseph Z, Gerber GK, Lee TI, Rinaldi NJ, Yoo JY, Robert F, Gordon DB, Fraenkel E, Jaakkola TS, Young RA et al: Computational discovery of gene modules and regulatory networks. Nat Biotechnol 2003, 21(11):1337-1342.
6. Lemmens K, Dhollander T, De Bie T, Monsieurs P, Engelen K, Smets B, Winderickx J, De Moor B, Marchal K: Inferring transcriptional modules from ChIP-chip, motif and microarray data. Genome Biol 2006, 7(5):R37.
Nuclear magnetic resonance (NMR) spectroscopy is a non-invasive method of acquiring a metabolic profile from biofluids. Identifying biomarkers from these profiles may provide keys to the early detection of exposure to a toxin. Two common features of NMR data sets are small sample size and a large number of variables (i.e. high dimensionality). The high dimensionality arises from each sample spectrum being divided into a large number of regions, each of which is a dimension. Pattern recognition techniques can then be used to identify biomarkers from a data set that consists of metabolic profiles from a small number of samples. A typical first step of this analysis is to individually identify responsive spectral regions, followed by associating these regions with metabolites and biomarkers. In this paper, we evaluate several common alternatives to identify responsive regions, including the fold test, paired t-test, and logistic regression. Further, when performing these types of analyses, the issues of multiple-comparisons and false positive rates must be addressed. We compare several corrections for these issues including the Bonferroni, Holm’s, Westfall and Young, permutation, and bootstrap methods. The results of these statistical tests in combination with the multiple-comparison corrections were compared on both a simulated data set and an NMR-derived toxicology data set. Based on these results, we present a statistical protocol for determining putative biomarkers, designed to mitigate the low sample size, high dimensionality, and false positive issues associated with NMR data.
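Two of the multiple-comparison corrections named above, Bonferroni and Holm, are simple enough to sketch directly. The code below is an illustration only, not the protocol from the paper; the p-values stand in for per-region test results and are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject the null hypothesis for each test whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: compare the i-th smallest p-value
    (i = 0, 1, ...) with alpha / (m - i) and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is also retained
    return reject

# Hypothetical p-values for four spectral regions.
region_p_values = [0.001, 0.015, 0.04, 0.30]
bonferroni_calls = bonferroni(region_p_values)
holm_calls = holm(region_p_values)
```

With these values Bonferroni flags only the first region, while Holm also flags the second, which shows why Holm is never less powerful than Bonferroni at the same family-wise error rate.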
Dr. Valerie V. Cross and Yi Sun, Miami University, Oxford OH
Today, with the increasing development of computational biology, various large databases have been built to describe genomic information and the experimental data used. To guarantee the consistency of the referenced biological concepts in different databases, the Gene Ontology (GO) developed by the Gene Ontology Consortium [1] describes biological concepts and their relationships in a species-independent manner. Biologists have been using the GO terms to annotate genes in various databases. These annotations create a mapping between the GO and gene products. These annotations are being used in determining the similarity between genes and gene products, an important task in post-genomics study. For example, gene similarity measurement is used in validating high-throughput protein interaction data [2], aiding the creation of new pathway modeling tools and clustering methods [3], and facilitating the detection of functionally related gene products independent of homology [4].
Numerous tools have been developed to analyze gene product data. For example, GeneInfoViz [5] is a web-based tool used to retrieve gene information and to construct and visualize gene relation networks based on the GO. The Gene Ontology Categorizer [6] uses the GO terminology to summarize or categorize an input set of genes. A goal of this research is to develop a system for QUerying with Ontological Terminologies and their Annotations (QUOTA) [7] that is applicable to all domains having an ontology of annotating terms and files or databases of annotated objects. In this presentation a brief overview of ontologies with the Gene Ontology used as a concrete example is provided. Since a central component of QUOTA is the measurement of similarity between the annotated objects, the variations on similarity with respect to fuzzy set theory, ontological semantics, and fuzzy measure theory are next described with an initial experiment comparing the similarity of 21 annotated gene products [8] using several combinations of the QUOTA similarity components. Then a synopsis of the current querying capabilities of QUOTA is presented. The presentation concludes with a discussion of current and future research plans and solicits ideas for collaborative research on how to adapt or enhance QUOTA for widespread use in computational biology.
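To make the idea of annotation-based similarity concrete, the sketch below computes a plain set-based (Jaccard) similarity between two gene products after extending each annotation set with its ancestor terms. This is not the QUOTA measure, whose components are not detailed here; the toy ontology, term names, and function names are all hypothetical.

```python
def with_ancestors(terms, parents):
    """Close a set of ontology terms under the child -> parent relation."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed

def jaccard_similarity(terms_a, terms_b, parents):
    """Shared terms divided by all terms, after adding ancestors to both sets."""
    a = with_ancestors(terms_a, parents)
    b = with_ancestors(terms_b, parents)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical toy ontology (child -> parents); these are not real GO identifiers.
toy_parents = {
    "kinase activity": ["catalytic activity"],
    "phosphatase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "binding": ["molecular function"],
}
sim_related = jaccard_similarity({"kinase activity"}, {"phosphatase activity"}, toy_parents)
sim_distant = jaccard_similarity({"kinase activity"}, {"binding"}, toy_parents)
```

Fuzzy-set and fuzzy-measure variants replace the equal weight given to every term here with term-specific weights, for example ones derived from how informative a term is.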
[1] Gene Ontology Consortium, http://www.geneontology.org/
[2] X. Guo, C. D. Shriver, H. Hu, M.N. Liebman, “Semantic similarity based validation of human protein-protein interactions,” Proc. Computational Systems Bioinformatics Conference, pp. 149-150, 2005.
[3] M. Popescu, J. Keller, J. Mitchell, and J. Bezdek, “Functional Summarization of Gene Product Clusters Using Gene Ontology Similarity Measures”, Proc. Int. Conference on Intelligent Sensors, Sensor Networks and Information Processing, Melbourne, Australia, December, 2004, pp. 553-559.
[4] F. Azuaje, H. Wang, and O. Bodenreider, “Ontology-driven similarity approaches to supporting gene functional assessment,” In Proceedings of the ISMB’2005 SIG meeting on Bio-ontologies 2005:9-10.
[5] M. Zhou and Y. Cui, “GeneInfoViz: constructing and visualizing gene relation networks,” In Silico Biol. 4(3):323-33, 2004.
[6] C. A. Joslyn, S. Mniszewski, A. Fulmer, G. Heaton, “The Gene Ontology Categorizer,” Bioinformatics. Aug 4;20 Suppl 1:I169-I177, 2004.
[7] Y. Sun, “Querying with Ontological Terminologies and their Annotations,” Masters Thesis May, 2007, Computer Science and Systems Analysis, Miami University, Oxford, OH.
[8] M. Popescu, J. Keller, and J. Mitchell, “Fuzzy Measures on the Gene Ontology for Gene Product Similarity,” IEEE/ACM Transactions on computational biology and bioinformatics, vol. 3, no. 3, pp. 263-274, July/Sept 2006.
Rachana Jain, Department of Biomedical Engineering, University of Cincinnati
Michael Wagner, Division of Biomedical Informatics, Cincinnati Children’s Hospital Research Foundation
Peptide Mass Fingerprinting (PMF) has increasingly gained acceptance as a primary and fast method for protein identification since the early 1990s. PMF is based on the principle that masses of the constituent peptides of a protein may provide a unique fingerprint/map which can be used to identify the protein by comparison with a database of theoretical protein digests. However, various factors, such as the presence of contaminants in the sample, limited databases and post-translational modifications of proteins complicate the task for PMF and limit its success as a protein identification method.
The crucial ingredient in PMF methods is the definition of a scoring function which can accurately distinguish between random hits and true positives. Current database search tools such as MASCOT and ProFound use the number of matches (hits) between experimentally determined peptide masses and the theoretical digest of a database protein as the primary parameter in their scoring functions. Our work focuses on systematically evaluating a number of quality measures (some of which are novel) that measure the degree to which an experimental peak list matches a theoretical digest.
One novel quality measure we investigate here is based on the non-parametric Kolmogorov-Smirnov test. We propose finding the peptide in the theoretical digest that is closest in mass for each mass spectral peak. We then compare the resulting cumulative mass error distribution to a background distribution of false-positive proteins of similar size and compute the one-sided non-parametric Kolmogorov-Smirnov (KS) statistic as a score to indicate how different this distribution is from a random distribution.
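The scoring idea in this paragraph can be sketched in a few lines. The code is an illustration under assumptions, not the authors' implementation: the direction of the one-sided statistic (observed error distribution lying above the background) is assumed, and the peak list, digest masses, and background errors are invented.

```python
from bisect import bisect_right

def nearest_mass_errors(peaks, digest_masses):
    """Absolute error between each peak and the closest theoretical peptide mass."""
    return [min(abs(peak - mass) for mass in digest_masses) for peak in peaks]

def ecdf(sorted_sample, x):
    """Empirical cumulative distribution function of a sorted sample at x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)

def one_sided_ks(errors, background_errors):
    """D+ = max over x of F_errors(x) - F_background(x).
    Large values mean the errors are concentrated at smaller values
    than the background errors."""
    a = sorted(errors)
    b = sorted(background_errors)
    return max(ecdf(a, x) - ecdf(b, x) for x in a + b)

# Hypothetical data: three peaks that all fall close to a theoretical peptide,
# and mass errors pooled from false-positive proteins as the background.
toy_peaks = [1000.5, 1500.2, 2000.9]
toy_digest = [1000.4, 1500.3, 2001.0, 2500.0]
toy_background = [0.4, 0.9, 1.5, 2.2]

toy_errors = nearest_mass_errors(toy_peaks, toy_digest)
ks_score = one_sided_ks(toy_errors, toy_background)
```

Because the statistic uses the whole error distribution, it does not depend on choosing a mass tolerance, unlike a simple hit count.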
Using publicly available curated PMF datasets from yeast, we compared the relative performance of the KS score to the simpler statistic of the number of hits given a particular mass tolerance. KS ranked 266 of 313 proteins correctly at the top 1 position, when searched against a database of 3795 non-redundant proteins, outperforming all other quality measures. By comparison, the score based on the number of peptide matches only identifies 198 proteins correctly. Furthermore, decision trees trained on the same data sets consistently identified the KS score as the feature with the maximum information gain.
These results, while still preliminary and on a limited dataset, demonstrate that the KS test outperforms traditional measures in identifying the correct protein. We propose that the KS score, especially when coupled with other features and machine-learning type algorithms (which we are currently exploring), has the potential of improving upon the current state of protein identity prediction using PMF. Furthermore, we note that the methodology is extensible to MS/MS data.
Alex Kloft, Yuansheng Liu, Lin Liu and Chun Liang, Department of Botany, Miami University, Oxford
dbEST is the most rapidly growing database dedicated to expressed sequence tag (EST) sequences. As of May 25, 2007, 43,342,964 entries had been deposited in dbEST, covering 1,320 different species of model or non-model organisms. While EST data are widely used in many genome characterization approaches, including gene discovery, gene expression profiling, polymorphism detection and genomic sequence annotation, they pose the most serious challenge in data veracity. Due to imperfections in molecular biology manipulation during cDNA library construction and errors in sequencing procedures, it is estimated that EST sequences carry base ambiguity rates of about 3% as well as spurious sequence contamination. For a long time, no bioinformatics program had been developed to explore systematically and comprehensively the data abnormality in the enormous, ever-growing dbEST data.
Recently, we published WebTraceMiner (http://www.conifergdb.org/software/wtm), a unique public web service for processing and mining raw EST sequencer trace files that focuses on the sequence features characterizing and annotating the 3′ and/or 5′ termini of cDNA inserts. Using WebTraceMiner, we reprocessed 172,229 loblolly pine EST trace files downloaded from the NCBI Trace Archive and created the ConiferEST database (http://www.conifergdb.org/coniferEST.php), the first public EST resource that allows biologists to explore both the complexity and abnormality of ESTs in terms of terminus structures. It is clear to us that terminus determination is important for quality control and validation of error-prone ESTs, and that double-termini adapters appear to be good indicators of EST chimeras. In this study, we extend our research to the whole of dbEST, based on the assumption that many cDNA libraries adopted the same or a similar construction protocol using EcoRI and XhoI restriction sites. Among all 43,342,964 entries, we detected about 0.72% of sequence reads that have a 5′ terminus in the sense strand (5TSS), a 3′ terminus in the sense strand (3TSS, containing a polyA tail), a 5′ terminus in the non-sense strand (5TNS, containing a polyT tail) or a 3′ terminus in the non-sense strand (3TNS) in perfectly matching patterns, while about 0.13% of sequence reads have double-termini adapters. If one base error is allowed in pattern matching, these figures rise to 2.16% and 0.17%, respectively.
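The difference between perfect matching and matching with one base error can be sketched as a simple Hamming-distance scan. The abstract does not give the actual terminus patterns, so the pattern below (a run of ten A's followed by the XhoI recognition site CTCGAG) is a hypothetical stand-in for a 3TSS signature, and the function names are illustrative only.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_pattern(read, pattern, max_mismatches=0):
    """Start positions where pattern occurs in read with at most
    max_mismatches substitutions (no insertions or deletions)."""
    m = len(pattern)
    return [
        i
        for i in range(len(read) - m + 1)
        if hamming(read[i:i + m], pattern) <= max_mismatches
    ]


# Hypothetical 3TSS-like pattern: polyA tail followed by an XhoI site.
PATTERN_3TSS = "A" * 10 + "CTCGAG"

read_exact = "GGTC" + "A" * 10 + "CTCGAG" + "TTGC"
read_one_error = "GGTC" + "A" * 10 + "CTCGTG" + "TTGC"  # one substitution in the site

exact_hits = find_pattern(read_exact, PATTERN_3TSS)
strict_hits = find_pattern(read_one_error, PATTERN_3TSS)
relaxed_hits = find_pattern(read_one_error, PATTERN_3TSS, max_mismatches=1)
```

The second read is missed under perfect matching but recovered when one mismatch is tolerated, which is the kind of effect that would raise the detected fraction when the matching criterion is relaxed.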
We conclude that many sequence reads in dbEST can be cleaned by unambiguously determining their terminus structures and accurately extracting their final cleaned sequences. EST terminus information will help identify and highlight EST chimeras, as well as other abnormalities, and thereby reduce the cascading, deleterious impact of spurious and chimeric sequences in dbEST on many downstream EST analyses (e.g., NCBI UniGene).


Click on the links below for abstracts of the workshops and tutorials presented at OCCBIO 2007.
Session Abstracts
Tutorials