
Session 1a. Genomic Variation and Proteomics
Nora L. Nock, Li Li and Robert C. Elston. Modeling Genetic and Environmental Factors in Biological Systems Using Structural Equation Modeling: An Application to Energy Balance
To improve our understanding of the role(s) that genes and environmental factors play in a complex disease, we need statistical approaches that model multiple factors simultaneously in a hierarchical manner that aims to reflect the underlying biological system(s). We present an approach that models genes as latent constructs, defined by multiple variants (single nucleotide polymorphisms, SNPs) within each gene, using the multivariate statistical framework of structural equation modeling (SEM) to model multiple, putative genetic and environmental factors involved in energy imbalance (‘obesity’) using subjects from a colon polyp case-control study. We found that modeling constructs for the leptin receptor (LEPR) gene (defined by SNPs rs1137100, rs1137101, rs1805096, rs6588147) and the fat mass-and-obesity-associated (FTO) gene (defined by SNPs rs9939609, rs1421085, rs8044769) together with demographic (age, race, gender), physical activity, diet and sleep variables increased the strength of the association (βstd=-0.13 ± 0.06; p=0.03) between the FTO and obesity constructs compared to that observed in a reduced model with only the FTO and LEPR constructs and demographic variables (βstd=-0.05 ± 0.03; p=0.08). Several indirect paths, including an association between the LEPR and physical activity constructs (βstd=-0.15 ± 0.04; p=0.01), were found. Interestingly, removing FTO revealed a marginal association between the LEPR and obesity constructs (βstd=0.24 ± 0.14; p=0.09), which was not present when FTO was in the model. These results illustrate the importance of modeling multiple relevant genes and other factors in the same model, which is a major strength of this approach. Moreover, our latent gene construct approach exploits the correlation structure between SNPs while capturing overall effects of variation in that gene, which will enable better utilization of candidate gene and genome-wide SNP array data.
Jing Li and Kanchana Narayanan. Visualization and Functional Analysis of Genome-Wide Association
Genome-wide association studies (GWAS) provide a new and powerful approach to investigate the effect of inherited genetic variation on risks of complex diseases. With recent advances in genotyping technology, genome-wide association studies are now becoming a reality. Within the past two years, scientists have successfully replicated genetic risks of several complex diseases including cancers, obesity, and type 2 diabetes using GWAS, and more data from GWAS are expected at an accelerated rate. However, management, analysis, visualization, and interpretation of genome-wide association data are particularly difficult, primarily because a GWAS may consist of hundreds of thousands of SNPs (single nucleotide polymorphisms) from thousands of individuals. This paper describes the features and implementation of a web application tool named MAVEN for Management, Analysis, Visualization and rEsults shariNg of GWA data using cutting-edge technologies. In addition, MAVEN seamlessly integrates users' own data with databases at the National Center for Biotechnology Information (NCBI), which allows users to directly obtain functional annotations of SNPs and genes that are relevant to their own research interests. MAVEN can therefore effectively facilitate the functional analysis of GWAS.
Chao Yuan, Gaurav Rana, Jinsook Chang, Rob Ewing and Mark Chance. Comparison of label free and 18O labeling mass spectrometry in relative protein quantification
Mass-spectrometry-based quantification methods have been increasingly applied to measure proteomic changes in biological systems between different physiological states. In this report, we compared two popular mass-spectrometry-based quantification strategies, stable isotope labeling and label free approaches. We spiked known amounts of standard peptides into a complex biological sample and analyzed this mixture with both stable isotope 18O labeling and label free mass spectrometry methods. We optimized data pre-processing and normalization algorithms for each method, and compared their sensitivities and accuracies. We found that both methods gave relatively accurate results, and the label free methods provided higher proteome coverage.
Session 1b. Bioinformatics of Disease. Organizers: Bruce Aronow (UC), Anil Jegga (CCHMC) and Philip Payne (OSU)
Philip Payne et al. Conceptual Knowledge Engineering Approaches to High-throughput Phenotyping.
Bruce Aronow et al. Applied Diseaseomics: Assembling Gene and Disease-Related Feature Networks to Enable Disease-Specific Inferential Reasoning for Causal Factors and Candidate Therapeutics.
Keith Marsolo. Construction of a Clinical Data Warehouse and Mining Environment: The CCHMC i2b2 Project.
Session 2c. Massively Parallel Sequencing. Organizers: Kun Huang (OSU) and Tea Meulia (OARDC)
Hatice Gulcin Ozer, Terry Camerlengo, Tao Zuo, Tim Huang and Kun Huang. A New Method for Mapping Short DNA Sequencing Reads by Using Quality Scores
New high-throughput sequencing technologies can generate millions of short DNA sequences that need to be mapped to the reference genome accurately. The majority of mapping algorithms handle variations in the quality of these short sequences by allowing more mismatches and/or gaps in the alignment, and focus on improving runtime. In this paper, we investigate ways to classify quality scores of short DNA sequencing reads and integrate them into the mapping process. We specifically studied the quality scores that suggest two alternate bases (the top quality scores for two bases are close to each other at the locus) and the use of such bases to improve mapping accuracy.
Our method includes generation of alternative sequences when there are alternate-quality bases in a sequence read and mapping of these alternative sequences to the reference genome. In a test using a piece of ChIP-seq data from an epigenetic study, we generated and mapped alternatives of 222,755 sequence reads (out of the original 2.5 million reads) that could not be mapped to the reference genome by the Eland algorithm. With this approach we were able to map 12.8% of these sequence reads with alternative bases to unique positions in the genome.
In this study, we demonstrate that use of alternative bases in mapping algorithms can improve mapping results dramatically.
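The generation of alternative sequences described above can be illustrated with a minimal sketch. It is not the authors' implementation: the per-position input format, the `max_gap` threshold for calling two quality scores "close", and the `max_ambiguous` cap on expansion are all assumptions introduced here for illustration.

```python
from itertools import product


def alternative_reads(calls, max_gap=5, max_ambiguous=4):
    """Expand a read into alternatives at positions with two close base calls.

    calls: one list per read position, holding (base, quality) pairs.
    max_gap and max_ambiguous are hypothetical thresholds, not values
    taken from the abstract.
    """
    options = []
    for position in calls:
        ranked = sorted(position, key=lambda bq: bq[1], reverse=True)
        best = ranked[0]
        choices = [best[0]]
        # Keep the runner-up base only when its score is close to the best.
        if len(ranked) > 1 and best[1] - ranked[1][1] <= max_gap:
            choices.append(ranked[1][0])
        options.append(choices)

    # Avoid combinatorial blow-up: fall back to the top call everywhere.
    ambiguous = sum(1 for choices in options if len(choices) > 1)
    if ambiguous > max_ambiguous:
        return ["".join(choices[0] for choices in options)]
    return ["".join(combo) for combo in product(*options)]


# Toy read of three positions; only the middle one has two close scores.
read = [
    [("A", 30), ("C", 2), ("G", 1), ("T", 0)],
    [("C", 18), ("T", 16), ("A", 0), ("G", 0)],
    [("G", 25), ("A", 3), ("C", 0), ("T", 0)],
]
alternatives = alternative_reads(read)
print(alternatives)
```

Each alternative would then be passed to a mapper in place of the original read; the cap on ambiguous positions keeps the number of candidates per read bounded at 2 to the power of `max_ambiguous`.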
Xin Wang, Mingxiang Teng, Guohua Wang, Yuming Zhao, Weixing Feng, Lang Li, Jeremy Sanford and Yunlong Liu. xIP-seq platform: an integrative framework for high-throughput sequencing data analysis
We report an integrated bioinformatics platform for next-generation sequencing data analysis: the xIP-seq module in the GenePattern package. The motivation for this tool is to provide an analysis pipeline for data derived from a variety of experiments that use immunoprecipitation followed by next-generation sequencing. These experiments include ChIP-seq (chromatin immunoprecipitation followed by sequencing) for analysis of DNA-binding proteins, CLIP-seq (cross-linking immunoprecipitation followed by sequencing) for analysis of RNA-binding proteins, and MeDIP-seq (methylated DNA immunoprecipitation followed by sequencing) for DNA methylation profiles. The xIP-seq platform attempts to provide standardized data pre-processing workflows and several bioinformatics modules for tertiary analysis. These modules, as well as new functions added by users, can be easily integrated into pipelines that provide automatic solutions for specific biological questions. As an example, a “CLIP-seq Mapper” pipeline was created to map CLIP-seq-derived data to the human genome and identify the genome-wide binding patterns of the SFRS1 protein. The implementation demonstrates the value of the xIP-seq module for handling massive amounts of next-generation sequencing data and for providing automatic, flexible, and intelligent enterprise-level bioinformatics solutions. Availability: http://watson.compbio.iupui.edu:8081/gp
Terry Camerlengo, Hatice Gulcin Ozer, Guojuan Zhang, Tarek Joobeur, Tea Meulia, Joanne Trgovcich and Kun Huang. Computational challenges and solutions to the analysis of microRNA profiles in virally-infected cells derived by massively parallel sequencing
In this paper we report an ongoing project for identifying HCMV miRNAs expressed in infected human cells using the new massively parallel sequencing technology with the Solexa Sequencer. We developed a data processing pipeline for analyzing such data, including mapping segments to genomes, detecting highly expressed sequences and their loci, comparing sequences to existing databases and selecting candidate miRNAs for experimental validation. We identified 114 putative virally-derived miRNAs with high expression levels that included 9 out of 10 known HCMV miRNAs, partially validating our methods. This observation also suggested that other identified sequences with high levels of expression are potential miRNAs and that this method is an effective way of discovering new small regulatory RNAs. Validation of putative novel viral miRNAs is underway, as are efforts to identify primary transcripts or introns from which they are derived. Future directions include designing the most statistically robust selection criteria, designing methods to measure viral-induced changes in the human miRNA expression profile, and identifying the targets of the miRNAs in the viral and human genomes.
Xiaodong Bai and Parwinder Grewal. TIGERA: A new Tool for Illumina Gene Expression Reads Analysis
Next-generation sequencing platforms, including Illumina, 454, and SOLiD, are emerging as easier, faster, and cheaper alternatives to traditional sequencing platforms. Illumina digital gene expression (DGE) tag profiling allows comprehensive analysis of differentially expressed genes in organisms. Computer programs are necessary to handle the overwhelming amount of data generated by the Illumina Genome Analyzer. Here we report the design and implementation of a program for the analysis of differential gene expression based on Illumina data. The program TIGERA (Tool for Illumina Gene Expression Reads Analysis) was written in Perl using newly implemented and preexisting algorithms, with a simple graphical user interface. The program performs the following tasks automatically after the required inputs are provided. The expression levels of high-quality Illumina tags for each of the two groups of libraries are determined and normalized as transcripts per million (TPM). The Illumina tags are mapped to the annotated reference sequences to identify uniquely mapped tags. The mapping results are validated using information generated by digital restriction enzyme digestion of the reference sequences. Based on whether the tags matched unique or multiple reference sequences after validation, the tags are grouped into three categories: one tag-one reference, one tag-one gene, and one tag-multiple genes. The tags within the first two categories are analyzed further to determine the reference sequences that contain unique expression levels or have potential alternative transcript splicing products. A Poisson mixture model is applied to analyze the differential expression of reference sequences with unique expression levels and of the tags not matched to the reference sequences. The progress of the analysis is monitored and reported.
The analysis results are presented as text files and also deposited in a MySQL database that can be visualized and searched in Internet browsers. Two biological replicates of the DGE tag libraries of the infective juveniles of the entomopathogenic nematode Heterorhabditis bacteriophora TT01 and GPS11 strains were sequenced using the Illumina platform to demonstrate the performance of the program.
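Two of the steps listed in this abstract, TPM normalization and the grouping of tags into three categories, can be sketched as follows. This is an illustrative Python sketch, not TIGERA code (which the abstract says is written in Perl). The data structures, the tag and reference names, and the reading of "one tag-one gene" as "several references that all belong to one gene" are assumptions.

```python
def to_tpm(tag_counts):
    """Scale raw tag counts so that each library sums to one million."""
    total = sum(tag_counts.values())
    if total == 0:
        return {tag: 0.0 for tag in tag_counts}
    return {tag: count * 1_000_000 / total for tag, count in tag_counts.items()}


def categorize(tag_hits, ref_to_gene):
    """Assign each tag to a category from the references it matched.

    tag_hits: tag -> set of matched reference sequence ids (hypothetical).
    ref_to_gene: reference sequence id -> gene id (hypothetical).
    """
    categories = {}
    for tag, refs in tag_hits.items():
        genes = {ref_to_gene[ref] for ref in refs}
        if not refs:
            categories[tag] = "unmatched"
        elif len(refs) == 1:
            categories[tag] = "one tag-one reference"
        elif len(genes) == 1:
            # Assumed meaning: several transcripts of the same gene.
            categories[tag] = "one tag-one gene"
        else:
            categories[tag] = "one tag-multiple genes"
    return categories


# Toy inputs; all identifiers are made up for illustration.
counts = {"tagA": 50, "tagB": 30, "tagC": 20}
tpm = to_tpm(counts)

hits = {"tagA": {"ref1"}, "tagB": {"ref2", "ref3"}, "tagC": {"ref1", "ref4"}}
genes = {"ref1": "g1", "ref2": "g2", "ref3": "g2", "ref4": "g3"}
categories = categorize(hits, genes)
print(tpm, categories)
```

Normalizing to a common total is what makes tag abundances comparable between the two groups of libraries before any differential expression test is applied.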
Session 2d. RNA: Master Regulator of the Genome. Organizers: Neocles Leontis (BGSU) and Alexei Fedorov (UT)
Anton I. Petrov, Jesse Stombaugh, Craig L. Zirbel and Neocles B. Leontis. IsoDiscrepancy Matrices for Non-Watson-Crick Basepairs: Tools for RNA Structural Bioinformatics
Much of the eukaryotic genome is transcribed and there is evidence that many of the RNAs produced are structured and biologically functional. Most of the transcribed RNAs do not code for proteins and identification of such non-coding RNAs (ncRNA) in genomes has become a focus of intense research. Related areas of research include improvement of methods for predicting RNA secondary and three-dimensional structures from sequence. Structure prediction is considered a crucial step in identifying the function of a new RNA. Knowledge of RNA structure is also crucial for designing effective genomic search and sequence alignment algorithms for ncRNAs. Many of the nominally single-stranded hairpin, internal, and junction “loop” regions of RNA secondary structures in fact form uniquely folded 3D motifs. These elements are largely structured by non-Watson-Crick basepairs. Many 3D motifs are recurrent, meaning they occur in different RNAs. Recurrent motifs have the same 3D structure but not necessarily the same sequence. Since the database of RNA 3D structures now contains a significant number of biologically active, structured RNAs, including ribosomal RNAs, ribozymes, and riboswitches, we can identify the sequence variability for recurrent motifs in x-ray crystal structures. We describe a methodology for identifying the sequence variability of a given RNA internal loop. We use our search program, FR3D, to search the 3D structure database for geometrically similar motif instances that share the same pattern of basepairs. By comparing instances we determine the most likely locations of inserted ("bulged") nucleotides. We apply our analysis of RNA basepair isostericity and occurrence frequencies to suggest likely basepair substitutions. In particular, we recently introduced the IsoDiscrepancy Index (IDI) to quantify basepair isostericities and derived 4x4 IDI Tables for each base combination in each basepair family. 
We illustrate how these tables can be applied to predict the most likely base substitutions that occur in a 3D motif and to construct probabilistic models for the motif.
William Ray, Hatice Ozer, David Armbruster and Charles Daniels. Beyond identity - When classical homology searching fails, Why, and What you can do about it
Multiple Sequence Alignments of both protein and nucleic-acid sequences are a ubiquitous method for modeling sequence families that pervades every biological domain. Despite their utility, MSAs and methods derived from them fail to capture interpositional relationships that can be as critical to family membership as are positional identities.
We have recently developed novel methods, MAVL and StickWRLD, to quantitate and visualize additional features of sequence family models, and have identified interpositional dependencies at the residue level that are critical indicators of family membership in many sequence families. Some of these dependencies cannot be modeled by any existing modeling method, including Hidden Markov Models. In certain cases, the dependencies are sufficiently strong that all common methods score sequences that are explicitly excluded from the family as better candidates than any actual members.
The tRNA intron-endonuclease targets in the Archaea are such a family. Originally characterized as excised introns from archaeal tRNAs, some of which function as guide RNAs to target O-methylation of the ribosomal RNAs, these sequences have a very short characteristic signature and allow significant divergence. There is insufficient information in the base conservation to create useful scoring models. Using our tools we have identified critical residue interdependencies within the endonuclease target that enable detection of introns in whole-genomic sequence. Many of these introns occur outside tRNAs, including some that are excised from protein mRNA. The dependencies we identify correspond to a Markov network of relationships over the positional identities. The contribution of each node's Markov blanket is incorporated via blending with the positional conservation using a voting algorithm. In this paper we present the results of this analysis and the generalization of our modeling method to arbitrary RNA families. This generalization allows development of models of similar power for arbitrary RNA families.
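To make the idea of an interpositional dependency concrete, the sketch below computes mutual information between two alignment columns. This is a standard stand-in chosen for illustration; it is not MAVL, StickWRLD, or the Markov-network scoring described above, and the toy alignment is invented.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(alignment)
    pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


# Toy alignment: columns 0 and 2 always form a G-C or C-G pair,
# while column 1 varies independently of column 0.
alignment = ["GAC", "CAG", "GUC", "CUG"]
paired = mutual_information(alignment, 0, 2)
unpaired = mutual_information(alignment, 0, 1)
print(paired, unpaired)
```

In this toy case every column is an even split between two residues, so per-position conservation carries no signal at all, yet columns 0 and 2 share a full bit of information. That is the kind of relationship a purely positional model cannot score.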
Robert Forties and Ralf Bundschuh. Modeling nucleic acid structure in the presence of single-stranded binding proteins
There are many important proteins which bind single-stranded nucleic acids, such as the nucleocapsid protein in HIV, the RecA DNA repair protein in bacteria, and all proteins involved in mRNA splicing and translation. We extend the Vienna Package for quantitatively modeling the secondary structure of nucleic acids to include proteins which bind to unpaired portions of the nucleic acid. All parameters needed to model nucleic acid secondary structures in the absence of proteins have been previously measured. This leaves the footprint and sequence dependent binding affinity of the protein as adjustable parameters of our model. Using this model we are able to predict the probability of the protein binding at any position in the nucleic acid sequence, the impact of the protein on nucleic acid base pairing, the end-to-end distance distribution for the nucleic acid, and FRET distributions for fluorophores attached to the nucleic acid.
Prakash et al. Computation of putative targets for human and mouse snoRNAs responsible for Prader-Willi Syndrome.
Session 3e. Regulatory Genomics. Organizers: Lonnie Welch (OU) and Erich Grotewold (OSU)
Alper Yilmaz, Saranyan Palaniswamy, Ramana Davuluri and Erich Grotewold. Discovery of Regulatory Networks in Plants by Linking Promoter and Transcription Factor Databases
Transcription factors (TFs) control the levels as well as the sites and times of expression of a discrete set of target genes by binding to specific cis-regulatory elements in the corresponding promoter regions. They can function as master control switches for the regulation of metabolic pathways, cell differentiation and the cell cycle. Thus, the state of a living cell is the result of regulated transcription of thousands of genes in which TFs are major players.
A first step in discovering regulatory networks is to establish the organization of cis-elements in promoters and the direct targets of TFs. Towards this goal, our lab has developed two publicly available TF and promoter databases. AGRIS (arabidopsis.med.ohio-state.edu) is dedicated to revealing regulatory networks in Arabidopsis and is currently composed of databases of putative cis-elements (AtcisDB) and TFs (AtTFDB). The regulatory network in Arabidopsis is constructed from available data by linking cis-regulatory elements and transcription factors; these interactions are visualized by AtRegNet.
GRASSIUS (grassius.org) provides regulatory information gathered from computational and experimental sources for the grasses, initially including maize, rice, sorghum and sugarcane. Promoter sequences across these grasses and cis-elements important for gene expression are gathered in GrassPROMDB. GrassTFDB contains information on TFs, their DNA-binding properties and the genes that they have been experimentally demonstrated to bind/regulate. GrassREGNET is the ultimate component of GRASSIUS, currently under development, and will provide a dynamic relationship between the contents of GrassTFDB and GrassPROMDB in the light of experimentally verified interactions, helping visualize spatio-temporal gene regulation and regulatory networks.
Jens Lichtenberg, Mohit Alam, Thomas Bitterman, Frank Drews, Klaus Ecker, Laura Elnitski, Susan Evans, Erich Grotewold, Dazhang Gu, Edwin Jacox, Kyle Kurz, Stephen S. Lee, Xiaoyu Liang, Pooja M. Majmudar, Paul Morris, Chase Nelson, Eric Stockinger, Joshua D. Welch, Sarah Wyatt, Alper Yilmaz, Lonnie R. Welch and Matt Geisler. Construction of Regulatory Encyclopedias of Genomes: Strategies and Case Studies
Encyclopedias of regulatory genomic elements provide a foundation for disease diagnosis, disease treatment and performance enhancement. The construction of complete encyclopedias of organism-specific genomic elements involved in gene regulation remains a challenge. A bioinformatics tool for automatic discovery of regulatory genomic elements is presented, and the applicability of the tool is shown through several case studies. The tool is available at http://word-seeker.org.
Jeffrey Parvin. Identification of a breast cancer associated regulatory network
This project tests a new framework for discovery of genes involved in the breast carcinogenesis process. Among families that have a predisposition to breast cancer, approximately 25% have inherited mutations in either breast cancer associated (BRCA) genes BRCA1 or BRCA2, but the predisposing mutated genes in the majority of the families are unknown. BRCA1 and BRCA2 gene products both regulate cellular pathways that involve DNA repair and centrosome duplication, and their expression is correlated in microarray analyses in many cell types. We hypothesize that other unidentified BRCA genes may be involved in the same pathways that BRCA1 and BRCA2 regulate, and thus may be discovered by identifying genes whose expression also is correlated with that of BRCA1 and BRCA2. We interrogate public-domain gene expression databases using newly developed computational tools that include combinatorial and algebraic clustering methods to identify genes whose expression correlates with these tumor suppressors. Identified genes are then tested in the laboratory. RNA interference is used to disrupt the expression of the candidate BRCA gene products in two cell-based assays that are dependent on BRCA1 and BRCA2 expression. The first assay models the regulation of homology-directed recombination repair of double-strand DNA breaks, and the second assay tests the control of duplication of the centrosome. We have selected nine genes that tightly cluster with BRCA1 and BRCA2 expression in multiple datasets, and these nine genes have never before been linked with the two reference genes. When tested in the lab using RNA interference to deplete the specific protein, six of these genes were found to affect homologous recombination and four affected the regulation of centrosome number. 
If the informatics analysis is considered a screening tool to find genes/proteins involved in breast carcinogenesis, then this approach has an extremely high success rate in finding proteins that impact phenotypes regulated by BRCA1 and BRCA2. In summary, we employ a novel experimental framework that develops new bioinformatic tools for identifying candidate genes whose regulation suggests the potential for involvement in breast carcinogenesis, and we validate the genes in the lab. This experimental framework may also be applicable to the identification of networks of genes involved in common pathways in other disease processes.
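The screening idea, finding genes whose expression tracks both reference genes, can be sketched with a plain Pearson correlation filter. This is a simplified stand-in: the abstract describes combinatorial and algebraic clustering methods, which are not reproduced here. The `min_r` cutoff, the candidate gene names and the expression values are all invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def rank_coexpressed(expression, references, min_r=0.8):
    """Return genes correlated with every reference gene, best first.

    A gene is scored by its weakest correlation across the references,
    so it must track all of them. min_r is a hypothetical cutoff.
    """
    hits = []
    for gene, profile in expression.items():
        if gene in references:
            continue
        score = min(pearson(profile, expression[ref]) for ref in references)
        if score >= min_r:
            hits.append((gene, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up expression profiles across five samples.
expression = {
    "BRCA1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "BRCA2": [1.5, 2.5, 3.0, 4.5, 5.5],
    "GENE_X": [2.0, 4.1, 6.0, 8.2, 9.9],
    "GENE_Y": [5.0, 1.0, 4.0, 2.0, 3.0],
}
hits = rank_coexpressed(expression, ["BRCA1", "BRCA2"])
print(hits)
```

Scoring by the minimum correlation reflects the hypothesis in the abstract that candidates should be correlated with both BRCA1 and BRCA2; in practice the filter would be repeated across multiple datasets before any gene is taken to the laboratory assays.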
Xiangjia Min, Gregory Butler, Reginald Storms and Adrian Tsang. Comparative Assessment of DNA Assemblers for Assembling Expressed Sequence Tags
Assembling expressed sequence tags (ESTs) is essential for removing redundancy and generating long virtual transcripts for EST annotation and gene finding. A number of assemblers are available, but there is a lack of detailed comparative assessment of the strengths and weaknesses of these assemblers. We compared three assemblers, Phrap, CAP3 and TIGR Assembler (TA), using Aspergillus niger and Phanerochaete chrysosporium EST data. Phrap assembled more ESTs into contigs than TA and CAP3. Among the contigs and singletons generated by the three assemblers, 67 – 90% were identical. The number of contigs and singletons assembled by Phrap provides an approximate estimate of the unique gene number, while the numbers generated by TA and CAP3 provide an approximate estimate of unique transcripts, since both TA and CAP3 are more discriminating of alternatively spliced transcripts. The error rate in contigs generated by Phrap was slightly higher than in contigs generated by TA or CAP3. Phrap is thus recommended for EST assembly aimed at generating a set of unisequences with minimum redundancy for estimating the unigene number, and TA or CAP3 for EST assembly aimed at finding unique transcripts, i.e., for identification of alternative splicing.
Amanda Hanes, Michael Raymer, Travis Doom and Dan Krane. A comparison of codon usage trends in prokaryotes
Codon usage bias is an effective measure of the differences among organisms at a genomic level. These genomic differences also reflect some differences in the organisms’ lifestyles and physiology. Here we demonstrate that prokaryotic obligate intracellular parasites and symbionts have a codon usage pattern that differs significantly from that of exclusively free-living prokaryotes. This result is valuable in that it suggests that the habitat of an organism may directly influence that organism’s use of synonymous codons, which in turn demonstrates evidence of an evolutionary mechanism that operates at a finer molecular level than that of amino acids and proteins.
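As a concrete illustration of what a codon usage pattern is, the sketch below counts codons in a coding sequence and reports how often each synonymous codon is chosen within one amino acid family. It is a minimal example, not the statistical comparison used in the study; the toy sequence and the choice of the lysine family (AAA, AAG in the standard genetic code) are assumptions made for the example.

```python
from collections import Counter


def codon_counts(cds):
    """Count in-frame codons, ignoring any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def synonymous_usage(cds, family):
    """Fraction of each codon among the synonymous codons in `family`."""
    counts = codon_counts(cds)
    total = sum(counts[codon] for codon in family)
    if total == 0:
        return {}
    return {codon: counts[codon] / total for codon in family}


# Toy coding sequence: start codon, four lysine codons, stop codon.
cds = "ATG" + "AAA" + "AAG" + "AAA" + "AAA" + "TAA"
usage = synonymous_usage(cds, ["AAA", "AAG"])
print(usage)
```

Computing such fractions for every amino acid family over all genes in a genome gives a usage profile per organism, and it is profiles of this kind that can be compared between intracellular and free-living prokaryotes.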
Shannon Steinfadt and Kevin Schaffer. Parallel Approaches for SWAMP Sequence Alignment
This document is a summary and overview of several approaches to implementing the local sequence alignment algorithms known as SWAMP and SWAMP+ on commercially available hardware. Using a Smith-Waterman style of alignment, these parallel algorithms have several innovative extensions that take advantage of the ASC associative computing model. They maintain speed and accuracy while producing, in an automated way, a richer set of results than is currently available.
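For readers unfamiliar with the underlying recurrence, here is a minimal sequential Smith-Waterman scoring sketch. It shows only the classic dynamic programming fill that SWAMP builds on; it does not implement the ASC parallelization or the SWAMP+ extensions, and the scoring values (match 2, mismatch -1, linear gap -1) are assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score and the cell where it ends.

    Uses a linear gap penalty; the scoring parameters are illustrative.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diagonal, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos


score, end = smith_waterman_score("TTACGTT", "GGACGGG")
print(score, end)
```

Every cell depends only on its upper, left and upper-left neighbours, so all cells on one anti-diagonal can be computed at the same time. That independence is what makes the algorithm a natural fit for parallel hardware.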
We consider four different hardware architectures for the realization of the ASC model. These are the ClearSpeed CSX processor, the NVIDIA GPGPU graphics processors, FPGAs and the IBM Cell Processors.
Session 4g. Evolution of Genomes and Origins of Species. Organizers: Dan Janies (OSU) and Helen Piontkivska (KSU)
Mark D. Adams. The evolution of multidrug resistance in a hospital pathogen
The recent emergence of multidrug resistance (MDR) in Acinetobacter baumannii has raised concern in healthcare settings worldwide. In order to understand the repertoire of resistance determinants and their organization and origins, we compared the genome sequences of three MDR and three drug-susceptible A. baumannii isolates. The entire MDR phenotype can be explained by the acquisition of discrete resistance determinants distributed throughout the genome. A resistance island (RI) with a variable composition of resistance determinants interspersed with transposons, integrons, and other mobile genetic elements is a significant, but not universal, contributor to the MDR phenotype. Variable resistance gene composition among identical clone types from a single outbreak suggests dynamic and active horizontal transfer. 475 genes are shared among all six clinical isolates, but absent from the related environmental species Acinetobacter baylyi ADP1. These genes are enriched for transcription factors and transporters and suggest physiological features of A. baumannii that are related to adaptation for growth in association with humans.
Daniel Janies, Travis Treseder, Boyan Alexandrov, Farhat Habib, Jennifer Chen, Renato Ferreira, Ümit Çatalyürek, Andrés Varón and Ward Wheeler. The Supramap project: tracing the spread of pathogens over time, space, and various hosts
Emerging infectious diseases present critical issues of public health and economic welfare. As demonstrated by the response to Severe Acute Respiratory Syndrome (SARS) and influenza, novel diseases are being addressed via rapid genomic sequencing. However, our ability to make sense of these data lags behind acquisition. First, sequence alignment and phylogenetic analyses of large datasets comprised of many genomes are computationally difficult. These operations require novel algorithmic approaches, large amounts of memory, and parallel architectures. Next, even once satisfactory phylogenetic trees are produced, science has just begun to understand how disease-causing organisms evolve and travel over various hosts and geography to become epidemics.
We have developed a user-friendly web-based application called Supramap (http://supramap.osu.edu). The application supports a workflow connecting analysis of raw genomic sequence data with geographic information systems (GIS). On the Supramap web site we have created an interface where users can register, upload raw data files, name projects, and organize sets of data files into jobs to be executed on a computing cluster.
Using Supramap we have created interactive genomic and geographic maps. These maps employ phylogenetic trees and GIS to reconstruct the evolution and spread of avian influenza (H5N1) and other diseases.
In contrast to syndromic surveillance, Supramap is novel since it puts genetic and functional data on pathogens and hosts in an evolutionary and geographic context. Use of these contexts in concert allows the user to integrate data on pathogen diversity with any other data layer of interest, such as underlying host distribution, population data, transit systems or environmental conditions. The user can focus on geographic areas or time points of interest to study regionally specific mutations that allow pathogens to jump from animals to humans, confer resistance to drugs, or cause deadly or debilitating infections - all key sources of intelligence that allow for proper forecasting, diagnosis and response to an outbreak.
Paul Morris, Alexandra Schmucker and Nicole Vanduzen. Computational prediction of the oomycete interactome
Computational predictions of protein-protein interactions are now being applied to predict associations that are conserved across the major eukaryotic lineages of plants, animals and fungi. These predictions rely on identifying orthologous pairs of proteins that have been experimentally verified to interact in other organisms. Comparisons across these kingdoms may be robust for many processes because there has been very little horizontal exchange of domains after separation of these kingdoms. In contrast, analysis of diatom and oomycete genomes has revealed multiple examples of horizontal transfer events in these genomes. These include both transfer of genes to the host nucleus from the photosynthetic endosymbiont in the common ancestor of oomycetes and diatoms, and the acquisition of bacterial genes. Acquisition of DNA from bacterial genomes has occurred both prior to and subsequent to the separation of the ancestral genomes of diatoms and oomycetes. To assess the importance of horizontal transfer events in shaping new metabolic networks in oomycete genomes, we determined the phylogenetic origins of all the domains in the sulfate assimilation, and serine and lysine biosynthetic pathways. Phylogenetic analysis of the proteins in these pathways revealed multiple examples of horizontal transfer from both bacterial genomes and the photosynthetic endosymbiont in the ancestral genome of oomycetes. Since the proteins of these pathways do not share a common phylogenetic origin, the regulation of these pathways may be unique to oomycetes. Regulatory networks within oomycetes may therefore be expected to be equally complex. In spite of these inherent problems, a predicted interactome is a valuable first step to help map out conserved signaling networks in oomycete genomes.
We will report on our use of the predicted Arabidopsis interactome, and yeast and human interaction databases, in combination with the reciprocal smallest distance algorithm to help identify conserved networks in oomycete genomes.
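The ortholog-based prediction step can be sketched as a simple transfer of known interactions through an ortholog map. This sketch assumes the ortholog assignments have already been computed (for instance by a reciprocal smallest distance search, which is not implemented here), and every protein identifier below is hypothetical.

```python
def predict_interologs(known_interactions, orthologs):
    """Transfer interactions from a reference organism to a target organism.

    known_interactions: iterable of (protein_a, protein_b) pairs that are
        experimentally verified in the reference organism.
    orthologs: reference protein -> list of target-organism orthologs.
    Returns a set of unordered target pairs, stored as sorted tuples.
    """
    predicted = set()
    for a, b in known_interactions:
        # A pair is transferred only when both partners have an ortholog.
        for target_a in orthologs.get(a, ()):
            for target_b in orthologs.get(b, ()):
                predicted.add(tuple(sorted((target_a, target_b))))
    return predicted


# Hypothetical identifiers: AT_* for the reference, OO_* for the oomycete.
known = [("AT_P1", "AT_P2"), ("AT_P2", "AT_P3")]
ortholog_map = {"AT_P1": ["OO_1"], "AT_P2": ["OO_2"]}  # AT_P3 has no ortholog
predicted = predict_interologs(known, ortholog_map)
print(predicted)
```

Because a pair is dropped whenever either partner lacks an ortholog, proteins acquired by horizontal transfer from bacteria or the endosymbiont will tend to fall outside the predicted network, which is consistent with the caveats raised in the abstract.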
Helen Piontkivska and Sinu Paul. HIV evolution: fast or slow?
It is known that in the human immunodeficiency virus (HIV) genome, regions responsible for interactions with the host’s immune system, namely cytotoxic T-lymphocyte (CTL) epitopes, tend to be clustered together and are sometimes found in more conserved parts of the genome. On the other hand, more variable regions tend to have a lower density of CTL epitopes. Furthermore, a high recombination rate, coupled with a high mutation rate, results in overall high diversity of HIV sequences, which is expected to result in little if any association between different regions of the genome. Or does it? Employing data mining techniques, we show that some rather strong associations between different regions of the HIV genome can indeed be detected, even when circulating recombinant forms are considered. This can partly be attributed to strong functional constraints acting on protein sequences, and on certain CTL epitope regions in particular.
Session 4h. Bioimaging. Organizers: Jundong Liu (OU), David Wilson (CWRU) and Metin Gurcan (OSU)
Jundong Liu, Automatic Multiple Sclerosis Detection based on L2E Measure
Metin Gurcan, Clinical Image Analysis
Don Stredney, Recent trends in Biomedical Visualization
Advanced imaging technologies are rapidly spreading across the many scales involved in systems biology. With the increasing capture of data across different modalities, scales, and follow-up studies, users increasingly face a deluge of imaging data. There is a growing need to process extremely large image sets to derive salience and gain insight from these mounting sources. Recent developments in graphics processing units (GPUs) provide accelerated out-of-core processing. GPUs offer a low-cost, tractable way to accelerate the processing of image data. In addition, visualization techniques at the desk-side level provide unprecedented processing power to drive increasingly complex simulations such as real-time surgical simulations. This presentation will cover the current approaches to large-scale visualization at OSC. Hardware configurations will be presented with examples of remote rendering exploiting GPU clusters. In addition, examples that exploit GPU power at the desk will be presented, including medical and veterinary education and surgical simulations being used in multi-institutional trials.
Session 5i. Systems Biology and Networks
Kurtis Eisermann, Adina Brett, Anton Bazarov, Ethan Knapp, Helen Piontkivska and Gail Fraizer. Uncovering androgen responsive regulatory networks in prostate cancer
An important goal for prostate cancer therapy is to identify novel mechanisms of androgen signaling that may provide new targets for androgen blockade therapy. Androgen-regulated target genes continue to be identified, and include genes with regulatory regions containing 1) classical dimeric androgen receptor elements (ARE), or 2) sites for other transcription factors (TFs) that tether AR to a regulatory region lacking ARE sites, or 3) non-canonical ARE half-sites. The latter category of half-sites is becoming increasingly important, because up to 80% of potential AR regulatory regions identified by ChIP-chip technology contain these monomeric half-sites. Determining which of these predicted target genes and androgen pathways are functional is very important, as they contribute to our understanding of prostate cancer progression. Microarray analyses were used to identify genes expressed in laser-captured prostate cancer epithelial cells, leading to the identification of pathways of co-regulated genes. Important regulatory regions are expected to be conserved between mammalian genomes; thus, comparative evolutionary analyses were used to identify evolutionarily conserved TF binding sites. Notably, non-canonical ARE half-sites were identified in a majority of the gene promoters analyzed, and these sites were adjacent to evolutionarily conserved zinc finger transcription factor sites. Subsequent ChIP assays showed that SP1, WT1 and AR indeed all bind to a common regulatory region, indicating potential for interaction between these TFs that in turn can modulate hormone responsiveness. Overall, our bioinformatics screening coupled with experimental validation has revealed critical components of regulatory networks important in prostate cancer cells and disease progression.
Xin Li, Sinan Erten, Gurkan Bebek, Mehmet Koyuturk and Jing Li. Comparative Analysis of Modularity in Biological Systems
In systems biology, comparative analysis of molecular interactions across diverse species indicates that conservation and divergence of networks can be used to understand functional evolution from a systems perspective. A key characteristic of these networks is their modularity, which contributes significantly to their robustness, as well as adaptability. In this paper, we investigate the evolution of modularity in biological networks through phylogenetic analysis of network modules. Namely, we develop a computational framework that identifies modules in networks of diverse species independently and projects these modules into the networks of other species, with a view to capturing the evolutionary trajectories of functional modules. These trajectories can then be used to reconstruct modular phylogenies and whole-network phylogenies, or to enhance identification of functional modules. In the context of phylogeny reconstruction, our experiments on a comprehensive collection of simulated and real networks show that comparison of networks based on module trajectories is more informative than other measures of network similarity. These results demonstrate the key role of modularity in the functional evolution of biological systems and motivate further investigation of the evolution of functional modules.
K.J. Abraham, Katrin Sameith and Francesco Falciani. Improving Functional Module Detection
There has been a great deal of recent interest in identifying functional modules from protein interaction and gene expression data. One commonly used computational technique, while asymptotically correct, frequently suffers from slow convergence. In this paper we outline and exploit the analogy between finding functional modules and finding haplotype blocks from genetic data to investigate a new technique for identifying functional modules that does not rely on Monte Carlo methodology. We discuss circumstances under which our algorithm may work but under which simulated annealing may not converge to known modules. We also suggest how our methodology might supplement and improve the performance of existing Monte Carlo searches.
Session 5j. Ohio student chapter of the International Society of Computational Biology (ISCB) Session
Jens Lichtenberg, Ohio University. Regulatory Genomic Signatures
Vivek Kaimal, University of Cincinnati. Microregulation of biological pathways and networks
Praveen Kumar Raj Kumar, Miami University. Bioinformatic analyses of alternative splicing in Chlamydomonas reinhardtii
Vishal Patel, Case Western Reserve University. Molecular Synergy Between Cancer Driver Genes Revealed in Protein-Protein Interaction Networks
Anton Petrov, Bowling Green State University. Building a Comprehensive Collection of RNA 3D Motifs
Van Anh Tran, Case Western Reserve University. Formal Concept Analysis in assessing genetic ancestry
Zidian Xie, The Ohio State University. Genome-wide identification of FLP/MYB88 putative direct targets in Arabidopsis
Yali Li, Case Western Reserve University. Detecting Association with Rare Genetic Variants in Common Diseases
Gokhan Yavas, Case Western Reserve University. Copy Number Variant Identification using Objective Function Optimization
Jessica Bates, University of Akron. Identification of nectar yeasts in Claytonia virginica using molecular approaches
Session 6k. Applications of Bioinformatics
Esley Heizer, Doug Raiford, Dan Krane and Michael Raymer. Perceived Cost of Auxotrophic Amino Acids in Two Bacterial Species
Amino acid biosynthetic pathways are highly conserved throughout all domains of life. Biosynthesis of amino acids requires the diversion of resources from energy production to amino acid production. The consequent energy cost of producing an individual amino acid can be estimated by adding the amount of ATP expended in production itself to the amount of potential energy lost. Some organisms lack the metabolic pathways required for the synthesis of some or all of their amino acids and must obtain them from their environment or their host organism. The energetic costs associated with this means of obtaining amino acids are largely a matter of speculation at present. This study examines the perceived cost of auxotrophic amino acids (amino acids an organism is unable to synthesize) in two bacteria (Bacillus cereus ATCC 10987 and Vibrio fischeri ES114). Auxotrophic amino acids in both organisms were found to be used preferentially in highly expressed genes and are therefore likely to be energetically inexpensive relative to those the organisms are capable of synthesizing themselves. A regression approach was used to computationally estimate the perceived costs to the organism.
I Jung Feng and Tomas Radivoyevitch. SNP-SNP interactions between dNTP supply enzymes and mismatch DNA repair in breast cancer
The dNTP supply system genes RRM1, DCTD, TYMS, TK1 and DCK balance dNTP pools to avoid incorrect insertions of bases (i.e. DNA mismatches), and the DNA mismatch repair system genes MLH1 and MSH2 are involved in removing such mismatches. The objective of this study is to explore the possibility of interactions between these two systems, since greater mismatch production rates are expected to be more detrimental in cells that also have compromised mismatch removal rates. This conjecture was explored here specifically with respect to the development of breast cancer. More than 2400 breast cancer cases and controls are included in the Cancer Genetic Markers of Susceptibility (CGEMS) single nucleotide polymorphism (SNP) dataset. For these individuals, a total of 99 SNPs (69 dNTP supply SNPs and 30 mismatch repair SNPs) and the 69 × 30 = 2070 SNP-SNP interactions between these two groups were evaluated for their effect on breast cancer using logistic regression to compute odds ratios (OR) and corresponding 95% confidence intervals (CI). Of these, 12 SNPs had statistically significant associations with breast cancer (2 SNPs with decreasing risk and 10 SNPs with increasing risk, where about 5 are expected by chance) and 718 SNP-SNP interaction terms were also significantly associated with lower or higher breast cancer risk (where only 2070 × 0.05 ≈ 104 are expected by chance). Thus, our study suggests that mismatches contribute to the formation of breast cancer.
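The counts quoted in this abstract can be checked with a few lines of arithmetic. The sketch below recomputes the number of interaction terms and the numbers expected by chance at a 0.05 significance level, using only figures from the abstract. The binomial tail at the end is an added illustration that assumes the 99 single-SNP tests are independent, which is unlikely to hold exactly for SNPs in the same gene; the variable and function names are hypothetical.

```python
from math import comb

ALPHA = 0.05           # significance level implied by "2070 x 0.05"
N_SUPPLY_SNPS = 69     # dNTP supply SNPs
N_REPAIR_SNPS = 30     # mismatch repair SNPs

n_single = N_SUPPLY_SNPS + N_REPAIR_SNPS   # single-SNP tests
n_pairs = N_SUPPLY_SNPS * N_REPAIR_SNPS    # cross-group interaction terms

expected_single = n_single * ALPHA         # about 5 expected by chance
expected_pairs = n_pairs * ALPHA           # about 104 expected by chance

# Observed counts reported in the abstract.
observed_single = 12
observed_pairs = 718

fold_single = observed_single / expected_single
fold_pairs = observed_pairs / expected_pairs


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p); assumes independent tests."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Chance of seeing 12 or more nominally significant SNPs out of 99
# if every null hypothesis were true and the tests were independent.
tail_single = binomial_tail(n_single, observed_single, ALPHA)
```

Under these assumptions the interaction terms show the larger excess over chance, roughly sevenfold against roughly two and a half fold for single SNPs.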
GQ Zhang, Remo Mueller. A Scalable Parametric-RBAC Architecture for the Propagation of a Multi-Modality, Multi-Resource Informatics System
We present a scalable architecture called X-MIMI for the propagation of MIMI (Multi-modality, Multi-resource, Informatics Infrastructure System) to the biomedical research community. MIMI is a web-based system for managing the latest instruments and resources used by clinical and translational investigators. To deploy MIMI broadly, X-MIMI utilizes a parametric Role-Based Access Control (RBAC) model to decentralize the management of user-role assignment, facilitating deployment and system administration in a flexible manner that minimizes operational overhead. We use Formal Concept Analysis to organize roles according to their permissions, resulting in a lattice hierarchy that dictates the cascades of RBAC authority. Additional components of the architecture are based on the Model-View-Controller pattern in Ruby on Rails. The X-MIMI architecture provides a uniform setup interface for centers and facilities, as well as a set of seamlessly integrated scientific and administrative functionalities in a Web 2.0 environment.
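To make the idea of a permission-based role hierarchy concrete, the sketch below orders roles by containment of their permission sets and keeps only the direct (covering) relations. It shows just the containment ordering that underlies such a lattice, not a full Formal Concept Analysis construction and not the X-MIMI implementation; the role names, permission names and function names are all hypothetical.

```python
def dominance_pairs(role_permissions):
    """Return (senior, junior) pairs where senior's permissions strictly contain junior's."""
    pairs = set()
    for senior, senior_perms in role_permissions.items():
        for junior, junior_perms in role_permissions.items():
            if junior_perms < senior_perms:   # strict subset
                pairs.add((senior, junior))
    return pairs


def covering_pairs(pairs):
    """Drop pairs implied by transitivity, leaving the edges of the hierarchy diagram."""
    roles = {role for pair in pairs for role in pair}
    return {
        (s, j)
        for (s, j) in pairs
        if not any((s, m) in pairs and (m, j) in pairs for m in roles)
    }


# Hypothetical roles for an instrument-scheduling system.
toy_roles = {
    "facility_admin": frozenset({"view", "book", "approve", "configure"}),
    "lab_manager": frozenset({"view", "book", "approve"}),
    "investigator": frozenset({"view", "book"}),
    "auditor": frozenset({"view", "export"}),
}
toy_all = dominance_pairs(toy_roles)
toy_edges = covering_pairs(toy_all)
```

Here the auditor role is incomparable with the other three because none of them holds the export permission, which is the kind of branching a lattice can represent and a simple linear ranking cannot.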
Sadik Khuder and Peter Bazeley. Quantile scores for combining results from different microarray platforms.
Due to the small number of replicates in typical gene microarray experiments, the performance of statistical inference is often unsatisfactory. In this article, we present a scoring scheme, based on quantiles, that allows researchers to combine data from different platforms. We have applied the discrete-continuous normal distribution (DISCO) using the quantile scores on two publicly available data sets. Differentially expressed genes identified by DISCO are comparable to those identified by SAM or Wilcoxon rank test. An algorithm based on DISCO and quantile scores is developed to combine results from Affymetrix and Illumina. Our results indicate that combining microarray data from different platforms is possible and straightforward.
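One way to picture a quantile-based score is as a rank transformation applied within each platform, so that values measured on different scales become directly comparable. The sketch below is only an illustration under that assumption: the abstract does not specify the scoring scheme or the DISCO model, and the function name, gene labels and intensity values here are hypothetical.

```python
def quantile_scores(values):
    """Map each gene's value to an empirical quantile in (0, 1).

    The score is (average rank) / (n + 1), with tied values sharing
    their average rank. `values` maps gene name -> measured intensity.
    """
    n = len(values)
    ordered = sorted(values.items(), key=lambda kv: kv[1])
    scores = {}
    i = 0
    while i < n:
        # Find the end of the run of values tied with ordered[i].
        j = i
        while j + 1 < n and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # 1-based average rank of the run
        for k in range(i, j + 1):
            scores[ordered[k][0]] = avg_rank / (n + 1)
        i = j + 1
    return scores


# Hypothetical measurements of the same genes on two platforms with different scales.
platform_a = {"g1": 5.1, "g2": 9.7, "g3": 7.3}
platform_b = {"g1": 120.0, "g2": 4300.0, "g3": 800.0}

scores_a = quantile_scores(platform_a)
scores_b = quantile_scores(platform_b)
```

Because only the within-platform ordering matters, the two toy platforms yield identical scores even though their raw intensities differ by orders of magnitude.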
Erik Boczko, Todd Young, Minhui Xie and Di Wu. Comparison of Binary Classification Based on Signed Distance Functions with Support Vector Machines
We compare methods based on the Signed Distance Function (SDF), a new tool for binary classification, with standard Support Vector Machine (SVM) methods. We demonstrate on several sets of microarray data that the performance of the SDF-based methods can match or exceed that of SVM methods.


International Society for Computational Biology grants affiliate status to the Ohio Bioinformatics Consortium
Ohio Regional Student Group
Click on the links below for the winners of the poster and paper awards at the Ohio Collaborative Conference on Bioinformatics 2009.
Paper awards.
Poster awards.