Network-based Information Integration for Protein Function Prediction
Author: Xiaoyu Jiang
Total Pages: 182
Release: 2009
Abstract: Protein function prediction is a fundamental problem in computational biology. For protein activities described by terms in databases such as the Gene Ontology (GO), this task is typically pursued as a binary classification problem. With the rapid growth of genome-wide protein information, integrating different protein datasets has become a significant opportunity and a major focus for inferring function. This dissertation presents three novel approaches that integrate widely used protein information to classify proteins into functional categories. The first is a probabilistic method, Hierarchical Binomial-Neighborhood (HBN), which combines proteins' relational information from the protein-protein interaction (PPI) network with the GO hierarchical structure. Comparisons with analogous models on terms from the biological process ontology and genes from the yeast genome show substantial improvement, and further analysis shows that the improvement holds consistently across GO depths. Because gene interaction knowledge is still incomplete in most organisms, the second approach is an aggressively integrative probabilistic framework, Probabilistic Hierarchical Inferences for Protein Activity (PHIPA), with improved data usage efficiency, that combines the protein relational network, categorical motif and cellular localization information, and the GO hierarchy. We implement it on a network extracted from STRING (Search Tool for the Retrieval of Interacting Genes/Proteins), an integrative protein-protein association database. Because they rest on the nearest-neighbor, or "guilt-by-association", counting principle, both HBN and PHIPA use only local neighborhood information and are therefore built on local probabilistic models.
In contrast, the third approach is a fully Bayesian, network-based auto-probit framework that encodes the functional similarity influenced by the network topology. We show not only that the auto-probit model predicts as well as the "local" methods, but also that it can produce more potentially interesting protein predictions by taking advantage of GO annotation uncertainty, which is critical for using and improving the GO database yet has been ignored by most existing methodologies in this context.
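
As a rough illustration of the "guilt-by-association" counting principle mentioned in the abstract, the sketch below scores a protein for a single GO term by the fraction of its annotated network neighbors that carry that term. It is not the dissertation's HBN or PHIPA model: the function name `neighbor_fraction`, the toy network, the annotations, and the 0.5 decision threshold are all assumptions made up for illustration.

```python
from collections import defaultdict


def neighbor_fraction(edges, annotated, protein):
    """Fraction of a protein's annotated neighbors that carry the GO term.

    edges     -- iterable of (protein_a, protein_b) undirected interactions
    annotated -- dict mapping protein -> True/False for one GO term
    protein   -- the protein to score
    Returns None when the protein has no annotated neighbors.
    """
    # Build an undirected adjacency map from the edge list.
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Only neighbors with a known annotation contribute to the count.
    labeled = [n for n in neighbors[protein] if n in annotated]
    if not labeled:
        return None
    return sum(annotated[n] for n in labeled) / len(labeled)


# Hypothetical toy network and annotations (not real data).
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
annotated = {"P2": True, "P3": True, "P4": False}

score = neighbor_fraction(edges, annotated, "P1")
# Binary call using an assumed threshold of 0.5.
predicted = score is not None and score >= 0.5
```

Here two of the three annotated neighbors of P1 carry the term, so P1 is called positive under the assumed threshold. A purely local score like this uses no information beyond the immediate neighborhood, which is the limitation the abstract contrasts with the network-wide auto-probit framework.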