21. INFERRING SPARSE MULTIVARIATE MODELS TO PREDICT DISEASE PHENOTYPE FROM GENOTYPE

Department: Bioengineering
Faculty Advisor(s): Trey Ideker

Primary Student
Name: Matan Hofri
Email: mhofree@ucsd.edu
Phone: 858-822-4667
Grad Year: 2014

Abstract
Genome-wide association studies hold great promise for unraveling the genetic basis of complex diseases. However, while moderately successful for a certain subset of diseases, the results delivered by classical statistical methods have fallen short of their clinical goal of building disease predictive models. Several attempts have been made to use supervised learning techniques to predict disease outcome and risk based on genotypes. These studies typically learn classifiers relying on embedding the data in a very high dimensional space (e.g., Support Vector Machines), or highly redundant bootstrapping (e.g., Random Forests), making interpretation of the results challenging. Furthermore, most of these methods pre-filter the genetic variants examined to a subset, using classical statistical approaches, making them a poor alternative for the discovery of new disease associations. In this work, we use AdaBoost, a large-margin classifier, to learn linear models that predict case-control status in two independent cohorts of type I diabetes mellitus, demonstrating state-of-the-art classification performance. We suggest a simple but powerful method for overcoming limitations of AdaBoost that are due to the linkage structure between genetic variants. We demonstrate significant overlap in regions selected by boosting across the two cohorts, including 28 replicated genes which have not been detected through the use of classical statistical tests. Of these gene hits, 13 have been previously implicated in the literature. We show how genes selected by boosting across both cohorts are substantially enriched in type I diabetes pathways. Finally, three such pathways are found enriched using boosting on both cohorts and are either not replicated or not enriched when using p-value based methods. Our results suggest that through the use of large-margin classification algorithms we can discover a landscape of disease associated genes, not identified through other existing methods.
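
As an illustration of why boosting yields a sparse, interpretable model, the following is a minimal sketch of AdaBoost with single-variant decision stumps, written in plain Python. It is not the pipeline used in this work: the cohort is synthetic, the function names (`make_genotypes`, `best_stump`, `adaboost`, `predict`), the causal variant indices, and all parameter values are assumptions made for the example, and it does nothing to account for linkage between variants. Each boosting round adds one stump on one variant, so a model trained for T rounds depends on at most T variants.

```python
import math
import random


def make_genotypes(n_samples, n_snps, causal, rng):
    """Synthetic cohort (assumption): genotypes are minor-allele counts in
    {0, 1, 2}; labels are +1 (case) or -1 (control)."""
    X, y = [], []
    for _ in range(n_samples):
        row = [rng.choice((0, 1, 2)) for _ in range(n_snps)]
        # Additive effect of the causal variants plus Gaussian noise.
        score = sum(row[j] - 1 for j in causal) + rng.gauss(0, 0.5)
        X.append(row)
        y.append(1 if score > 0 else -1)
    return X, y


def stump(x, j, t, s):
    """Predict s if the genotype at variant j exceeds threshold t, else -s."""
    return s if x[j] > t else -s


def best_stump(X, y, w):
    """Return (weighted error, variant, threshold, polarity) of the best stump.
    Assumes the weights w sum to 1."""
    best = None
    for j in range(len(X[0])):
        for t in (0.5, 1.5):
            err = sum(wi for xi, yi, wi in zip(X, y, w) if stump(xi, j, t, 1) != yi)
            # Flipping the polarity turns error err into 1 - err.
            for s, e in ((1, err), (-1, 1.0 - err)):
                if best is None or e < best[0]:
                    best = (e, j, t, s)
    return best


def adaboost(X, y, rounds):
    """Train a weighted sum of stumps; returns a list of (alpha, j, t, s)."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, j, t, s = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, t, s))
        # Up-weight misclassified samples, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump(xi, j, t, s))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, x):
    """Sign of the weighted vote of all stumps."""
    score = sum(alpha * stump(x, j, t, s) for alpha, j, t, s in model)
    return 1 if score >= 0 else -1


rng = random.Random(0)
causal = (3, 17, 25)  # hypothetical causal variant indices
X, y = make_genotypes(400, 30, causal, rng)
model = adaboost(X, y, rounds=10)
selected = sorted({j for _, j, _, _ in model})
accuracy = sum(predict(model, xi) == yi for xi, yi in zip(X, y)) / len(X)
print("variants used by the model:", selected)
print("training accuracy:", round(accuracy, 3))
```

The number of rounds acts as the sparsity control here: the variants listed in `selected` are the only ones the classifier reads, which is what makes the result easier to inspect than a kernel or forest model. The accuracy printed is on the training data only; a real evaluation would use a held-out or independent cohort.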
