93. Multiple Sequence Alignments for Ultra-Large Reference 16S Datasets: Combining a Divide-and-Conquer Framework with RNA Structural Models

Department: Computer Science & Engineering
Faculty Advisor(s): Siavash Mirarab (Mir Arabbaygi)

Primary Student
Name: Uyen To Mai
Email: umai@ucsd.edu
Phone: 858-752-1714
Grad Year: 2021

In microbial research, a common approach is to study microbial samples by combining a reference dataset, such as Greengenes, with phylogenetic methods. A computational method that can construct high-quality multiple sequence alignments (MSAs) and phylogenetic trees for such reference datasets is therefore in high demand, but building one is nontrivial. Previous work on MSA and phylogenetic tree reconstruction has shown that divide-and-conquer methods, such as those used in PASTA and UPP, are scalable and highly accurate in constructing MSAs for ultra-large biological datasets of up to a million sequences. A different approach exploits the fact that RNA secondary structures are available for a large number of small subunit (SSU) sequences: pre-built structural models (such as those in SSU-align) can be used to construct MSAs for SSU sequences. In this project, we aim to develop a method that combines the structural models in SSU-align with the divide-and-conquer framework in PASTA to improve the accuracy of MSAs for ultra-large 16S datasets. Such a method could be further developed into a fully automatic pipeline for building and updating large-scale 16S reference datasets, of which Greengenes, a dataset of hundreds of thousands of environmental microbial samples, is an excellent example.

Industry Application Area(s)
Life Sciences/Medical Devices & Instruments
