93. Multiple Sequence Alignments for Ultra-Large Reference 16S Datasets: Combining a Divide-and-Conquer Framework with RNA Structural Models
Name: Uyen To Mai
Grad Year: 2021
In microbial research, a common approach is to study microbial samples by combining a reference dataset, such as Greengenes, with phylogenetic methods. A computational method that can construct high-quality multiple sequence alignments (MSAs) and phylogenetic trees for such reference datasets is therefore in high demand, but building one is nontrivial. Previous work on MSA and phylogenetic tree reconstruction has shown that divide-and-conquer methods, such as those used in PASTA and UPP, are scalable and highly accurate in constructing MSAs for ultra-large biological datasets of up to a million sequences. A different approach exploits the fact that RNA secondary structures are available for a large number of small subunit (SSU) rRNA sequences: pre-built structural models (i.e., SSU-align) can be used to construct MSAs for SSU sequences. In this project, we aim to develop a method that combines the structural models in SSU-align with the divide-and-conquer framework in PASTA to improve the accuracy of MSAs for ultra-large 16S datasets. Such a method could be further developed into a fully automatic pipeline for building and updating large-scale 16S reference datasets, of which Greengenes, a dataset of hundreds of thousands of environmental microbial samples, is an excellent example.
Industry Application Area(s)
Life Sciences/Medical Devices & Instruments