2015 Winner: Enabling genotype associations of large insertions and deletions by canonicalizing variants across collections of bacterial strains

Project Information
Enabling genotype associations of large insertions and deletions by canonicalizing variants across collections of bacterial strains
Engineering
BME 195T
Comparing chromosomal variants across multiple genomes allows researchers to study the
evolution of bacterial populations. Unfortunately, this method has been limited to the comparison
of small chromosomal variants, such as single nucleotide variants and small insertions and
deletions (indels), due to the challenges of identifying larger variants with short-read sequencing
technology. Two recent variant-detection tools, Pilon and MindTheGap, can predict indels ≥
50 nt, allowing comparative investigations of large variants across multiple genomes. However,
because sequencing read and assembly quality often vary, predictions of the same large indel in
multiple genomes will often differ from sample to sample, making downstream comparative
analysis difficult.

To address this challenge, I present Emu, an algorithm that can identify alternate
representations of the same predicted large indel and reduce them to a single representation. In a
benchmark analysis, I simulated 200 large indels (ranging from 50 nt to 10322 nt) across
previously published genomes of Mycobacterium tuberculosis (Mtb) and used Pilon and
MindTheGap to identify them. I found that both variant detectors called up to 5751 different
representations of the simulated indels and that Emu reduced their number by up to
96% with high accuracy. I extended my analysis to two real sequencing data sets that included a
previously published collection of 1017 sequenced genomes of Mtb. Applying Emu to raw
variant calls from the 1017 sequenced genomes reduced the number of unique calls by 92%,
demonstrating the prevalence of alternative representations. Using a criterion of positive and
negative predictive values, I found that Emu more than doubled the number of large variants
associated with major Mtb phylogenetic lineages.
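
The positive and negative predictive values mentioned above can be sketched as follows. This is only an assumed formulation: it treats presence of a variant as the "test" and membership in a lineage as the "condition", and the function name `predictive_values` and the example data are hypothetical. The abstract does not state the thresholds or the exact criterion used.

```python
from typing import Optional, Sequence, Tuple


def predictive_values(
    has_variant: Sequence[bool], in_lineage: Sequence[bool]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (PPV, NPV) of a variant as a marker for lineage membership.

    PPV = TP / (TP + FP): fraction of variant carriers that are in the lineage.
    NPV = TN / (TN + FN): fraction of non-carriers that are outside the lineage.
    A value is None when its denominator is zero.
    """
    pairs = list(zip(has_variant, in_lineage))
    tp = sum(1 for v, l in pairs if v and l)
    fp = sum(1 for v, l in pairs if v and not l)
    tn = sum(1 for v, l in pairs if not v and not l)
    fn = sum(1 for v, l in pairs if not v and l)
    ppv = tp / (tp + fp) if (tp + fp) else None
    npv = tn / (tn + fn) if (tn + fn) else None
    return ppv, npv


# Hypothetical data for six strains: variant presence and lineage membership.
has_variant = [True, True, True, False, False, False]
in_lineage = [True, True, False, False, False, True]

ppv, npv = predictive_values(has_variant, in_lineage)
```

Under this formulation, a variant whose alternate representations are counted separately would have its carriers split across several entries, which could lower the apparent association for each one.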

Emu is open source and freely available on GitHub (https://github.com/AbeelLab/emu).
707.pdf
Students
  • Alex Navarro Salazar (Crown)
Mentors