|27.04.2016||The RECOMB slides for MetaFlow are available here.|
|29.02.2016||This page has been updated with an overview of MetaFlow.|
|06.02.2016||Version 0.9.2 released. This includes a Python script to easily create a BLAST database against which to align the reads.|
|30.01.2016||The extended manuscript describing MetaFlow is available on bioRxiv.|
|18.12.2015||The paper describing MetaFlow has been accepted for presentation at RECOMB 2016.|
MetaFlow's latest version is 0.9.2, which is freely available on Github
If you use MetaFlow, please cite it as:
Ahmed Sobih, Alexandru I. Tomescu, Veli Mäkinen, MetaFlow: Metagenomic profiling based on whole-genome coverage analysis with min-cost flows, RECOMB 2016 – 20th Annual International Conference on Research in Computational Molecular Biology, Lecture Notes in Computer Science 9649, 111–121, 2016
High-throughput sequencing (HTS) of metagenomes is proving essential in understanding the environment and diseases. State-of-the-art methods for discovering the species and their abundance in an HTS metagenomic sample are based on genome-specific markers, which can lead to skewed results, especially at species level. On synthetic data sets of varying complexity, MetaFlow is more precise and sensitive than popular tools such as MetaPhlAn, mOTU, GSMer and BLAST, and its abundance estimations at species level are two to four times better in terms of L₁-norm. On a real human stool dataset, MetaFlow identifies B.uniformis as most predominant, in line with previous human gut studies, whereas marker-based methods report it as rare.
MetaFlow is the first method based on coverage analysis across entire genomes that also scales to HTS samples. MetaFlow's main objectives in deciding to which reference bacterial genome to map an HTS read are:
- few coverage gaps in the resulting read coverage
- the resulting read coverage should be almost uniform
- agreement between read mapping scores and final mappings
The figure below illustrates the idea behind MetaFlow and previous state-of-the-art methods.
In the yellow box, reads are aligned only to genomic markers, and the relative abundances are highly skewed: marker 1 receives more false mappings (in red) because it is similar with a subsequence of the second genome (gray); marker 2 has a drop in coverage due to a sequencing bias, and it is covered only by three reads (orange); marker 3 is covered also by some reads sequenced from a species not in the reference database (violet). In the green box, reads have whole-genome BLAST alignments, but the relative abundances are still skewed: the tie in the alignment of the red reads is not resolved, and the violet reads from an unknown species aligning to the third genome are not removed. In the blue box, the coverage-sensitive mapping of the reads: the red reads are correctly aligned to the second genome (in the gray sequence) and the violet reads from an unknown species are discarded from the third genome. The abundances are relative to the known genomes.
We formulated this problem as an NP-hard matching problem in a bipartite graph, which we solved in practice by min-cost flows. One set of this bipartition is made up the reads. The other set is made up of the reference bacterial genomes: each genome is split into segments of equal length, and each of these form a node in the set. We add an edge from a read to any of the genome segments to which it aligns. For example:
On the left, the bipartite grpah constructed from the initial read alignments. On the right, randomly choosing one alignment for every read gives a very variable read coverage.
An optimal read mapping is one having few coverage gaps and almost uniform coverage. For example:
On the left, each read has a unique mapping position, in orange. On the right, the resulting coverage is almost uniform.
Our problem formulation is inspired by a paper by Christine Lo, Sangwoo Kim, Shay Zakov, Vineet Bafna concerned about mapping reads to a single reference genome, with the same uniform coverage objective. MetaFlow extends this model to multiple genomes, and also alowing read to come from an unknown genome. The resulting multi-genome problem formulation is now NP-hard. In practice, we solved it with min-cost flows. See our manuscript for details.