Software | Algorithmic bioinformatics | University of Helsinki

LoRDEC

LoRDEC is an alignment-free method for the correction of sequencing errors in long but highly erroneous reads with the help of short and accurate reads. The method is based on building a de Bruijn graph of the short reads and then threading the long reads through this graph. The method achieves similar accuracy as previous methods but is six times faster and uses 93% less memory than previous methods. The efficiency of the method is due to the use of state-of-the-art data structures with high quality implementations. LoRDEC has been used in at least 40 published genome and transcriptomics projects. See the LoRDEC page for more details.

Salmela, Rivals: LoRDEC: accurate and efficient long read error correction. Bioinformatics 30(24):3506-3514, 2014.

MIPUP

Discovering the evolution of a tumor may help identify driver mutations and provide a more comprehensive view on the history of the tumor. Recent studies have tackled this problem using multiple samples sequenced from a tumor, and due to clinical implications, this has attracted great interest. However, such samples usually mix several distinct tumor subclones, which confounds the discovery of the tumor phylogeny.

We study a natural problem formulation requiring to decompose the tumor samples into several subclones with the objective of forming a minimum perfect phylogeny. We propose an Integer Linear Programming formulation for it, and implement it into a method called MIPUP. We tested the ability of MIPUP and of four popular tools LICHeE, AncesTree, CITUP, Treeomics to reconstruct the tumor phylogeny. On simulated data, MIPUP shows up to a 34% improvement under the ancestor-descendant relations metric. On four real datasets, MIPUP’s reconstructions proved to be generally more faithful than those of LICHeE.

MIPUP is available at https://github.com/zhero9/MIPUP.

MIPUP: Minimum perfect unmixed phylogenies for multi-sampled tumors via branchings and ILP, E. Husić, X. Li, A. Hujdurović, M. Mehine, R. Rizzi, V. Mäkinen, M. Milanič, A. I. Tomescu, Bioinformatics, Volume 35, Issue 5, 01 March 2019, Pages 769–777
The minimum conflict-free row split problem revisited, A. Hujdurović, E. Husić, M. Milanič, R. Rizzi, A. I. Tomescu.
Proceedings of the 43rd International Workshop on Graph-Theoretic Concepts in Computer Science (WG 2017), Lecture Notes in Computer Science 10520 (2017) 303-315. (full preprint)
Perfect phylogenies via branchings in acyclic digraphs and a generalization of Dilworth's theorem, A. Hujdurović, E. Husić, M. Milanič, R. Rizzi, A. I. Tomescu, ACM Transactions on Algorithms 14 (2018) Art. 20, 26 pp.

Gap2Seq

Gap filling is the last phase of de novo genome assembly where gaps between consecutive contigs in scaffolds are filled. We present a rigorous formulation of the gap filling problem. Gap2Seq provides an implementation of a pseudopolynomial algorithm for this NP-complete problem. Furthermore, Gap2Seq classifies the bases used to fill the gaps into safe and unsafe ones where the safe bases are present in each possible solution to the gap filling problem. A version of Gap2Seq tailored for insertion genotyping is also available.

L. Salmela and A.I. Tomescu: Safely filling gaps with partial solutions common to all solutions. IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 16, Issue 2, 2019, 617–626.
R. Walve, L. Salmela, and V. Mäkinen: Variant genotyping with gap filling. PLoS ONE, Volume 12, Issue 9, 2017, e0184608.
L. Salmela, K. Sahlin, V. Mäkinen, and A.I. Tomescu: Gap filling as exact path length problem. Journal of Computational Biology, Volume 23, Issue 5, 2016, 347–361.

Metaflow

High-throughput sequencing (HTS) of metagenomes is proving essential in understanding the environment and diseases. State-of-the-art methods for discovering the species and their abundance in an HTS metagenomic sample are based on genome-specific markers, which can lead to skewed results, especially at species level.

Metaflow is an accurate method based on coverage analysis across entire genomes that also scales to HTS samples.

Ahmed Sobih, Alexandru I. Tomescu, Veli Mäkinen. MetaFlow: Metagenomic profiling based on whole-genome coverage analysis with min-cost flows, RECOMB 2016 – 20th Annual International Conference on Research in Computational Molecular Biology, Lecture Notes in Computer Science 9649, 111–121, 2016

LoRMA

The third generation sequencing technologies, such as Pacbio SMRT sequencing, produce long sequencing reads but with an error rate of about 15%. The sequencing errors, which include numerous insertions and deletions in addition to substitutions, complicate downstream analysis of the data such as genome assembly and mapping of the reads. LoRMA extends our previous hybrid approach using both short and long reads to a method using only long reads.

L. Salmela, R. Walve, E. Rivals, and E. Ukkonen: Accurate selfcorrection of errors in long reads using de Bruijn graphs. Bioinformatics, Volume 33, Issue 6, 2017, 799–806.