The tool is freely available at https://github.com/algbio/Themisto
A metagenomic sample is a set of sequencing reads from the microbial life living in a particular environment. Standard analysis involves estimating the species composition of the environment by aligning the reads against a reference database. In the age of pangenomics, alignment is preferably done against a variation graph encompassing all variation within a species.
Themisto is a space-efficient tool for indexing such variation graphs. The Themisto index is a compressed colored de Bruijn graph of order k, where each node has a set of colors representing the reference sequences that contain the k-mer corresponding to the node. Reads are pseudoaligned against the index using a method similar to the one used by the tool Kallisto: all k-mers of the read are located in the de Bruijn graph, and the intersection of the color sets of the nodes is returned.
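To make the notion of colors concrete, here is a minimal Python sketch of a colored k-mer map. It is not Themisto's data structure: the real index is a compressed de Bruijn graph, whereas this toy version uses a plain dictionary. The function name `build_color_index`, the example sequences, and the choice to ignore reverse complements are all assumptions made for illustration.

```python
from collections import defaultdict


def build_color_index(references, k):
    """Map each k-mer to the set of reference ids (colors) that contain it.

    Hypothetical helper for illustration only; reverse complements are ignored.
    """
    index = defaultdict(set)
    for color, seq in enumerate(references):
        # Slide a window of length k over the reference sequence.
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(color)
    return dict(index)


# Two toy reference sequences; color 0 and color 1.
refs = ["ACGTA", "CGTAG"]
index = build_color_index(refs, 3)
```

In this toy example the k-mers shared by both references carry both colors, while a k-mer unique to one reference carries a single color.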
The index is constructed in semi-external memory, so the user can limit the amount of RAM available to Themisto. Themisto uses as much of the permitted memory as possible, and resorts to efficient disk-streaming algorithms to keep the amount of data held in memory within the limit. The construction pipeline uses the parallel external memory k-mer counter KMC3, and three external memory merge sorts in total. The flowchart below gives a high-level overview of the construction pipeline.
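For readers unfamiliar with external memory sorting, the general idea can be sketched in Python as follows. This is a generic textbook-style external merge sort, not Themisto's implementation: the function name `external_sort` and the `max_lines_in_memory` parameter are hypothetical, and the memory limit is expressed as a number of lines rather than bytes. It also assumes the input strings contain no newline or other control characters.

```python
import heapq
import os
import tempfile


def external_sort(lines, max_lines_in_memory):
    """Yield the input strings in sorted order while buffering at most
    max_lines_in_memory of them in RAM during the run-creation phase."""
    run_paths = []
    chunk = []

    def flush():
        # Sort the in-memory chunk and write it to disk as one sorted run.
        chunk.sort()
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(line + "\n" for line in chunk)
        run_paths.append(path)
        chunk.clear()

    for line in lines:
        chunk.append(line)
        if len(chunk) >= max_lines_in_memory:
            flush()
    if chunk:
        flush()

    # Stream all sorted runs from disk and merge them lazily.
    files = [open(p) for p in run_paths]
    try:
        for line in heapq.merge(*files):
            yield line.rstrip("\n")
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)


sorted_kmers = list(external_sort(["GTA", "ACG", "TAG", "CGT", "ACG"], 2))
```

Only one chunk is held in memory while the runs are created, and the merge phase reads each run as a stream, which is what keeps memory use bounded.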
For pseudoalignment, the index needs to be loaded into memory as a whole, but if the reference genomes are similar to each other, the size of the index is small. Pseudoalignment matches all k-mers of the query against the index, collects the color sets of the k-mers that were found in the graph, and returns the intersection of those sets.
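The intersection rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Themisto's code: the index is a plain dictionary from k-mer to color set, the function name `pseudoalign` and the toy data are hypothetical, reverse complements are ignored, and a query with no matching k-mers is assumed to yield an empty set.

```python
def pseudoalign(read, index, k):
    """Intersect the color sets of all k-mers of the read found in the index.

    K-mers absent from the index are skipped. If no k-mer is found,
    an empty set is returned (an assumption of this sketch).
    """
    result = None
    for i in range(len(read) - k + 1):
        colors = index.get(read[i:i + k])
        if colors is None:
            continue  # k-mer not present in the graph
        result = set(colors) if result is None else result & colors
    return result if result is not None else set()


# Hypothetical toy index: k-mer -> set of reference ids (colors).
toy_index = {"ACG": {0}, "CGT": {0, 1}, "GTA": {0, 1}, "TAG": {1}}
hits = pseudoalign("CGTAG", toy_index, 3)
```

Here the read shares two k-mers with both references and one k-mer with reference 1 only, so the intersection narrows the result to reference 1.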
If you use the tool, kindly cite us as follows:
Tommi Mäklin, Teemu Kallonen, Jarno Alanko, Veli Mäkinen, Jukka Corander, Antti Honkela. Genomic Epidemiology with Mixed Samples. Supplement: Pseudoalignment in the mGEMS pipeline. Submitted Manuscript.