Algorithms for Biological Sequencing Data

The team led by University Lecturer Leena Salmela focuses on algorithms for genome assembly.

Research

Genome assembly. Determining the genomic sequence of an organism is a fundamental task in molecular biology. Current sequencing technologies are not able to read the whole genome at once but instead produce sets of short reads, i.e. fragments of the genome, which must then be assembled. We have previously worked on several phases of fragment assembly including sequencing error correction, scaffolding, and gap filling. Together with our biological collaborators we have sequenced and assembled the genome of the Glanville fritillary butterfly which is the first large genome sequenced in Finland. Currently we work on integrating long range data such as genetic linkage maps and optical mapping data to read based genome assembly.

Optical mapping data. Optical maps are produced by immobilising ensembles of DNA molecules on a plate and applying a restriction enzyme to cut the DNA molecules at a specific DNA motif. The molecules are then imaged and the cutting sites can be read from the image thus capturing the relative order and size of fragments between the cut sites. Optical mapping data spans longer genomic regions than sequencing reads and can thus complement read based analysis of genomic data. We have developed algorithms for correcting errors in optical mapping data and to integrate optical mapping data to genome assembly.

De Bruijn graphs. The de Bruijn graph is an important data structure for processing data produced by second generation sequencing machines which produce short but accurate sequencing reads. We have used de Bruijn graphs to develop methods for e.g. sequencing error correction and gap filling. Our current projects include development of de Bruijn graphs suitable for third generation sequencing data.

People

  • University Lecturer Leena Salmela
  • Postdoctoral Researcher Diego Díaz-Domínguez
  • PhD Student Miika Leinonen

Alumni

  • Research Assistant Essi Tepponen
  • PhD Riku Walve
  • Research Assistant Silvia Nepšinská
  • Taku Onodera (now assistant professor at University of Tokyo)

Recent Publications

  • D. Díaz-Domínguez, M. Leinonen, and L. Salmela: Space-efficient computation of k-mer dictionaries for large values of k. Algorithms for Molecular Biology, Volume 19, Article number 14, 2024.
    [Article online] [Implementation]
  • M. Leinonen and L. Salmela: SAKE: Strobemer-assisted k-mer extraction. PLoS ONE, Volume 18, Issue 11, e0294415, 2023.
    [Article online] [Implementation]
  • D. Díaz-Domínguez and L. Salmela: Computing all-vs-all MEMs in grammar-compressed text. In Proc. SPIRE 2023, International symposium on String Processing and Information Retrieval (ed. F.M. Nardini, N. Pisanti, and R. Venturini), Lecture Notes in Computer Science 14240, Springer, 2023, 157-170.
    [Article online]
  • J. Ma, M. Cáceres, L. Salmela, V. Mäkinen, A.I. Tomescu: Chaining for accurate alignment of erroneous long reads to acyclic variation graphs. Bioinformatics, Volume 39, Issue 8, btad460, 2023.
    [Article online]
  • D. Díaz-Domínguez, S. Dönges, S. Puglisi, and L. Salmela: Simple runs-bounded FM-index designs are fast. In Proc. SEA 2023, International Symposium on Experimental Algorithms (ed. L. Georgiadis), Leibniz International Proceedings in Informatics (LIPIcs) 265, Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2023, 7:1-7:16.
    [Article online]
  • B. Freire, S. Ladra, J.R. Paramá, and L. Salmela: ViQUF: De novo viral quasispecies reconstruction using unitig-based flow networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 20, 2023, 1550-1562.
    [Article online] [Implementation]
  • D. Díaz-Domínguez, S.J. Puglisi, and L. Salmela: Computing all-vs-all MEMs in run-length encoded collections of HiFi reads. In Proc. SPIRE 2022, International symposium on String Processing and Information Retrieval (ed. D. Arroyuelo and B. Poblete), Lecture Notes in Computer Science 13617, Springer, 2022, 198-213.
    [Article online]
  • M. Leinonen and L. Salmela: Extraction of long k-mers using spaced seeds. IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 19, 2022, 3444–3455.
    [Article online] [Implementation]
  • R. Walve, S.J. Puglisi, and L. Salmela: Space-efficient indexing of spaced seeds for accurate overlap computation of raw optical mapping data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 19, 2022, 2454-2462.
    [Article online] [Implementation]
  • D. Diaz and G. Navarro: Efficient Construction of the BWT for Repetitive Text Using String Compression. In Proc. CPM 2022, Annual Symposium on Combinatorial Pattern Matching (ed. H. Bannai and J. Holub), Leibniz International Proceedings in Informatics (LIPIcs) 223, Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2022, 29:1-29:18.
    [Article online]

Software