Themes for MSc Thesis and Internship

Below you can find themes for potential MSc thesis or internship topics in our research group.

If you are interested in an MSc thesis with our group, please read these instructions for how to start and complete your thesis in our group.

[MQ2project] VR Exploration of Data Distributions

The goal of the project is to create an application for the VR headset Meta Quest 2 (MQ2; previously known as Oculus Quest 2). The application should visualise a data distribution that is provided as input and let the user interact with the data (for example, through rotation, zooming, and data selection). The supervisor will provide a headset for the needs of the project.

[profileML] Profiling ML Pipelines

ML software often takes the form of pipelines: sequences of operations that may, for example, first collect and process the data, then train models on the data, and finally use those models for some application. ML pipelines are often complex and demand significant computing resources: typical pipelines that involve large amounts of data and complex models may require many computers and several hours or days to complete.

For this reason, it is vital to profile ML pipelines, that is, to obtain an idea about how an ML pipeline will perform before it is executed. For example, profiling may estimate the time it will take for the pipeline to complete its execution and the expected result quality and detect whether limited resources (e.g., insufficient computer memory) will prevent the application from achieving the desired performance. Profiling is crucial because it identifies problems before they occur, thus saving time and costs.

Under this theme, you will examine traditional and state-of-the-art profiling techniques. Specifically, you will explore the performance of novel learned profiling techniques, which attempt to predict an ML application's performance based on other ML applications' past performance. You will compare these techniques to traditional approaches that profile an ML application through limited and controlled execution.
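As a rough illustration of the traditional approach (profiling through limited and controlled execution), the sketch below times a pipeline on a few small inputs, fits a straight line to the measurements, and extrapolates to the full input size. Everything in it is an assumption made for illustration: the function names, the `toy_pipeline` stand-in, and above all the premise that running time grows linearly with input size, which a real ML pipeline may well violate.

```python
import time
from typing import Callable, List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def measure_runtimes(pipeline: Callable[[List[int]], object],
                     sizes: Sequence[int]) -> List[float]:
    """Run the pipeline on small inputs and record wall-clock seconds."""
    runtimes = []
    for size in sizes:
        data = list(range(size))  # hypothetical input of the given size
        start = time.perf_counter()
        pipeline(data)
        runtimes.append(time.perf_counter() - start)
    return runtimes


def estimate_runtime(sizes: Sequence[float], runtimes: Sequence[float],
                     full_size: float) -> float:
    """Extrapolate the fitted line to the full input size (linearity assumed)."""
    slope, intercept = fit_line(sizes, runtimes)
    return slope * full_size + intercept


def toy_pipeline(data: List[int]) -> int:
    """Hypothetical stand-in for a real pipeline: one linear pass over the data."""
    return sum(x * x for x in data)


if __name__ == "__main__":
    sizes = [10_000, 20_000, 40_000]
    runtimes = measure_runtimes(toy_pipeline, sizes)
    print("estimated seconds for 1M items:",
          estimate_runtime(sizes, runtimes, 1_000_000))
```

A learned profiler would replace the fitted line with a model trained on the recorded performance of other pipelines, so that a prediction can be made without running the new pipeline at all.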

Background articles:

Phothilimthana, M. et al. (2023). TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs. NeurIPS 2023. [link]
- See also this Kaggle competition: https://www.kaggle.com/competitions/predict-ai-model-runtime.
Kaufman, S. J., et al. (2021). A Learned Performance Model for Tensor Processing Units. MLSys 2021. [link]

[e2eML] Efficient End-to-End Machine Learning

Many researchers and practitioners target the predictive performance of Machine Learning (ML) models, e.g., they aim to develop ML models that have high accuracy on certain predictive tasks. By contrast, within this theme, you will work on aspects of computational efficiency, i.e., study and potentially improve the running time of ML pipelines. To do that, you will consider the various stages of ML pipelines end-to-end (e.g., data acquisition, data sampling and processing, model training and validation, model deployment and retraining) and study how the individual and joint efficiency of these stages can be improved. For example, your thesis may study how big a data sample is necessary and sufficient for ML training to reach high accuracy in a short time, or how often an ML model should be updated in order to maintain good accuracy without wasteful or redundant training.
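The sample-size question can be made concrete with a small learning-curve experiment. The sketch below is only an illustration under stated assumptions: the data is synthetic (two 1-D Gaussian classes), the model is a simple nearest-centroid threshold, and the `tolerance` parameter and all function names are hypothetical choices, not part of any method from the articles listed under this theme.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, int]  # (feature value, class label)


def make_data(n: int, rng: random.Random) -> List[Point]:
    """Synthetic 1-D data: class 0 ~ N(0, 1), class 1 ~ N(2, 1).
    Labels alternate, so any prefix of length >= 2 contains both classes."""
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]


def train_threshold(sample: Sequence[Point]) -> float:
    """Nearest-centroid classifier in 1-D: threshold at the midpoint of the class means."""
    means = []
    for label in (0, 1):
        values = [x for x, y in sample if y == label]
        means.append(sum(values) / len(values))
    return (means[0] + means[1]) / 2.0


def accuracy(threshold: float, data: Sequence[Point]) -> float:
    """Fraction of points for which 'x > threshold' agrees with the label."""
    correct = sum((x > threshold) == (y == 1) for x, y in data)
    return correct / len(data)


def smallest_sufficient_sample(train: Sequence[Point], test: Sequence[Point],
                               sizes: Sequence[int], tolerance: float = 0.01):
    """Return the first sample size whose test accuracy is within `tolerance`
    of the accuracy reached with the largest size, plus the whole curve."""
    curve = [(k, accuracy(train_threshold(train[:k]), test)) for k in sizes]
    target = curve[-1][1]
    for k, acc in curve:
        if acc >= target - tolerance:
            return k, acc, curve


if __name__ == "__main__":
    rng = random.Random(0)
    train, test = make_data(2000, rng), make_data(1000, rng)
    k, acc, curve = smallest_sufficient_sample(train, test, [10, 50, 200, 1000, 2000])
    print("learning curve:", curve)
    print("smallest sufficient sample:", k, "accuracy:", acc)
```

In a thesis, the interesting part is the cost side of this trade-off: each point on the curve also has a training time, and the goal is to stop as early as the accuracy allows.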

Background articles:

Mahadevan, A., & Mathioudakis, M. (2023). Cost-Effective Retraining of Machine Learning Models. https://arxiv.org/abs/2310.04216
Xin, D., Miao, H., Parameswaran, A., & Polyzotis, N. (2021). Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21) (pp. 2639–2652). ACM. https://doi.org/10.1145/3448016.3457566
Mahadevan, A., & Mathioudakis, M. (2022). Certifiable Unlearning Pipelines for Logistic Regression: An Experimental Study. Machine Learning and Knowledge Extraction, 4(3), 591–620. https://doi.org/10.3390/make4030028
Wang, Y., Fabbri, F., & Mathioudakis, M. (2021). Fair and Representative Subset Selection from Data Streams. In Proceedings of the Web Conference 2021 (WWW '21) (pp. 1340–1350). ACM. https://doi.org/10.1145/3442381.3449799
Wang, Y., Mathioudakis, M., Li, Y., & Tan, K-L. (2021). Minimum Coresets for Maxima Representation of Multidimensional Data. In Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS '21) (pp. 138–152). ACM. https://doi.org/10.1145/3452021.3458322

[biggraph] Massive-Scale Graph Processing

Within this theme, you will focus on the scalability of graph processing tasks for large graphs. One direction is to consider common graph-processing tasks (e.g., computation of connected components, clusters, distances, or embeddings) and study how existing state-of-the-art algorithms for these tasks perform on massive graphs (i.e., with more than 100 million nodes or edges). Another direction is to consider a specific massive graph (e.g., a dataset that you are interested in or that is available in the research group) and study how to perform tasks efficiently on this particular graph.
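To give a flavour of one of the tasks mentioned above, the following sketch computes connected components with a union-find (disjoint-set) structure. It is a textbook in-memory baseline, not a method from the articles below, and it assumes that nodes are numbered 0 to `num_nodes - 1`. For graphs with hundreds of millions of edges, a plain Python list per node is unlikely to be the right representation, which is exactly the kind of scalability question this theme is about.

```python
from typing import Dict, Iterable, List, Tuple


def connected_components(num_nodes: int,
                         edges: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Connected components of an undirected graph via union-find.
    Edges are consumed one at a time, so they can be streamed from disk."""
    parent = list(range(num_nodes))
    size = [1] * num_nodes

    def find(u: int) -> int:
        # Path halving: point every other node on the path to its grandparent.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue
        # Union by size: attach the smaller tree under the larger one.
        if size[root_u] < size[root_v]:
            root_u, root_v = root_v, root_u
        parent[root_v] = root_u
        size[root_u] += size[root_v]

    groups: Dict[int, List[int]] = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


if __name__ == "__main__":
    # Six nodes; node 5 is isolated.
    print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))
```

Only the `parent` and `size` arrays need to stay in memory, which is why this approach is a common starting point before moving to distributed or external-memory algorithms.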

Background articles:

Merchant, A., Mathioudakis, M., & Wang, Y. (2023). Graph Summarization via Node Grouping: A Spectral Algorithm. In The 16th ACM International Conference On Web Search And Data Mining. [link]
Merchant, A., Gionis, A., & Mathioudakis, M. (2022). Succinct Graph Representations as Distance Oracles: An Experimental Evaluation. Proceedings of the VLDB Endowment, 15(11), 2297–2306. [link]
Merchant, A., Mahadevan, A., & Mathioudakis, M. (2022). Scalably Using Node Attributes and Graph Structure for Node Classification. Entropy, 24(7), 511-522. [906]. [link]

[textreuse] Network Analysis of Text Reuse in Historical Corpora

This theme is related to the new Academy project High Performance Computing for the Detection and Analysis of Historical Discourses (HPC-HD), which aims to detect discourses in large historical corpora of the eighteenth century (e.g., books, pamphlets, newspapers). Historians and computer scientists in the project have identified a large number of cases of text reuse between documents in these corpora. Text reuse may occur for various reasons, for example because one author quotes and discusses an opinion written by another author, or because an author reuses text from a previously published text of their own. For this thesis, you will analyze the text-reuse network. One possible task within the thesis is to identify clusters of text reuse (groups of documents that heavily reuse text from each other) or, more appropriately, ideas and lines of thought that have evolved over time, as evidenced by text reuse (when author B discusses text by author A, and later an author C comments on the text written by author B, and so on). One concrete thesis topic is to apply the method by Shahaf et al. (2012) to the text-reuse data.
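The idea of a line of thought evolving over time (author A, then B, then C) can be illustrated with a toy computation: orient every reuse pair from the earlier document to the later one and look for the longest chain. The sketch below is a deliberately simple stand-in and is not the metro-map method of Shahaf et al. (2012); the document identifiers, publication years, and reuse pairs are hypothetical, and pairs of documents from the same year are ignored because their direction cannot be decided from the year alone.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def longest_reuse_chain(years: Dict[Hashable, int],
                        reuse_pairs: Iterable[Tuple[Hashable, Hashable]]) -> List[Hashable]:
    """Longest chain of documents in which each one reuses text from the previous one.
    `years` maps a document id to its publication year; `reuse_pairs` is unordered."""
    successors: Dict[Hashable, List[Hashable]] = {doc: [] for doc in years}
    for a, b in reuse_pairs:
        if years[a] < years[b]:
            successors[a].append(b)
        elif years[b] < years[a]:
            successors[b].append(a)
        # Same year: direction unknown, pair skipped (assumption of this sketch).

    # Every edge points to a strictly later year, so visiting documents from
    # latest to earliest guarantees that successors are solved first.
    best: Dict[Hashable, List[Hashable]] = {}
    for doc in sorted(years, key=years.get, reverse=True):
        tail = max((best[nxt] for nxt in successors[doc]), key=len, default=[])
        best[doc] = [doc] + tail
    return max(best.values(), key=len, default=[])


if __name__ == "__main__":
    years = {"A": 1701, "B": 1710, "C": 1725, "D": 1705}
    pairs = [("A", "B"), ("C", "B"), ("D", "C")]
    print(longest_reuse_chain(years, pairs))  # ['A', 'B', 'C']
```

On the real data, the amount of reused text would likely serve as an edge weight, and clusters of mutual reuse would be found first so that chains are traced within a coherent topic.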

Background articles:

Mahadevan, A., Mathioudakis, M., Mäkelä, E., & Tolonen, M. (2024). Optimizing a Data Science System for Text Reuse Analysis. https://arxiv.org/abs/2401.07290
Zhang, J., Ryan, Y. C., Rastas, I., Ginter, F., Tolonen, M., & Babbar, R. (2022). Detecting Sequential Genre Change in Eighteenth-Century Texts. In F. Karsdorp, A. Lassche, & K. Nielbo (Eds.), Proceedings of the Computational Humanities Research Conference 2022 (pp. 243-255). (CEUR Workshop Proceedings; Vol. 3290). CEUR-WS.org. https://helda.helsinki.fi/handle/10138/351519
Shahaf, D., Guestrin, C., & Horvitz, E. (2012). Metro Maps of Science. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '12) (pp. 1122–1130). ACM. https://doi.org/10.1145/2339530.2339706

Completed Theses

See here: /en/researchgroups/algorithmic-data-science/people#section-98775.