Topics for Master's Thesis

Below you can find topics that are currently available in the Algorithmic Data Science group.

IMPORTANT Please make sure to read these instructions for how to start and complete your thesis in our group.

Topics

Topic 1: Cost-Optimal Machine Learning on the Cloud

Today, many companies and other organizations perform their Machine Learning (ML) operations on cloud-computing platforms, such as Amazon AWS, Microsoft Azure, and Google Cloud. Such "cloud-ML" platforms typically offer multiple options for hardware with varying levels of processing power, memory, and storage capacity to be rented at different prices.

Naturally, users of such platforms are interested not only in optimizing the quality of the ML operations (e.g., training accurate ML models efficiently) but also minimizing the cloud-computing costs.

Possible thesis contributions:

  • Conduct a survey of companies and other users of cloud-ML platforms, in Finland and elsewhere, to explore how they use the platforms for ML.
  • For typical ML operations (e.g., training or updating ML models on a given dataset, or deploying these models for inference), consider different hardware configurations and determine which ones lead to the minimal cost. For example, consider the operation of training an ML model over a given dataset. In this case, a computing node with lower processing power may be associated with a lower rental price per hour, but it might require longer time to train the model --- thus potentially leading to higher rental cost than a computing node than is more expensive per hour but also provides faster ML training.

Related literature:
Doris Xin, Hui Miao, Aditya Parameswaran, Neoklis Polyzotis. Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities. In SIGMOD 2021. Link on ACM.
Viktor Leis, Maximilian Kuschewski. Towards Cost-Optimal Query Processing in the Cloud. In VLDB 2021. Link on vldb.org.

Topic 2: Materialization of Bayesian Networks

Variable Elimination is a fundamental algorithm for probabilistic inference over Bayesian Networks (BNs). In recent work, we have developed a materialization scheme for BN models, such that the expected running time of Variable Elimination over a given query workload is optimal under given space constraints.

Current limitations and possible thesis contributions:

  • The current implementation assumes that the Bayesian network remains fixed over time. If even a small part of the BN gets updated, then the materialization has to be performed again from scratch for the entire BN. Thesis contribution: Adapt the current implementation to enable incremental updates to the BN materialization.
  • The current implementation uses the entire Bayesian network for answering any possible inference queries. In fact, however, part of the Bayesian network may be redundant, depending on the inference query. To take such redundancy into account, we have devised a materialization scheme to take into account redundancy. Thesis contribution: Implement the redundancy-aware materialization scheme.

Existing implementation:
Query the Model (QtM) https://github.com/aslayci/qtm

Reference:

Aslay, Cigdem, Martino Ciaperoni, Aristides Gionis, and Michael Mathioudakis. "Workload-aware materialization for efficient variable elimination on Bayesian networks." In Proceedings of the IEEE International Conference on Data Engineering, 2021. Extended version at https://arxiv.org/abs/1904.00079.

Topic 3: Node Classification on Large Networks

Semi-supervised node classification is a common task in network analysis. Given the labels for some of the nodes in a network, we aim to predict the labels for the remaining nodes. In our on-going work (Merchant and Mathioudakis, 2020), we have developed a novel approach for this task that is based on network embeddings. However, computing the embedding of a given network is a bottleneck for this approach.

Thesis contribution: Enhance the existing implementation of the proposed approach to make it scalable for larger networks, e.g., via parallelization, or state-of-the-art techniques for computation of network embeddings (Akyildiz et.al., 2020; Zhao et.al., 2020).

Existing implementation:
JANE https://version.helsinki.fi/ads/jane.

References:

Merchant, Arpit, and Michael Mathioudakis. "Joint Use of Node Attributes and Proximity for Semi-Supervised Classification on Graphs". International Conference on Complex Networks. 2021. Springer-Verlag (to appear). Pre-print: https://arxiv.org/abs/2010.11536

Akyildiz, Taha Atahan, Amro Alabsi Aljundi, and Kamer Kaya. 2020. GOSH: Embedding Big Graphs on Small Hardware. In 49th International Conference on Parallel Processing - ICPP (ICPP '20). Association for Computing Machinery, New York, NY, USA, Article 4, 1–11. DOI:https://doi.org/10.1145/3404397.3404456

Zhao, Zhiqiang, and Ying Zhang, and Zhuo Feng. 2021. Towards Scalable Spectral Embedding and Data Visualization via Spectral Coarsening. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM '21). Association for Computing Machinery, New York, NY, USA, 869–877. DOI: https://doi.org/10.1145/3437963.3441767

Completed Theses

See here: https://www2.helsinki.fi/en/researchgroups/algorithmic-data-science/people#section-98775.