Luigi's guide to writing Master's theses (in Data Science)

This page contains a set of personal guidelines, suggestions and advice on how to write a Master's thesis. This page is not specifically on how to do research during the project (although I might write another guide at some point).

Luigi Acerbi, University of Helsinki, Finland
Last edited: 15 Mar 2024 (added section at the end on use of Large Language Models)

These recommendations are aimed primarily at my students from the Master's Programme in Data Science at the University of Helsinki, but many points are likely to apply to related programmes and other institutions. In fact, most of this guide generalizes to scientific and academic writing more broadly (e.g., articles, PhD theses).

Disclaimer: There are loads of better materials elsewhere online; this page is mostly a collection of advice I realized I was repeating to multiple students, so I thought to put it in writing in a single place. Many of these points are not absolute rules, but my own sometimes-idiosyncratic opinions and personal recommendations: always double-check with your thesis advisor.

Before you start:

  • Programme instructions: Read carefully all the instructions about Master's theses provided by the Programme, which you can find at this link (select your programme from the menu).
  • Planning and deadlines: Writing the Master's thesis will take at the very minimum a full month, probably more, so plan accordingly. Be aware of the deadlines for submitting your thesis for review, and bring this up well in advance with your supervisor if you want to graduate at a given time. Generally, your reviewers will need at least a month to read and review the thesis, and the second reviewer (who is typically not directly involved in the thesis work) will need to be warned in advance. On top of that, there is the extra time needed to submit the thesis to the steering committee for approval. See the programme instructions above.
  • Start writing before you start writing: Even before officially starting the writing period, write down what you do (e.g., in a scrap LaTeX document). Write partial results, derivations, notes, etc. - no need to be particularly organized at this stage. Having bits and pieces of your work already written down will make your life much easier later.
  • Look up other theses: Nothing better than learning by example. Since almost all University of Helsinki theses are published in the Helka database, it is easy to look at other theses. For example, this query will display all theses from the Master's programme in Data Science.

Workflow:

  • Supervisor feedback: When writing the thesis it is wise to finish at least one or two chapters early on (e.g., Introduction and Background), and share them with your supervisor so that they can point out major issues before they are all over the thesis (citations, grammar, structure...). To make the most out of this step, the shared chapters should be in decent shape, that is, as close as possible to how you would consider them "finished". This way, the supervisor can give real feedback on your writing, as opposed to commenting on a clearly unfinished and unpolished draft, which is not so useful. More generally, agree early on with your thesis supervisor how the feedback on the thesis will work.
  • Meetings: The meetings are not for you to prove to your supervisor that you have done something. Your supervisor is there to help you. The meetings are for your benefit. To make the most of the meetings, always try to bring something that you can discuss or get feedback on.
  • Meeting notes: After each meeting with your supervisor, especially if you do not meet that often, it is a very good habit to write a brief recap as a bullet point list. Briefly summarize comments/clarifications about the work discussed during the meeting, and list the action points about the things you need to work on next. Send the recap to your supervisor (via email or on Slack). This is a very good way to keep track of progress and potentially clarify misunderstandings.

General thesis:

  • Layout: If you are in the hard or exact sciences, do not even think of writing your thesis in anything but LaTeX. Use the LaTeX template provided by the University of Helsinki (or your institution). If you are an MSc Data Science student, you can find the template in the Moodle course Data Science MSc Thesis. If you are new to LaTeX, Overleaf is an easy way to get started. The University of Helsinki has a premium license for Overleaf, accessible by signing in with your institution credentials.
  • Length: While there is no set requirement for length, a typical Master's thesis will be between 40 and 60 pages. This is just a broad guideline; of course, nobody will complain if you solve the Riemann hypothesis in 10 pages. Longer theses are also possible, but think whether all the content is needed; if you really think that everything is essential, at least consider moving some of it to one or more Appendices.
  • Structure: While titles can change, a typical thesis will have: a Summary/Abstract; an Introduction chapter; a Preliminaries/Background chapter covering the literature review with background theory and tools used in your thesis (cover only what you need for your work, no need to write a full textbook or to show off knowledge of unrelated topics); likely a Methods chapter explaining in more detail what you actually did in the thesis (e.g., describe your model(s), your data, your theory); a Results chapter showing your method applied to the data; and a final Discussion chapter summarizing the thesis and conclusions. If needed, you could also have an Appendix for extra material. Of course, these are just generic guidelines - depending on your thesis work, you might have two chapters with results, or no methods chapter, etc.
  • Consistency: The Master's thesis is a unified scholarly work, so pay particular attention to consistency of notation, figures, tables, naming conventions, etc. across sections and chapters (see below for more examples).

Content:

  • Level of detail: Generally speaking, the thesis is about reporting what you did in a scientific way. Finding the right level of detail can be tricky, but try to be both informative and brief. You do not need to write every single detail - the thesis is not a diary of what you did. On the other hand you need to provide enough information so that the reader can figure out what you actually worked on and obtained.
  • Target readership: The ideal target reader for the thesis is a peer Master's student, i.e. someone from your programme who may have taken a few different courses from you and ended up working on a completely different project for the thesis. So, when writing the thesis, think carefully about what you can take for granted (e.g., you can safely assume that the reader knows what a real number is, what NumPy is, but also what K-means is), and what you may have to explain (e.g., you might have to at least write a paragraph or short section on what a Gaussian process is). As a rule of thumb, anything that you did not know before starting the thesis should be explained.
  • Negative results: Most explorations and attempts in science do not work, and the same is true for data science and machine learning. Luckily, the Master's thesis is not a NeurIPS submission, so there can be plenty of merit in exploring and reporting "negative" results. It is totally fine and in fact quite normal to report negative results in a Master's thesis (i.e., things which did not work as planned), but try to keep a scientific approach. If the proposed method did not work, can you explain (with evidence) or at least hypothesize why it didn't work? Would you have a proposal on what could be done to fix it, had there been more time?

Equations:

  • Text format: Ensure that function names are not written in italic, e.g. in LaTeX use "\exp" and "\log" and not "exp" and "log". Similarly, use "\text{}" as needed for textual elements that are not variables. For example, use $\hat{\theta}_\text{MAP}$ to denote the maximum-a-posteriori estimate, as opposed to $\hat{\theta}_{MAP}$. We do not want "MAP" to be in italic, since it is not a variable name.
  • Notation: Even for relatively common notation, explain the notation you are using (unless it's truly basic). This is a must when there might be similar notations out there. For example, "$\mathcal{N}\left(x; \mu, \sigma^2\right)$ denotes the probability density function of a normal distribution with mean $\mu$ and variance $\sigma^2$."
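
As an illustration of the two points above, here is a minimal LaTeX sketch. It assumes the amsmath package is available; the \argmax operator and the symbol $\mathcal{D}$ for the data are choices made for this example only, not notation prescribed by this guide.

```latex
\documentclass{article}
\usepackage{amsmath} % provides \text and \DeclareMathOperator

% Custom upright operator (illustrative; define whichever operators you need)
\DeclareMathOperator*{\argmax}{arg\,max}

\begin{document}

% Introduce the notation before using it.
We write $\mathcal{N}\left(x; \mu, \sigma^2\right)$ for the probability density
function of a normal distribution with mean $\mu$ and variance $\sigma^2$.

% \log and \argmax are typeset upright; "MAP" is text, not a variable.
The maximum-a-posteriori estimate is
\begin{equation}
  \hat{\theta}_{\text{MAP}}
    = \argmax_{\theta} \left[ \log p(\mathcal{D} \mid \theta) + \log p(\theta) \right],
\end{equation}
where $\mathcal{D}$ denotes the data.

\end{document}
```

Defining custom operators with \DeclareMathOperator gives them the same upright font and spacing as built-in functions such as \exp and \log.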

Figures:

  • Basic presentation: Check that you have labelled all the axes, you have a legend if needed, the figure is readable (e.g., font size is large enough). Figures should look pretty.
  • Captions: Figure captions should be brief but informative. Describe briefly what the axes are, what is being represented in the figure (or in each panel). For example, if there is a color map, what's the color map representing?
    Describe what the reader should look at and give a brief takeaway message if possible, i.e., why is this figure here? What does it show that is of interest?
  • Link from the text: Figures are somewhat independent from the text in that they should mostly read stand-alone. However, figures should always be referred to from the text, ideally just before or just after the figure (e.g., "As shown in Figure 1, [...]").
  • Consistency: Check for consistency across figures. For example, font size, color map, naming of axes, ordering of variables, etc. should be as consistent as possible for figures in the same work (here, in the same thesis). Additional consistency, if possible without harming presentation, is a bonus (e.g., consistent axis limits across figures).
  • License: You are allowed to include figures which are in the public domain or for example with a CC-BY 4.0 license. Just be sure you are specifying somewhere (e.g., in the caption) the source and its license. In most cases, be mindful to modify the figure for your purposes, do not just copy-paste it (e.g., you might not need all the details from the original figure, or you might have to modify something to keep consistency with the rest of your thesis).
  • Format: If you can, try to render figures as vector-based graphics such as pdf or svg, rather than as bitmap (png or jpg), to make them sharper and often smaller in file size.
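
To make the points on captions, linking from the text, and format concrete, here is a minimal LaTeX sketch. The file name example-image-a is only a placeholder (it should be available in standard TeX distributions through the mwe package); replace it with your own vector graphic. The label fig:example and the caption wording are likewise illustrative.

```latex
\documentclass{article}
\usepackage{graphicx} % provides \includegraphics

\begin{document}

% Refer to the figure from the text, close to where it appears.
As shown in Figure~\ref{fig:example}, [...].

\begin{figure}[t]
  \centering
  % Placeholder file; replace with your own vector graphic, e.g. a PDF export.
  \includegraphics[width=0.8\linewidth]{example-image-a}
  \caption{Short title of the figure. Describe the axes, what each line or
    color represents, and close with a one-sentence takeaway message.}
  \label{fig:example}
\end{figure}

\end{document}
```

Using \label and \ref instead of hard-coded figure numbers keeps the references correct when figures are added or reordered; LaTeX typically needs two compilation passes to resolve them.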

Tables:

  • General comments: What is written above for Figures generally applies to Tables too. Make sure that the tables are well-presented, readable, the caption is explanatory, the layout is consistent, etc.
  • Figure or Table?: Think if the content of a table could be better conveyed using a figure (and vice versa: a very cluttered figure could become a neat table).

References:

  • How many: The Master's thesis is a piece of scholarly work, so we expect to see appropriate citations to the literature (especially in the introductory and preliminary parts, but also later). Again, there is no set requirement for number of citations, but if your thesis cites fewer than ten articles / conference papers / books you could probably do a bit more literature review, or be more mindful in citing the papers related to the methods you are using.
  • Format: There are many different bibliography options to choose from in LaTeX. For a thesis, I recommend against the default number-only citation format, which is non-informative and hard to parse for humans (e.g., what's citation [31]?). Instead, use a "(Author(s), Year)" citation format, or alternatively the alphanumeric [authors' initials + year] format, e.g. "[ABC22]". One or the other might be more popular, depending on the community. If you use the author + year format, be sure to appropriately use \citep{} or \citet{} depending on the context.
  • Bibliography check: Double and triple-check your formatted bibliography as generated by LaTeX. You will likely be using BibTeX in LaTeX for your bibliography. Check that your .bib files are correct and that the references appear correctly in the bibliography. It is very common for .bib entries taken from e.g. Google Scholar to have missing parts (name of the journal, page number, even authors), or to refer to an earlier arXiv preprint while the paper has in the meantime been published in a journal or conference. It is your job as a scholar to ensure the up-to-date validity and correctness of your bibliography. So be sure to read through the generated PDF to at least spot glaring omissions, and put effort into polishing the bibliography.
  • Capitalization: As a subset of the above check, you want to capitalize words in article titles correctly (e.g., "Bayesian" should be capitalized, not "bayesian"). In a .bib entry, you can force BibTeX to keep the capitalization by using curly brackets around letters. For example, you can write "title={{V}ariational {B}ayesian {M}onte {C}arlo}" to ensure proper capitalization.
  • Reference management software: To keep track of references used during the thesis work and writing, it might be useful to use some reference management software (beyond a .bib file), such as Zotero or Paperpile.
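
The points above on citation format and capitalization can be sketched in a small self-contained LaTeX file. This is only an illustration under a few assumptions: it uses the natbib package with the plainnat style (your thesis template may prescribe a different setup), the file name refs.bib and the citation key are arbitrary, and the [overwrite] option of filecontents requires a reasonably recent LaTeX release. The page numbers are deliberately left out of the entry and should be filled in from the publisher's record.

```latex
% Write a small .bib file next to this document (needs a recent LaTeX kernel).
\begin{filecontents}[overwrite]{refs.bib}
@inproceedings{acerbi2018variational,
  author    = {Acerbi, Luigi},
  title     = {{V}ariational {B}ayesian {M}onte {C}arlo},
  booktitle = {Advances in Neural Information Processing Systems},
  volume    = {31},
  year      = {2018}
}
\end{filecontents}

\documentclass{article}
\usepackage[round]{natbib}    % author-year citations with round brackets
\bibliographystyle{plainnat}

\begin{document}

% \citet: the authors are part of the sentence, e.g. "Acerbi (2018)".
\citet{acerbi2018variational} proposed a method called Variational Bayesian
Monte Carlo.

% \citep: the whole citation is parenthetical, e.g. "(Acerbi, 2018)".
Variational Bayesian Monte Carlo is one such method
\citep{acerbi2018variational}.

\bibliography{refs}
\end{document}
```

Compile with LaTeX, then BibTeX, then LaTeX twice, and read the generated bibliography to confirm that the title keeps its capital letters.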

Miscellanea:

  • Footnotes: Unless stated otherwise, you can have footnotes in your thesis, which can be a good way to add side information without cluttering the main text. Use them sparingly and wisely.
  • Spell-checking: Wherever you write your .tex files, you should find a way to run a spell-checker at least at the end, when you are polishing the text. For example, there should be one in Overleaf. A spell-checker should be able to catch the most obvious mistakes. Google's spell-checking (in Docs and Gmail, for example) is also very good. One notorious point to be aware of for some non-native English speakers, including myself, is the usage of articles (i.e., "the" or "a") which can be quite random, but modern spell checkers (at least the Google ones) are able to spot this and recommend when an article should be added or removed.
  • Perspective: While important as the final step of your Master's studies, keep in mind this is a Master's thesis and not a doctoral dissertation, so make sure that the scope is appropriate and do not spend too many months on it. If you plan to continue with research, consider applying for a PhD position, which sometimes could build on top of your Master's work (if not directly, at least in terms of gained experience).
  • Writing guides: Consider reading a book about scientific/academic writing (or at least some blogs/articles). There are also online courses available on the topic, such as on edX and Coursera, which are often free if you only audit the course. As an example, read about "topic sentences" and see if they could help your writing process.

Use of Large Language Models (LLMs):

  • Master's in Data Science Guidelines: First of all, check the guidelines about AI usage provided for the Master's Programme in Data Science. In particular, "If you use a language model to produce the work you are returning, you must report in writing which model (e.g. ChatGPT, DeepL) you have used and in what way. This also applies to theses." Disclose the usage in the Acknowledgments or some other appropriate section of the thesis.
  • Beware of nonsense: It is your job to check that the output produced by the LLM is correct and meaningful. In particular, be aware of the tendency of many LLMs, when dealing with complex technical contexts, to spout well-written word salads which really mean absolutely nothing. Another common problem is that LLMs can deal badly with technical words, for example substituting technical words with non-technical ones, with hilarious effects.
  • Writing style: The writing style of some LLMs (ChatGPT in particular) is recognizable from miles away, objectively terrible, stereotyped, and banal. The sentence structure used by ChatGPT (at least at the time of writing) is overly repetitive, as is its choice of wording ("leverage", "innovative", "tapestry", etc.). Moreover, ChatGPT has a tendency to oversell its content ("innovative", "groundbreaking", etc.). This is not generally true of all LLMs (and definitely is not a statement about their general writing capabilities; even GPT-3 could write better than this), but just a fact to be aware of and which might not be obvious if you have not spent many hours reading ChatGPT-generated content.
  • Knowledge gathering: Writing aside, there are other areas where modern LLMs and AI tools can help during a Master's thesis. For example, in addition to Google Scholar and other academic repositories and good old-fashioned search, LLM-based services such as Perplexity and Elicit (and others) can be successfully used as additional tools to expand the way a researcher can explore the literature, find related work, and improve their understanding.
  • Conclusions: In sum, I am in favor of using LLMs in writing especially to break the "blank page" and get things started, or occasionally to suggest how to write or rewrite a paragraph or sentence, but you need to remain in full control of the produced text. In particular, do not switch off your critical thinking and do not blindly accept what the LLM is proposing. We know what the extreme of that might look like.

Acknowledgments:

Thanks to Antti Honkela, Marlon Tobaben and Aki Rehn for useful comments and suggestions.