Notes on my Master’s Thesis

Created: 2024-05-19
Last Modified: 2025-01-29

While working on my Master’s thesis I learned some valuable things. Now that I have finished everything, I want to collect everything that helped me in one place, in the hope that it helps others too. This covers how I structured my project and the experiments, as well as the workflows that made my life easier.

The goal of my thesis was to analyze similarities between word embeddings and their vector spaces. To that end, I conducted two experiments. Each experiment needed to process raw data, generate a data set, train models, collect metrics and evaluate them.

I structured the code for the thesis by experiment and kept the experiments independent of each other.
The experiments were not allowed to reference one another: experiment A was not allowed to import anything from experiment B. If some functionality was needed in both experiments, there were only two choices: copy and paste the code (duplicate it), or move it into a separate package from which both experiments were allowed to import it.
Code duplication is okay, and not following DRY (don’t repeat yourself) is okay. You are doing research: it can be very messy, and your requirements might change quickly and substantially.
Experiments should never interfere with one another: no experiment should, at any step, influence the results of another experiment.
All steps needed to conduct an experiment should be executable at any time. There should never be a state where I can’t run an experiment. The only exception is dependencies between steps: Generate Data → Process Data → Experiment 1 → Experiment 2 → Evaluation. After this chain has been run once, every step should be runnable at any time.
I tried to keep the project layout and the goals of the experiments and processes very easy to understand. I structured each experiment by the tasks needed to achieve the research goal.
Have a script that obtains the source data, and make it as plain and easy to understand as possible.
Makefiles (or modern alternatives) can help with dependencies between tasks of an experiment. Also, such tools can give a descriptive name to a script lying somewhere in the project.
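The dependency idea behind a Makefile can be sketched in a few lines of Python. This is only an illustration, not the setup used in the thesis: the task names are hypothetical, and the two experiments are modeled as independent of each other, both depending only on the processed data.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the tasks it depends on.
DEPENDENCIES = {
    "generate_data": set(),
    "process_data": {"generate_data"},
    "experiment_1": {"process_data"},
    "experiment_2": {"process_data"},
    "evaluation": {"experiment_1", "experiment_2"},
}


def tasks_needed(target, dependencies):
    """Collect the target and everything it depends on, directly or indirectly."""
    needed = set()
    stack = [target]
    while stack:
        task = stack.pop()
        if task not in needed:
            needed.add(task)
            stack.extend(dependencies[task])
    return needed


def run_order(target, dependencies=DEPENDENCIES):
    """Return the tasks required for `target`, ordered so dependencies come first."""
    needed = tasks_needed(target, dependencies)
    graph = {task: dependencies[task] & needed for task in needed}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(run_order("experiment_1"))
```

Asking for `experiment_1` never pulls in `experiment_2`, which matches the rule that the experiments stay independent. A real Makefile (or a similar tool) additionally skips steps whose outputs are already up to date, which this sketch does not attempt.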
Adding a separate flag that loads only a subset of the data is very helpful during development, and for me it was necessary. I had to run the code on my university’s HPC (high-performance computing) cluster to actually train the neural networks: the data did not fit on my PC, and my PC was too weak to train them anyway.
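A subset flag could look roughly like this. The flag name `--subset` and the helper `load_samples` are made up for illustration; the real data loading will differ.

```python
import argparse
import itertools


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Run on the full data set or on a small subset."
    )
    # Hypothetical flag: without it, all data is loaded.
    parser.add_argument(
        "--subset",
        type=int,
        default=None,
        metavar="N",
        help="only load the first N samples (for local development)",
    )
    return parser.parse_args(argv)


def load_samples(samples, subset=None):
    """Return all samples, or only the first `subset` samples if a limit is given."""
    # islice with a stop of None consumes the whole iterable.
    return list(itertools.islice(samples, subset))


if __name__ == "__main__":
    args = parse_args()
    data = load_samples(range(1_000), args.subset)
    print(f"loaded {len(data)} samples")
```

Locally the script would be started with something like `--subset 100`, while the job on the cluster simply omits the flag.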
Gradients computed on an ARM machine can differ from gradients computed on an x86 machine. I was fascinated and shocked by that.
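I did not track down the exact cause. One general contributor to such differences (an assumption here, not something I verified for my setup) is that floating-point addition is not associative, so hardware or libraries that sum the same numbers in a different order can produce slightly different results. A tiny illustration, with a hypothetical helper for comparing gradients with a tolerance instead of exact equality:

```python
import math

values = [0.1, 0.2, 0.3]

# The same three numbers, summed in two different orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right == right_to_left)  # False


def gradients_match(a, b, rel_tol=1e-9):
    """Compare two gradient vectors element-wise with a relative tolerance."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b)
    )


print(gradients_match([left_to_right], [right_to_left]))  # True
```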
When working on your university’s HPC cluster, always find the easiest way to have the same environment for development on your own machine and on the cluster. For me, that was conda.
Structure your experiments and document them. Why do you run this experiment? Which question is this experiment helping you find an answer to?
Have one source of truth for the results. Don’t mix results from different machines. I discarded all calculations from my local machine and kept only the ones from the cluster. I also strictly separated them by directory structure.
Always write logs, and log everything, either by printing and piping the output to a file or by using a good logger. For Python, I like https://github.com/Delgan/loguru.
Don’t overwrite your logs. Keep all of them. Add a timestamp in the filename of the log, or append whenever you run. But never overwrite them.
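A minimal sketch of the timestamp approach, using only Python's standard `logging` module so that it runs without extra dependencies (loguru offers similar functionality). The function name `setup_logging` and the file naming scheme are just examples.

```python
import logging
from datetime import datetime
from pathlib import Path


def setup_logging(log_dir, name="experiment"):
    """Create a logger that writes to a new, timestamped file on every run."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)

    # Microseconds are included so that two quick runs get different files.
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    log_path = log_dir / f"{name}_{timestamp}.log"

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    # Mode "a" appends, so an existing file is never truncated.
    handler = logging.FileHandler(log_path, mode="a", encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger, log_path


if __name__ == "__main__":
    logger, log_path = setup_logging("logs", name="experiment_1")
    logger.info("training started")
    print(f"logging to {log_path}")
```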
Avoid saving plots as SVG and prefer PDF. PDFs can be viewed nearly anywhere; SVGs cannot.
You can include a PDF directly in LaTeX, while SVG requires a separate package and Inkscape. Also, matplotlib graphs do not resize nicely when written to SVG.
It can make a lot of sense to split the plotting code into two parts: first load and process the data and write out only the data required for plotting, then plot that data. This makes the plotting itself faster, and changing details of the plot is faster too.
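The split could be sketched like this. The record fields (`epoch`, `loss`) and the file names are invented for the example. The first function is the slow part that runs rarely; the second only reads the small cache file and can be re-run many times while tweaking the plot. It assumes matplotlib is installed.

```python
import json
from pathlib import Path


def prepare_plot_data(raw_records, cache_path):
    """Slow step: reduce the raw results to just the numbers the plot needs."""
    data = {
        "epochs": [record["epoch"] for record in raw_records],
        "loss": [record["loss"] for record in raw_records],
    }
    Path(cache_path).write_text(json.dumps(data), encoding="utf-8")
    return data


def plot_from_cache(cache_path, output_path):
    """Fast step: read the cached data and write the figure."""
    # Imported here so the preparation step works without matplotlib.
    import matplotlib

    matplotlib.use("Agg")  # non-interactive backend, works on a cluster
    import matplotlib.pyplot as plt

    data = json.loads(Path(cache_path).read_text(encoding="utf-8"))
    fig, ax = plt.subplots()
    ax.plot(data["epochs"], data["loss"])
    ax.set_xlabel("epoch")
    ax.set_ylabel("loss")
    # The output format follows the file extension, e.g. "loss.pdf".
    fig.savefig(output_path)
    plt.close(fig)


if __name__ == "__main__":
    records = [{"epoch": 1, "loss": 0.9}, {"epoch": 2, "loss": 0.5}]
    prepare_plot_data(records, "loss_plot_data.json")
    plot_from_cache("loss_plot_data.json", "loss.pdf")
```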
Write scripts for syncing data to/from the cluster. rsync or even scp are great tools!
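Such a sync script can be a thin wrapper around `rsync`. In this sketch the host alias `cluster` and all paths are placeholders. It uses the common `-avz` options (archive mode, verbose, compression) and an optional `--dry-run` to preview what would be transferred.

```python
import subprocess


def rsync_command(source, destination, dry_run=False):
    """Build the rsync call as a list of arguments."""
    command = ["rsync", "-avz"]
    if dry_run:
        # Only show what would be transferred, without copying anything.
        command.append("--dry-run")
    command += [source, destination]
    return command


def sync(source, destination, dry_run=False):
    """Run rsync and raise an error if it fails."""
    subprocess.run(rsync_command(source, destination, dry_run), check=True)


if __name__ == "__main__":
    # Placeholder host and paths. A trailing slash on the source means
    # "the contents of this directory" to rsync.
    print(rsync_command("data/", "cluster:thesis/data/", dry_run=True))
    print(rsync_command("cluster:thesis/results/", "results/cluster/"))
```

Keeping one function per direction (to the cluster, from the cluster) in a small script avoids retyping long paths and makes it harder to sync in the wrong direction by accident.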
Simple is better than clever.
Start writing while waiting for results.
Sometimes it’s better to sit back and take pen and paper to structure the thesis.
Ask friends to read sections of your writing! It doesn’t matter whether they are from the same field as you. Your writing will contain mistakes that you will never find yourself!
Talk to your professor regularly about obstacles. Structure the meeting accordingly.
Make mistakes, learn and understand what went wrong, then fix them.
Collect all necessary papers and citations at reading time.
Use a version control system (hint: Git is a good choice).