Re: First Task PMI

by Jan Keller -
Number of replies: 0
Hey Anne,
Yes, we want you to implement it from scratch, because this is also an excellent opportunity to learn (or revisit) the absolute basics of NLP (string preprocessing, tokenization, word counting, filtering, etc.) :) You'll find a tiny sketch of those basics right below.
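In case it helps you get started, here is a minimal sketch of those basics; the regex-based tokenizer, the min_count threshold, and the function names are deliberately simple illustrations, not the expected solution.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split on runs of non-letter characters."""
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]

def count_words(tokens, min_count=2):
    """Count tokens and drop everything rarer than min_count."""
    counts = Counter(tokens)
    return {w: c for w, c in counts.items() if c >= min_count}

corpus = "The cat sat on the mat. The cat saw another cat."
tokens = tokenize(corpus)
print(count_words(tokens))  # {'the': 3, 'cat': 3}
```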
--
Regarding your PMI "issue": That's a great observation, and PMI's bias towards infrequent events is a well-known deficiency of the metric!
There exist several workarounds to alleviate this:
1) Filtering: It's a good strategy to filter out the rarest words whenever appropriate! However, as the definition of "rare" might vary from case to case, be careful not to cut out too much!
2) Smoothing: A more structural approach to remedy the bias is to smooth the word frequency distribution. The simplest method is Laplace smoothing (often called add-k smoothing in the context of PMI). It adds a constant value $k$ to the raw counts before computing the relative frequencies. The intuition is that adding $k$ barely changes the relative frequency of a frequent word, whereas for a rare word the relative change is substantial, which dampens the inflated PMI values that rare events would otherwise receive. A suitable value for $k$ depends on your corpus and thus might require some iterative tinkering (see the sketch after this list).
3) Using Normalized PMI: Besides the rarity bias, PMI values are also unbounded, meaning they can take arbitrarily large values, which sometimes makes them hard to interpret. The normalized PMI addresses both issues by adding a normalization factor that a) penalizes rare word pairs and b) bounds the score to the range $[-1, +1]$: $\mathrm{NPMI}(x, y) = \mathrm{PMI}(x, y) / (-\log P(x, y))$. The core intuition behind the normalization factor $-\log P(x, y)$ is that it measures the surprisal of encountering the words $x$ and $y$ together. For rare word pairs this value is very high, so dividing the plain PMI by it shrinks the resulting normalized score (the sketch below also shows this computation).
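To make 2) and 3) concrete, here is a minimal sketch of how both could look on top of a dictionary of co-occurrence counts. The pair_counts structure, the base-2 logarithm, and the shortcut of smoothing only the observed pairs (and deriving the word probabilities from the pair marginals) are my own assumptions for illustration, not a required interface for the task.

```python
import math
from collections import Counter

def pmi_scores(pair_counts, k=1.0, normalized=False):
    """Compute add-k smoothed PMI, or NPMI if normalized=True.

    pair_counts: Counter mapping (x, y) co-occurrence pairs to raw counts.
    k:           add-k (Laplace) smoothing constant added to every pair count.
    """
    # Add k to every observed pair count (simplification: unseen pairs stay at 0).
    smoothed = {pair: c + k for pair, c in pair_counts.items()}
    total = sum(smoothed.values())

    # Word marginals derived from the smoothed pair counts.
    left, right = Counter(), Counter()
    for (x, y), c in smoothed.items():
        left[x] += c
        right[y] += c

    scores = {}
    for (x, y), c in smoothed.items():
        p_xy = c / total
        p_x, p_y = left[x] / total, right[y] / total
        pmi = math.log2(p_xy / (p_x * p_y))
        if normalized:
            pmi /= -math.log2(p_xy)  # NPMI: bounded to [-1, +1]
        scores[(x, y)] = pmi
    return scores

# Tiny usage example with made-up co-occurrence counts
pairs = Counter({("new", "york"): 8, ("new", "car"): 2,
                 ("old", "york"): 1, ("old", "car"): 9})
print(pmi_scores(pairs, k=1, normalized=True))
```

Setting k = 0 gives you the plain PMI on the same counts, which makes it easy to see how much the smoothing dampens the scores of the rare pairs.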

Best,
Lennart