First Task PMI

by Anne Schmid -
Number of replies: 2

Hey, 

I'm currently working on the PMI task. First of all, I'm assuming you want us to code the calculation ourselves, right? I noticed there's also a method in the NLTK package, so I hope I didn't do a lot of unnecessary work :D

Anyway, I noticed that because of the way PMI is calculated, if I compute the PMI for two words that each appear only once in my corpus, their PMI values end up being the highest values in my results (unless I'm making a mistake in my calculation). Is there a way to circumvent that and get meaningful results? Should I just eliminate all words that appear only once from my calculations?

In reply to 'Anne Schmid'

Re: First Task PMI

by Jan Keller -
Hey Anne,
Yes, we want you to implement it from scratch, because this is also an excellent opportunity to learn (or revisit) the absolute basics of NLP (string preprocessing, tokenization, word counting, filtering, etc.) :)
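To give you a rough idea of the kind of from-scratch pipeline we have in mind, here is a minimal Python sketch. The function names and the toy corpus are purely illustrative (not part of the task), and counting co-occurrence at the sentence level is only one possible choice of context window:

```python
import math
from collections import Counter
from itertools import combinations

def tokenize(text):
    # deliberately minimal preprocessing: lowercase + whitespace split
    return text.lower().split()

def pmi_scores(sentences):
    """PMI for word pairs that co-occur within the same sentence."""
    word_counts = Counter()
    pair_counts = Counter()
    for sent in sentences:
        tokens = tokenize(sent)
        word_counts.update(tokens)
        # count each unordered pair of distinct words once per sentence
        pair_counts.update(frozenset(p) for p in combinations(sorted(set(tokens)), 2))

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())

    scores = {}
    for pair, c_xy in pair_counts.items():
        x, y = tuple(pair)
        p_xy = c_xy / total_pairs
        p_x = word_counts[x] / total_words
        p_y = word_counts[y] / total_words
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

corpus = ["the cat sat on the mat",
          "the dog sat on the mat",
          "a rare unicorn appeared"]
for pair, score in sorted(pmi_scores(corpus).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

If you run this on the toy corpus, you'll see exactly the effect you describe: the pairs built from words that occur only once float to the top.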
--
Regarding your PMI "issue": That's a great observation, and PMI's bias towards infrequent events is a well-known deficiency of the metric!
There exist several workarounds to alleviate this:
1) Filtering: It's a good strategy to filter out the rarest words whenever appropriate! However, since the definition of "rare" varies from case to case, be careful not to cut out too much!
2) Smoothing: A more structural approach to remedy the bias is to smooth the word frequency distribution. The simplest method is Laplace smoothing (sometimes called add-k smoothing in the context of PMI). It adds a constant value k to the raw frequencies before computing the relative frequencies. The intuition is that adding k to the counts of frequent words barely changes their relative frequencies, whereas for rare words the relative change is substantial, which dampens the PMI bias. Suitable values for k depend on your corpus and might require some iterative tinkering (a sketch of this and of 3) follows after the list).
3) Using normalized PMI: Besides the rarity bias, PMI values are also unbounded, meaning they can take arbitrarily large values, which sometimes makes them hard to interpret. Normalized PMI addresses both issues by adding a normalization factor that a) penalizes rare word pairs and b) bounds the score to the range from -1 to +1. The core intuition behind the normalization factor $-\log P(x,y)$ is that it represents the surprise of encountering the two words $x$ and $y$ together. For rare word pairs this value is very high, so when we divide the plain PMI by it, $\mathrm{NPMI}(x, y) = \frac{\mathrm{PMI}(x, y)}{-\log P(x, y)}$, the resulting normalized score becomes small.
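To make 2) and 3) a bit more concrete, here is a rough sketch of both on top of raw counts. The function and parameter names are only illustrative; n_pair_types and vocab_size stand for the number of distinct pair types and word types, which are needed to renormalize the denominators after smoothing:

```python
import math

def add_k_pmi(c_xy, c_x, c_y, total_pairs, total_words,
              n_pair_types, vocab_size, k=1.0):
    """PMI computed from add-k smoothed relative frequencies.

    Every raw count is bumped by k; the denominators grow by k times the
    number of event types so the smoothed values remain probabilities.
    """
    p_xy = (c_xy + k) / (total_pairs + k * n_pair_types)
    p_x = (c_x + k) / (total_words + k * vocab_size)
    p_y = (c_y + k) / (total_words + k * vocab_size)
    return math.log2(p_xy / (p_x * p_y))

def npmi(c_xy, c_x, c_y, total_pairs, total_words):
    """Normalized PMI: PMI(x, y) / -log P(x, y), bounded between -1 and +1."""
    p_xy = c_xy / total_pairs
    p_x = c_x / total_words
    p_y = c_y / total_words
    pmi = math.log2(p_xy / (p_x * p_y))
    return pmi / -math.log2(p_xy)
```

As a rough sanity check: NPMI is close to +1 when two words (almost) only occur together, and close to 0 when they are statistically independent.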

Best,
Lennart