Exercise 2: Scientific programming in Python with Numpy - Bag-of-words model

Bag-of-words model
The bag-of-words model is the basis for many methods of quantitative text analysis and machine learning. A collection of texts is represented as a so-called document-term matrix: each text is represented as a vector of word frequencies and forms one row of the matrix. Thus, the document-term matrix has the shape (number of texts, number of words). Each entry of a vector is the absolute frequency of the corresponding word in the text in question.
Example: Given the two texts:

t1 = "Ich baue mir ein Haus"
t2 = "Ich streiche mein Haus rot"

the document-term matrix looks like:
| Ich | Haus | baue | mir | ein | streiche | mein | rot |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
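The example above can be reproduced directly with Numpy and the standard library. This is a minimal illustration of the model itself, not the class the task below asks for; the variable names are placeholders:

```python
from collections import Counter

import numpy as np

# Minimal sketch of the example above; names are illustrative only.
texts = ["Ich baue mir ein Haus", "Ich streiche mein Haus rot"]
tokenized = [text.split() for text in texts]

# Vocabulary ordered by total frequency; ties keep first-seen order.
total_counts = Counter(token for tokens in tokenized for token in tokens)
vocabulary = [word for word, _ in total_counts.most_common()]

# One row per text, one column per word.
dtm = np.array([[tokens.count(word) for word in vocabulary]
                for tokens in tokenized])
print(vocabulary)
print(dtm)
```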
Task 1
Implement the bag-of-words model in a class called BagOfWords.

This class is initialized with a function for tokenizing texts, passed via the parameter tokenizer in the constructor. A suitable function shall be chosen as the default value for this parameter. Additionally, there should be a parameter max_words that determines how many words are considered for the document-term matrix. The default value of this parameter shall be None, meaning that all words of all texts are considered. If you pass a number n to this parameter, only the n most frequent words (measured by their frequency across all texts) shall be included in the document-term matrix. Also, the encode method shall return the vocabulary of the document-term matrix as a list, in the correct order.
Example: Given the configuration:

```python
bow = BagOfWords(max_words=2)
t1 = "ich ich ich du du ein"
t2 = "ich du ein"
t3 = "ich ich du"
dtm, types = bow.encode([t1, t2, t3])
print(types)
print('####')
print(dtm)
```
Such that the output is:

```python
np.array(['ich', 'du'])
####
np.array([[3, 2],
          [1, 1],
          [2, 1]])
```
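One possible skeleton that reproduces this behaviour is sketched below. It is an assumption about the intended design (the whitespace default tokenizer and the internal names are placeholders), not the reference solution:

```python
from collections import Counter

import numpy as np


def simple_tokenizer(text):
    # Hypothetical default: split on whitespace.
    return text.split()


class BagOfWords:
    """One possible sketch of the class described above."""

    def __init__(self, tokenizer=simple_tokenizer, max_words=None):
        self.tokenizer = tokenizer
        self.max_words = max_words

    def encode(self, texts):
        tokenized = [self.tokenizer(text) for text in texts]
        counts = Counter(token for tokens in tokenized for token in tokens)
        vocabulary = [word for word, _ in counts.most_common(self.max_words)]
        # Map each word to its column index in the matrix.
        column = {word: index for index, word in enumerate(vocabulary)}
        dtm = np.zeros((len(texts), len(vocabulary)), dtype=int)
        for row, tokens in enumerate(tokenized):
            for token in tokens:
                if token in column:
                    dtm[row, column[token]] += 1
        return dtm, vocabulary
```

Note that Counter.most_common(None) returns all entries, so max_words=None naturally covers the "all words" case.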
Tip
For counting all sorts of things, the Counter class of Python's built-in collections module can be helpful.
Also, it might be helpful to have a data structure that keeps track of the position (i.e., column-index in the document-term-matrix) of each token in your vocabulary.
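For instance (a small illustration; `position` is a hypothetical name for such a lookup structure):

```python
from collections import Counter

tokens = "ich ich ich du du ein".split()
counts = Counter(tokens)
print(counts.most_common(2))  # → [('ich', 3), ('du', 2)]

# Word -> column index in the document-term matrix.
position = {word: index for index, word in enumerate(counts)}
print(position)  # → {'ich': 0, 'du': 1, 'ein': 2}
```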
Task 2
In WueCampus you'll find a collection of German novels. Read the novels' content using Python. Preferably, use the handy Path class of Python's pathlib module.
```python
from pathlib import Path

text_files = list(Path("Romane").glob('*.txt'))  # You can reuse this list later to reference texts!
texts = [file.read_text() for file in text_files]
```
Afterward, use your BagOfWords class to create a document-term matrix containing all words of the collection.
Use this matrix and numpy operations to:
- Find the most common word in the collection
- Find the longest and the shortest text.
- Find the text with the smallest vocabulary.
- Compute relative word frequencies, normalized by the text lengths.
Pro:
- Find the word whose relative frequency varies the most across the collection.
- Find all words that occur in fewer than five texts.
- Mask all words in each text that occur less often than the mean word frequency of the text.
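Assuming dtm is the document-term matrix produced by your class, these tasks map onto axis-wise numpy operations. The following sketch uses a small toy matrix and placeholder names, not the actual novel collection:

```python
import numpy as np

# Toy document-term matrix: 3 texts (rows), 3 words (columns).
dtm = np.array([[3, 2, 0],
                [1, 1, 1],
                [2, 1, 0]])

# Most common word in the collection: column with the largest total count.
most_common_col = dtm.sum(axis=0).argmax()

# Longest / shortest text: row with the largest / smallest token count.
text_lengths = dtm.sum(axis=1)
longest, shortest = text_lengths.argmax(), text_lengths.argmin()

# Smallest vocabulary: row with the fewest non-zero entries.
smallest_vocab = (dtm > 0).sum(axis=1).argmin()

# Relative frequencies: divide each row by its text length (broadcasting).
rel_freq = dtm / text_lengths[:, np.newaxis]

# Pro: word whose relative frequency varies the most across texts.
most_varying = rel_freq.std(axis=0).argmax()
```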
12 October 2023, 13:56