Exercise 2: Scientific programming in Python with Numpy - Bag-of-words model

Bag-of-words model
The bag-of-words model is the basis for many methods of quantitative text analysis and machine learning. A collection of texts is represented as a so-called document-term matrix: each text is represented as a vector of word frequencies and forms one row of the matrix. Thus, the document-term matrix has the shape (number of texts, number of words). Each entry of a vector is the absolute frequency of the corresponding word in the text in question.
Example: Given the two texts:

t1 = "Ich baue mir ein Haus"
t2 = "Ich streiche mein Haus rot"

the document-term matrix looks like:
| Ich | Haus | baue | mir | ein | streiche | mein | rot |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
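The example above can be reproduced directly with Numpy and the standard library. This is a minimal illustration of the model itself, not the class the task below asks for; the variable names are placeholders:

```python
from collections import Counter

import numpy as np

# Minimal sketch of the example above; names are illustrative only.
texts = ["Ich baue mir ein Haus", "Ich streiche mein Haus rot"]
tokenized = [text.split() for text in texts]

# Vocabulary ordered by total frequency; ties keep first-seen order.
total_counts = Counter(token for tokens in tokenized for token in tokens)
vocabulary = [word for word, _ in total_counts.most_common()]

# One row per text, one column per word.
dtm = np.array([[tokens.count(word) for word in vocabulary]
                for tokens in tokenized])
print(vocabulary)
print(dtm)
```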
Task 1
Implement the bag-of-words model in a class called BagOfWords.

This class is initialized with a function for tokenizing texts, passed via the parameter tokenizer in the constructor. A suitable function shall be chosen as the default value for this parameter. Additionally, there should be a parameter max_words that determines how many words are considered for the document-term matrix. The default value of this parameter shall be None, meaning that all words of all texts are considered. If you pass a number n to this parameter, only the n most frequent words (measured by their frequency across all texts) shall be included in the document-term matrix. Also, the encode method shall return the vocabulary of the document-term matrix as a list, in the correct order.
Example: Given the configuration:

```python
bow = BagOfWords(max_words=2)
t1 = "ich ich ich du du ein"
t2 = "ich du ein"
t3 = "ich ich du"
dtm, types = bow.encode([t1, t2, t3])
print(types)
print('####')
print(dtm)
```
Such that the output is:

```python
np.array(['ich', 'du'])
####
np.array([[3, 2],
          [1, 1],
          [2, 1]])
```
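One possible skeleton that reproduces this behaviour is sketched below. It is an assumption about the intended design (the whitespace default tokenizer and the internal names are placeholders), not the reference solution:

```python
from collections import Counter

import numpy as np


def simple_tokenizer(text):
    # Hypothetical default: split on whitespace.
    return text.split()


class BagOfWords:
    """One possible sketch of the class described above."""

    def __init__(self, tokenizer=simple_tokenizer, max_words=None):
        self.tokenizer = tokenizer
        self.max_words = max_words

    def encode(self, texts):
        tokenized = [self.tokenizer(text) for text in texts]
        counts = Counter(token for tokens in tokenized for token in tokens)
        vocabulary = [word for word, _ in counts.most_common(self.max_words)]
        # Map each word to its column index in the matrix.
        column = {word: index for index, word in enumerate(vocabulary)}
        dtm = np.zeros((len(texts), len(vocabulary)), dtype=int)
        for row, tokens in enumerate(tokenized):
            for token in tokens:
                if token in column:
                    dtm[row, column[token]] += 1
        return dtm, vocabulary
```

Note that Counter.most_common(None) returns all entries, so max_words=None naturally covers the "all words" case.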
Tip
For counting all sorts of things, the Counter class of Python's built-in collections module can be helpful.
Also, it might be helpful to have a data structure that keeps track of the position (i.e., column-index in the document-term-matrix) of each token in your vocabulary.
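For instance (a small illustration; `position` is a hypothetical name for such a lookup structure):

```python
from collections import Counter

tokens = "ich ich ich du du ein".split()
counts = Counter(tokens)
print(counts.most_common(2))  # → [('ich', 3), ('du', 2)]

# Word -> column index in the document-term matrix.
position = {word: index for index, word in enumerate(counts)}
print(position)  # → {'ich': 0, 'du': 1, 'ein': 2}
```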
Task 2
In WueCampus you'll find a collection of German novels. Read the novels' content using Python. Preferably, use the handy Path class of Python's pathlib module.
```python
from pathlib import Path

text_files = list(Path("Romane").glob('*.txt'))  # You can reuse this list later to reference texts!
texts = [file.read_text() for file in text_files]
```
Afterward, use your BagOfWords class to create a document-term matrix containing all words of the collection.
Use this matrix and numpy operations to:
- Find the most common word in the collection
- Find the longest and the shortest text.
- Find the text with the smallest vocabulary.
- Compute relative word frequencies, normalized by the text lengths.
Pro:
- Find the word whose relative frequency varies the most across the collection.
- Find all words that occur in fewer than five texts.
- Mask all words in each text that occur less often than the mean word frequency of the text.
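Assuming dtm is the document-term matrix produced by your class, these tasks map onto axis-wise numpy operations. The following sketch uses a small toy matrix and placeholder names, not the actual novel collection:

```python
import numpy as np

# Toy document-term matrix: 3 texts (rows), 3 words (columns).
dtm = np.array([[3, 2, 0],
                [1, 1, 1],
                [2, 1, 0]])

# Most common word in the collection: column with the largest total count.
most_common_col = dtm.sum(axis=0).argmax()

# Longest / shortest text: row with the largest / smallest token count.
text_lengths = dtm.sum(axis=1)
longest, shortest = text_lengths.argmax(), text_lengths.argmin()

# Smallest vocabulary: row with the fewest non-zero entries.
smallest_vocab = (dtm > 0).sum(axis=1).argmin()

# Relative frequencies: divide each row by its text length (broadcasting).
rel_freq = dtm / text_lengths[:, np.newaxis]

# Pro: word whose relative frequency varies the most across texts.
most_varying = rel_freq.std(axis=0).argmax()
```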
12 October 2023, 13:56