Course: WS23_DS4DH1 > Part 3: Data modeling for data science

Exercise: Numpy and Bag-of-words model

Completion conditions
Opened: Tuesday, 21 November 2023, 00:00
Due: Tuesday, 28 November 2023, 00:00
E2

Exercise 2: Scientific programming in Python with Numpy - Bag-of-words model

Bag-of-words model

The bag-of-words model is the basis for many methods of quantitative text analysis and machine learning. A collection of texts is represented as a so-called document-term matrix: each text is represented as a vector of word frequencies and forms one row of the matrix. The document-term matrix therefore has the shape (number of texts, number of words). Each entry of a vector is the absolute frequency of the corresponding word in that text.

Example: Given the two texts:

t1 = "Ich baue mir ein Haus"
t2 = "Ich streiche mein Haus rot"

The document-term matrix looks like this:

     Ich  Haus  baue  mir  ein  streiche  mein  rot
t1     1     1     1    1    1         0     0    0
t2     1     1     0    0    0         1     1    1
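The construction described above can be sketched in a few lines of NumPy. This naive version orders the vocabulary by first appearance, so the column order differs slightly from the table; the variable names are illustrative, not prescribed by the exercise:

```python
import numpy as np

texts = ["Ich baue mir ein Haus", "Ich streiche mein Haus rot"]
tokenized = [t.split() for t in texts]

# Vocabulary in order of first appearance across all texts
vocab = []
for tokens in tokenized:
    for tok in tokens:
        if tok not in vocab:
            vocab.append(tok)

# Document-term matrix: one row per text, one column per word
dtm = np.zeros((len(texts), len(vocab)), dtype=int)
for i, tokens in enumerate(tokenized):
    for tok in tokens:
        dtm[i, vocab.index(tok)] += 1

print(vocab)
print(dtm)
```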

Task 1

Implement the bag-of-words model in a class called BagOfWords.

The class is initialized with a tokenizing function via the parameter tokenizer in the constructor; choose a suitable function as the default value for this parameter. Additionally, there should be a parameter max_words that determines how many words are included in the document-term matrix. Its default value shall be None, meaning that all words of all texts are considered. If you pass a number n to this parameter, only the n most frequent words (measured by their frequency across all texts) shall be included in the document-term matrix. The encode method shall also return the vocabulary of the document-term matrix as a list, in the correct column order.
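One possible reading of this interface is sketched below; the default tokenizer, the internal position dictionary, and all other implementation details are assumptions, one of many workable designs:

```python
from collections import Counter
import numpy as np

def simple_tokenizer(text):
    # A plausible default tokenizer: lowercase and split on whitespace
    return text.lower().split()

class BagOfWords:
    def __init__(self, tokenizer=simple_tokenizer, max_words=None):
        self.tokenizer = tokenizer
        self.max_words = max_words

    def encode(self, texts):
        tokenized = [self.tokenizer(t) for t in texts]
        # Counter.most_common(None) returns all tokens, sorted by frequency
        counts = Counter(tok for tokens in tokenized for tok in tokens)
        vocab = [tok for tok, _ in counts.most_common(self.max_words)]
        index = {tok: j for j, tok in enumerate(vocab)}  # token -> column
        dtm = np.zeros((len(texts), len(vocab)), dtype=int)
        for i, tokens in enumerate(tokenized):
            for tok in tokens:
                if tok in index:  # skip tokens cut off by max_words
                    dtm[i, index[tok]] += 1
        return dtm, vocab
```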

Example: Given the configuration:

bow = BagOfWords(max_words=2)

t1 = "ich ich ich du du ein"
t2 = "ich du ein"
t3 = "ich ich du"
dtm, types = bow.encode([t1, t2, t3])

print(types)
print('####')
print(dtm)

Such that the output is:

['ich', 'du']
####
[[3 2]
 [1 1]
 [2 1]]

Tip

For counting all sorts of things, the Counter class from Python's built-in collections module can be helpful.

Also, it might be helpful to have a data structure that keeps track of the position (i.e., the column index in the document-term matrix) of each token in your vocabulary.
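Both tips combined in a small sketch; the example tokens are made up:

```python
from collections import Counter

tokens = "ich du ich ein ich du".split()
counts = Counter(tokens)

print(counts.most_common(2))  # [('ich', 3), ('du', 2)]

# Position lookup: token -> column index in the document-term matrix
positions = {tok: j for j, (tok, _) in enumerate(counts.most_common())}
```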

Task 2

On WueCampus you'll find a collection of German novels. Read the novels' content using Python.

Preferably, use the handy Path class from Python's pathlib module.

from pathlib import Path

text_files = list(Path("Romane").glob('*.txt'))  # You can reuse this list later to reference texts!
texts = [file.read_text(encoding='utf-8') for file in text_files]  # adjust the encoding if needed

Afterward, use your BagOfWords class to create a document-term matrix containing all words of the collection.

Use this matrix and numpy operations to:

  • Find the most common word in the collection.
  • Find the longest and the shortest text.
  • Find the text with the smallest vocabulary.
  • Compute relative word frequencies, normalized by the text lengths.
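On a hypothetical toy matrix, the operations above might look like this; the matrix, vocabulary, and variable names are made up for illustration:

```python
import numpy as np

# Toy document-term matrix (3 texts, 4 words) standing in for the real one
dtm = np.array([[3, 2, 0, 1],
                [1, 1, 1, 0],
                [2, 1, 0, 0]])
vocab = ["ich", "du", "ein", "haus"]

most_common = vocab[dtm.sum(axis=0).argmax()]  # word with the highest total count
text_lengths = dtm.sum(axis=1)                 # tokens per text
longest, shortest = text_lengths.argmax(), text_lengths.argmin()
vocab_sizes = (dtm > 0).sum(axis=1)            # distinct words per text
smallest_vocab = vocab_sizes.argmin()
rel_freq = dtm / text_lengths[:, np.newaxis]   # rows now sum to 1
```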

Pro:

  • Find the word whose relative frequency varies the most across the collection.
  • Find all words that occur in less than five texts.
  • Mask all words in each text that occur less often than the mean word frequency of the text.
  • Romane (folder, 12 October 2023, 13:56)
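A few NumPy idioms that may help with the pro tasks, again on a hypothetical toy matrix; whether std is the right measure of variation is a design choice left to you:

```python
import numpy as np

dtm = np.array([[3, 2, 0, 1],
                [1, 1, 1, 0],
                [2, 1, 0, 0]])
rel = dtm / dtm.sum(axis=1, keepdims=True)     # relative frequencies per text

most_varying = rel.std(axis=0).argmax()        # column whose relative frequency spreads most
rare = (dtm > 0).sum(axis=0) < 5               # boolean mask: words occurring in < 5 texts
mean_per_text = dtm.mean(axis=1, keepdims=True)
masked = np.where(dtm < mean_per_text, 0, dtm) # zero out below-mean entries per text
```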