Data Science for Humanities 2¶

Session: Corpus Linguistics¶

Summer term 25¶

Prof. Goran Glavaš, Lennart Keller¶

Multi-word expressions, collocations, idioms¶

We can think of language as a composite of two things:

  • Lexicon: A collection of units that we use to reference and describe
    • Entities
    • States
    • Transitions
  • Grammar: A set of rules for combining the elements of the lexicon in a meaningful way

Lexical semantics: modeling/capturing the meaning of words




Usually, we assume that the atomic lexicon units are words, but some semantic units span multiple terms.

Lexical association measures help us find words that frequently occur together.

But how can we describe those word sequences?

Multi-word expressions¶

Semantic units spanning multiple words are called multi-word expressions (MWEs).

[Bender and Lascarides, 2020] broadly define MWEs as:

"collections of words which co-occur and are idiosyncratic in one or more aspects of their form or meaning"

Forms of MWEs include:

  • Words that co-occur much more frequently than by chance (i.e., they are lexically associated)
    • Examples: [killer whale], [flight attendant], [cruise missile]
  • Terms that, in combination, disobey grammatical or syntactic rules
    • Examples: [long time no see], [all of a sudden]
  • Words whose combined meaning is more than the sum of their parts
    • Examples: [greenhouse effect], [jump the shark]

Collocations¶

A sub-type of MWEs.

(Choueka, 1988)

[A collocation is defined as] “a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components."

Criteria:

  • non-compositionality
  • non-substitutability
  • non-modifiability

Collocations are words that frequently appear together and display a semantic association.

Examples: [strong coffee], [make progress], [do homework], [from my point of view]

Collocations may specify word senses¶

Collocations tend to make word senses less ambiguous.

Example: heavy has many senses, and its precise meaning is often defined by its adjacent noun.

  • He is a [heavy drinker].

  • The final punch dealt them a [heavy blow].

Non-Compositionality¶

The meaning of compositional phrases can be predicted from the meaning of their parts.

Collocations have limited compositionality:

[bull market] $\neq$ [bull] + [market]

Idioms¶

Idioms are collocations with the highest degree of non-compositionality. They display a figurative and non-literal meaning.

Metaphors, metonymy, or other figurative devices convey their sense.

Examples:

  • [It struck me] [out of the blue]; the company was [behind all of] the accidents.
  • [Der Teufel steckt im Detail] ("The devil is in the detail").

Locality of idioms

Idioms are often specific to a particular region or dialect.

  • [Butter bei die Fische] (northern German: "get to the point")
  • [Grüß Gott] (southern German greeting, literally "greet God")
  • ...

Non-Substitutability¶

A good way to test whether an MWE is a collocation is to exchange one of its parts with a synonym and check if it sounds off:

  • Collocation: [strong coffee] $\rightarrow$ [powerful coffee] 😵‍💫
  • Non-collocation: [bad weather] $\rightarrow$ [poor weather] 😊

Since collocations are language specific, their correct usage indicates fluency in a language.

Often, incorrect usage of collocations is what makes non-native speakers sound unnatural.

  • [Ein Beispiel für] $\rightsquigarrow$ [An example for]
  • [Ein Beispiel für] $\rightarrow$ [An example of]

Non-modifiability¶

Many collocations cannot be freely modified with additional lexical material or through grammatical transformations.

  • [weapons of mass destruction] $\neq$ [weapons of massive destruction]
  • [to be fed up to the back teeth] $\neq$ [to be fed up to the teeth in the back]

Lexical association measures¶

Goal: Find words frequently occurring in the same context (-> collocations).

Finding those word pairs is often framed as a statistical test:

  • $H_0$: Words $x, y$ aren't related.
  • $H_1$: Words $x, y$ are related.

$\rightarrow$ Goal: Find all pairs $(x, y)$ for which $H_1$ is more likely than $H_0$


While there are numerous tests available, today we'll focus only on PMI.

Pointwise Mutual Information (PMI)¶

Church and Hanks, 1989

$$ PMI(x, y) = \log_2\frac{P(x, y)}{P(x) \cdot P(y)} $$

Intuition: How much more likely is it for two words to co-occur than we would expect by chance?




Let's break it down:

$P(x)$, $P(y) \rightarrow$ Probability of word $x$ / $y$ occurring in our corpus.


Denominator: $P(x) \cdot P(y) \rightarrow$ Estimated joint probability of $x$ and $y$ co-occurring under the assumption that they are statistically independent.


Numerator: $P(x, y) \rightarrow$ Observed joint probability of words $x$ and $y$ occurring together in our corpus.


Log-Part: $\log_b (\frac{x}{y}) = \log_b x - \log_b y$

$\rightarrow$ Positive values indicate these words occur more frequently than we would expect by chance!
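As a minimal sketch (using a made-up miniature corpus and treating adjacent words as co-occurring), PMI can be estimated directly from unigram and bigram counts:

```python
from collections import Counter
from math import log2

# Toy corpus, invented for illustration; adjacency = co-occurrence.
tokens = ("strong coffee is strong and strong tea is hot "
          "strong coffee keeps me awake").split()

unigrams = Counter(tokens)                  # for estimating P(x), P(y)
bigrams = Counter(zip(tokens, tokens[1:]))  # for estimating P(x, y)

def pmi(x, y):
    p_x = unigrams[x] / sum(unigrams.values())
    p_y = unigrams[y] / sum(unigrams.values())
    p_xy = bigrams[(x, y)] / sum(bigrams.values())
    return log2(p_xy / (p_x * p_y))

# Positive value: "strong coffee" co-occurs more often than chance predicts.
print(f"PMI(strong, coffee) = {pmi('strong', 'coffee'):.2f}")
```

On a real corpus the counts would come from millions of tokens; with such tiny counts the estimates are, of course, extremely noisy.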


Co-occurrence¶

How to define together?

Co-occurrence is loosely defined; in theory, we can freely express our notion of togetherness.

For example, words that appear in the same:

  • phrase
  • sentence
  • paragraph
  • window of $n$ words
  • ...

Another option is to define together as a collocation, i.e., words must be adjacent.
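One of the notions listed above, co-occurrence within a window of $n$ words, can be sketched as follows (an illustrative toy function; tokenization and normalization are ignored):

```python
from collections import Counter

# Count ordered co-occurrence pairs within a window of n following words.
def window_cooccurrences(tokens, n=2):
    pairs = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + 1 + n]:
            pairs[(x, y)] += 1
    return pairs

tokens = "he made good progress and made progress fast".split()
pairs = window_cooccurrences(tokens, n=2)
# "made ... progress" is caught both when adjacent and at distance 2.
print(pairs[("made", "progress")])  # -> 2
```

Setting `n=1` recovers the strict adjacency notion used for collocations.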


PMI Properties¶

PMI values can range from $-\infty$ to $\infty$.

Positive values indicate that $x$ and $y$ co-occur more often than we would expect.

The interpretation of negative PMI values is much vaguer: the words co-occur less often than expected.

$\rightarrow$ Does this mean the words are more than unrelated, i.e., that they actively avoid each other?

Can we draw any conclusion based on this information?

You shall know a word by the company it keeps.
(Firth's distributional hypothesis)

But also on the company it rejects?

Empirical observation: Negative PMI values are harmful to downstream applications.

To avoid this negative influence, Positive Pointwise Mutual Information (PPMI) is used.

$$ PPMI(x, y) = \max(PMI(x, y), 0) $$
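A small sketch of PPMI over a made-up co-occurrence matrix (the counts are invented; in practice they would come from a corpus):

```python
import numpy as np

# Toy symmetric co-occurrence matrix: cell (i, j) counts how often
# word i co-occurs with word j (counts are made up for illustration).
counts = np.array([[0, 8, 1],
                   [8, 0, 1],
                   [1, 1, 0]], dtype=float)

p_xy = counts / counts.sum()               # observed joint probabilities
p_x = counts.sum(axis=1) / counts.sum()    # marginal word probabilities

with np.errstate(divide="ignore"):         # log2(0) -> -inf, clipped below
    pmi = np.log2(p_xy / np.outer(p_x, p_x))

ppmi = np.maximum(pmi, 0)                  # PPMI(x, y) = max(PMI(x, y), 0)
print(ppmi.round(2))
```

All zero-count cells (PMI of $-\infty$) and negative PMI cells end up at 0, so only positive association survives.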

Spatial detection of collocations¶

Intuition:

Words that often or even only occur at the same distance from one another are candidates for collocations.

How to find those pairs?

Use variance: Word pairs with low variance in offset might be collocations.

Offset: signed distance between the two words.

$$ \frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1} $$

  • $d_i \rightarrow$ Offset of the $i$-th co-occurrence of the word pair
  • $\bar{d} \rightarrow$ Mean of all offsets
  • $n \rightarrow$ Number of observed co-occurrences of the pair
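A minimal sketch of this idea, with hypothetical offset observations: Python's `statistics.variance` computes exactly the sample variance defined above (division by $n - 1$).

```python
from statistics import mean, variance  # variance = sample variance, / (n - 1)

# Hypothetical signed offsets for two word pairs observed in a corpus.
offsets_collocation = [3, 3, 4, 3, 5, 3]     # stable offset: collocation candidate
offsets_random_pair = [-2, 7, 1, -4, 9, -3]  # unrelated words: offsets scatter

print(mean(offsets_collocation), variance(offsets_collocation))
print(mean(offsets_random_pair), variance(offsets_random_pair))
```

The first pair has low offset variance and would be flagged as a candidate; the second pair's offsets are all over the place, so it would be discarded.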

Lexico-semantic resources: WordNet, BabelNet, PanLex¶

So far, we have only learned to extract possible candidates for MWEs from texts.

But to achieve this, we rely only on word or co-occurrence frequencies, which tell us nothing about the meaning of the terms we extract.

In general, any statistical method to extract relations from texts cannot infer the nature of the relatedness; it can only tell us that there is some relationship between two terms.

Consider the following examples:

"You were [right]", I said, "We should have turned [right] three blocks ago."
After [serving] four years as ambassador to Germany, she returned to Louisville.
The night before her departure, she invited former colleagues to a well-known restaurant in Berlin, which is famous for [serving] Wiener Schnitzel.

It's obvious that right and to serve each have two different senses across these utterances.

Depending on our corpus, association measures could reveal that "be right" and "turn right" are strongly related. Still, even with modern, context-aware techniques like BERT, we couldn't ground these two usages in their appropriate senses.

WordNet - Capturing word senses¶

We have to rely on external and manually created resources to infer word senses.

WordNet is the most famous example of a lexico-semantic resource. (And it's freely available for everyone!)

It contains ~120,000 nouns, ~22,000 adjectives, ~12,000 verbs, and ~4,000 adverbs in English, and stores one or multiple senses for each entry.

Furthermore, WordNet stores lexicographic relations among words and groups words into lexicographic categories.

The foundational relation in WordNet is the synonym relation.

Synsets¶

Words are stored in synsets, clustering near-synonymous terms into the same group.

Additionally, synsets are labeled with lexicographic categories, like ANIMAL, PERSON, QUANTITY, ...

These categories are called supersenses.

Sense Relations¶

WordNet connects different synsets via edges labeled with the type of semantic relation between them.

Some examples for noun relations:

| Relation | Definition | Example |
| --- | --- | --- |
| Hypernym | Concept to superordinate | breakfast -> meal |
| Hyponym | Concept to subtype | meal -> lunch |
| Instance [Hyper-/Hypo]nym | ... | Goethe -> author |
| Part Meronym | From whole to parts | forest -> tree |
| Part Holonym | From parts to whole | soldier -> army |
| Antonym | Semantic opposition | cold <-> warm |
| ... | ... | ... |

WordNet - A lexico-semantic Knowledge Graph¶

In its entirety, WordNet can be seen as a knowledge graph bridging English words via various lexico-semantic links.

WNGraph.png

How to use WordNet?¶

WordNet is integrated into the Natural Language Toolkit - a legacy Python NLP library.

In [1]:
from nltk.corpus import wordnet as wn

term, pos = "love", "n"
synsets = wn.synsets(term, pos)
for idx, synset in enumerate(synsets):
    print(f"{idx + 1}. {term}: {synset.definition()}")
    print(f"Synset: {synset}")
    print(f"Examples: {synset.examples()}")
    print(f"Hyponyms: {synset.hyponyms()[:5]}")
    print(f"Hypernyms: {synset.hypernyms()[:5]}")
    print("_"*30 + "\n")

newline = "\n"
print(f"Synonyms for {term}: {newline.join([' '.join(s) for s in wn.synonyms(term) if s])}")
1. love: a strong positive emotion of regard and affection
Synset: Synset('love.n.01')
Examples: ['his love for his work', 'children need a lot of love']
Hyponyms: [Synset('agape.n.01'), Synset('agape.n.02'), Synset('amorousness.n.01'), Synset('ardor.n.02'), Synset('benevolence.n.01')]
Hypernyms: [Synset('emotion.n.01')]
______________________________

2. love: any object of warm affection or devotion
Synset: Synset('love.n.02')
Examples: ['the theater was her first love', 'he has a passion for cock fighting']
Hyponyms: []
Hypernyms: [Synset('object.n.04')]
______________________________

3. love: a beloved person; used as terms of endearment
Synset: Synset('beloved.n.01')
Examples: []
Hyponyms: []
Hypernyms: [Synset('lover.n.01')]
______________________________

4. love: a deep feeling of sexual desire and attraction
Synset: Synset('love.n.04')
Examples: ['their love left them indifferent to their surroundings', 'she was his first love']
Hyponyms: []
Hypernyms: [Synset('sexual_desire.n.01')]
______________________________

5. love: a score of zero in tennis or squash
Synset: Synset('love.n.05')
Examples: ['it was 40 love']
Hyponyms: []
Hypernyms: [Synset('score.n.03')]
______________________________

6. love: sexual activities (often including sexual intercourse) between two people
Synset: Synset('sexual_love.n.02')
Examples: ['his lovemaking disgusted her', "he hadn't had any love in months", 'he has a very complicated love life']
Hyponyms: []
Hypernyms: [Synset('sexual_activity.n.01')]
______________________________

Synonyms for love: passion
beloved dear dearest honey
erotic_love sexual_love
love_life lovemaking making_love sexual_love
enjoy
bang be_intimate bed bonk do_it eff fuck get_it_on get_laid have_a_go_at_it have_intercourse have_it_away have_it_off have_sex hump jazz know lie_with make_love make_out roll_in_the_hay screw sleep_together sleep_with
In [2]:
# Get the similarity between synsets to measure their semantic relatedness.
# The similarity is based on the shortest path between synsets.
synset_a = wn.synsets("game", "n")[0]
synset_b = wn.synsets("win", "n")[0]
synset_c = wn.synsets("newspaper", "n")[0]

print("Definitions:")
print(*map(lambda s: f"{s.lemmas()[0].name()}: {s.definition()}", (synset_a, synset_b, synset_c)), sep="\n")
print()
print("Similarities:")
print(f"game -> win {synset_a.path_similarity(synset_b)}")
print(f"game -> newspaper {synset_a.path_similarity(synset_c)}")
Definitions:
game: a contest with rules to determine a winner
win: a victory (as in a race or other competition)
newspaper: a daily or weekly publication on folded sheets; contains news and articles and advertisements

Similarities:
game -> win 0.125
game -> newspaper 0.0625
In [3]:
# ONLY FOR EDUCATIONAL PURPOSES; DO NOT USE IT AS IS FOR ANYTHING!

# A naive function that maps each word in a sentence to its hypernym,
# to obtain a more abstract (and less precise) version of the sentence.
def get_hypernyms(sentence, n):
    """
    Replaces each noun and verb with the nth entry of its hypernym closure;
    adjectives and untagged tokens are passed through unchanged.
    (A negative n indexes from the most abstract hypernyms.)
    """
    hypernym_sent = []
    for token, pos in sentence:
        # WordNet does not contain hypernyms for adjectives
        if pos is None or pos == wn.ADJ:
            hypernym_sent.append((token, pos))
        else:
            synset = wn.synsets(token, pos)[0]
            hypernyms = list(synset.closure(lambda s: s.hypernyms()))
            if hypernyms:
                try:
                    hypernym = hypernyms[n]
                    token = hypernym.lemmas()[0].name()
                except IndexError:
                    hypernym = hypernyms[0]
                    token = hypernym.lemmas()[0].name()
            hypernym_sent.append((token, pos))
            
    return hypernym_sent

sentence_a = [("I", None), ("adore", wn.VERB), ("my", None), ("dogs", wn.NOUN), ("and", None), ("cats", wn.NOUN)]
sentence_b = [("He", None), ("was", wn.VERB), ("attacked", wn.VERB), ("by", None), ("a", None), ("lion", wn.NOUN)]

print(get_hypernyms(sentence_a, n=-13))
print(get_hypernyms(sentence_b, n=-13))
[('I', None), ('love', 'v'), ('my', None), ('domestic_animal', 'n'), ('and', None), ('feline', 'n')]
[('He', None), ('was', 'v'), ('contend', 'v'), ('by', None), ('a', None), ('feline', 'n')]
/opt/homebrew/Caskroom/miniconda/base/envs/python_intro/lib/python3.11/site-packages/nltk/corpus/reader/wordnet.py:604: UserWarning: Discarded redundant search for Synset('animal.n.01') at depth 7
  for synset in acyclic_breadth_first(self, rel, depth):

Going beyond English: BabelNet¶

BabelNetWeb.png

Basics¶

  • BabelNet is a large-scale semantic network and multilingual lexicalized knowledge base.
  • It combines resources like WordNet, Wikipedia, and OmegaWiki, covering 284 languages.

Multilingualism¶

  • BabelNet's goal is to align concepts and senses across languages.
  • In doing so, it offers translations and cross-lingual semantic relationships.
Semantic Network¶

  • Structured as a graph (very similar to WordNet)
  • Each node represents a concept (synset) or a named entity.
  • Edges represent semantic relationships (e.g., hypernymy, meronymy).

PanLex - Preserving endangered languages¶

Basics¶

  • Large-scale multilingual lexical database, covering over 17,000 languages and dialects.
  • PanLex contains around 1.3 billion translation pairs.

Comprehensive Language Coverage¶

  • Goal: Represent both major and minor languages, with a particular focus on supporting endangered and lesser-known languages.

Translation Pairs¶

  • Direct translations between languages.
  • Indirect translations through intermediary languages.