We can think of language as a composite of two things:
Lexical semantics: modeling/capturing the meaning of words
Usually, we assume that the atomic lexical units are words, but some semantic units span multiple terms.
Lexical association measures help us find words that frequently occur together.
But how can we describe those word sequences?
Semantic units spanning multiple words are called multi-word expressions (MWEs).
[Bender and Lascarides, 2020] broadly define MWEs as:
"collections of words which co-occur and are idiosyncratic in one or more aspects of their form or meaning"
Forms of MWEs include expressions that:
Collocations are a sub-type of MWEs.
(Choueka, 1988)
[A collocation is defined as] “a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components."
Criteria:
Collocations are words that frequently appear together and display a semantic association.
Examples: [strong coffee], [make progress], [do homework], [from my point of view]
Collocations tend to make word senses less ambiguous.
Example: heavy has many senses, and its precise meaning is often defined by its adjacent noun.
He is a [heavy drinker].
The final punch dealt them a [heavy blow].
The meaning of compositional phrases can be predicted from the meaning of their parts.
Collocations have limited compositionality:
[bull market] $\neq$ [bull] + [market]
Idioms are collocations with the highest degree of non-compositionality. They display a figurative and non-literal meaning.
Their meaning is conveyed through metaphor, metonymy, or other figurative devices.
Examples: [kick the bucket] (to die), [spill the beans] (to reveal a secret)
Locality of idioms
Idioms are often specific to a particular region or dialect.
A good way to test whether an MWE is a collocation is to exchange one of its parts for a synonym and check if it sounds off: [strong coffee] sounds natural, whereas [powerful coffee] sounds odd.
Since collocations are language specific, their correct usage indicates fluency in a language.
Often, incorrect usage of collocations is what makes non-native speakers sound unnatural.
Many collocations cannot be freely modified with additional lexical material or through grammatical transformations; e.g., [kick the bucket] loses its idiomatic meaning when passivized (*the bucket was kicked).
Goal: Find words frequently occurring in the same context (-> collocations).
Finding those word pairs is often framed as a statistical hypothesis test:
$H_0$: the two words are statistically independent, i.e., $P(x, y) = P(x) \cdot P(y)$
$H_1$: the two words co-occur more often than independence would predict
$\rightarrow$ Goal: Find all pairs $(x, y)$ for which $H_1$ is more likely than $H_0$.
While there are numerous such tests available, today we'll focus only on PMI (Pointwise Mutual Information).
Intuition: How much more likely is it that two words co-occur than we would expect by chance?
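Formally, PMI is defined as (the base of the logarithm is a free choice; base 2 is common):

$$ PMI(x, y) = \log_2 \frac{P(x, y)}{P(x) \cdot P(y)} $$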
Let's break it down:
$P(x)$, $P(y) \rightarrow$ Probability of word $x$ (resp. $y$) occurring in our corpus.
Denominator $P(x) \cdot P(y) \rightarrow$ Expected joint probability of $x$ and $y$ co-occurring under the assumption that they are statistically independent.
Numerator: $P(x, y) \rightarrow$ Observed joint probability of word x and y occurring together in our corpus.
Log part: $\log_b \left(\frac{p}{q}\right) = \log_b p - \log_b q$ — the log of the ratio is positive when the ratio is above 1 and negative when it is below 1.
$\rightarrow$ Positive values indicate these words occur more frequently than we would expect by chance!
How to define together?
Co-occurrence is loosely defined; in theory, we can freely express our notion of togetherness.
For example, words that appear in the same sentence, paragraph, document, or fixed-size context window.
Another option is to define together as a collocation, i.e., words must be adjacent.
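A minimal sketch of this adjacency-based variant (the toy corpus, the `pmi` helper, and the `pmi_scores` dictionary are purely illustrative, not part of any library):

```python
from collections import Counter
from math import log2

# Toy corpus; in practice this would be a large tokenized corpus.
tokens = "new york is a big city and new york has a big harbor".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))     # "together" = directly adjacent words
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(x, y):
    p_xy = bigrams[(x, y)] / n_bi              # observed joint probability P(x, y)
    p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
    return log2(p_xy / (p_x * p_y))            # > 0: more frequent than chance

# Score every observed adjacent pair and show the strongest candidates.
pmi_scores = {pair: pmi(*pair) for pair in bigrams}
for (x, y), score in sorted(pmi_scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"PMI({x}, {y}) = {score:.2f}")
```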
PMI values can range from $-\infty$ to $\infty$.
Positive values indicate that $x$ and $y$ co-occur more often than we would expect.
The interpretation of negative PMI values is much vaguer: the two words co-occur less often than we would expect.
$\rightarrow$ Does it mean they are more than just unrelated, i.e., that they actively avoid each other?
Can we draw any conclusion based on this information?
You shall know a word by the company it keeps.
(Firth's distributional hypothesis)
But also on the company it rejects?
Empirical observation: Negative PMI values are harmful to downstream applications.
To avoid this negative influence, Positive Pointwise Mutual Information (PPMI) is used.
$$ PPMI(x, y) = \max(PMI(x, y), 0) $$
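Applied to the illustrative `pmi_scores` dictionary from the sketch above, PPMI is a one-liner:

```python
# Clamp negative PMI scores to zero (PPMI); pmi_scores is the toy dictionary from the earlier sketch.
ppmi_scores = {pair: max(score, 0.0) for pair, score in pmi_scores.items()}
```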
Intuition: Words that often, or even exclusively, occur at the same distance from one another are candidates for collocations.
How to find those pairs?
Use variance: Word pairs with low variance in offset might be collocations.
Offset: the signed distance between the two words.
Sample variance of the observed offsets $d_1, \dots, d_n$:

$$ s^2 = \frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1} $$
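A minimal sketch of the idea (the sentences and the candidate pair are purely illustrative):

```python
from statistics import mean, variance

# Toy data: three sentences containing the candidate pair ("knocked", "door").
sentences = [
    "she knocked on the door".split(),
    "he knocked at the old door".split(),
    "they knocked twice on the door".split(),
]

# Collect the signed offset (position of "door" minus position of "knocked") per sentence.
offsets = []
for sent in sentences:
    for i, w1 in enumerate(sent):
        for j, w2 in enumerate(sent):
            if w1 == "knocked" and w2 == "door":
                offsets.append(j - i)

print("offsets:", offsets)                    # [3, 4, 4] for this toy data
print("mean offset:", mean(offsets))
print("sample variance:", variance(offsets))  # low variance -> collocation candidate
```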
So far, we have only learned to extract possible candidates for MWEs from texts. To do so, we rely only on word and co-occurrence frequencies, which tell us nothing about the meaning of the terms we extract.
In general, statistical methods for extracting relations from text cannot infer the nature of the relatedness; they can only tell us that some relationship exists between two terms.
Consider the following examples:
"You were [right]", I said, "We should have turned [right] three blocks ago."
After [serving] four years as ambassador to Germany, she returned to Louisville.
The night before her departure, she invited former colleagues to a well-known restaurant famous for [serving] Wiener Schnitzel.
It's obvious that right and to serve each carry two different senses across these utterances.
Depending on our corpus, association measures could reveal that "be right" and "turn right" are strongly related. Still, even with modern, context-aware techniques like BERT, we could not ground these occurrences in their appropriate senses.
We have to rely on external and manually created resources to infer word senses.
WordNet is the most famous example of a lexico-semantic resource. (And it's freely available for everyone!)
It contains ~120,000 nouns, ~22,000 adjectives, ~12,000 verbs, and ~4,000 adverbs in English, and stores one or multiple senses for each entry.
Furthermore, WordNet stores lexicographic relations among words and groups words into lexicographic categories.
The foundational relation in WordNet is the synonym relation.
Words are stored in synsets, which cluster near-synonymous terms into the same group.
Additionally, synsets are labeled with lexicographic categories, like ANIMAL, PERSON, QUANTITY, ...
These categories are called supersenses.
WordNet connects different synsets via labeled edges, where each label specifies the type of semantic relatedness.
Some examples for noun relations:
Relation | Definition | Example |
---|---|---|
Hypernym | From concept to superordinate | breakfast -> meal |
Hyponym | From concept to subtype | meal -> lunch |
Instance [Hyper-/Hypo]nym | From a specific instance to its class (and vice versa) | Goethe -> author |
Part Meronym | From whole to parts | forest -> tree |
Part Holonym | From parts to whole | soldier -> army |
Antonym | Semantic opposition | cold <-> warm |
... | ... | ... |
In its entirety, WordNet can be seen as a knowledge graph connecting English words via various lexico-semantic links.
WordNet is integrated into the Natural Language Toolkit (NLTK), a legacy Python NLP library.
from nltk.corpus import wordnet as wn
term, pos = "love", "n"
synsets = wn.synsets(term, pos)
for idx, synset in enumerate(synsets):
print(f"{idx + 1}. {term}: {synset.definition()}")
print(f"Synset: {synset}")
print(f"Examples: {synset.examples()}")
print(f"Hyponyms: {synset.hyponyms()[:5]}")
print(f"Hypernyms: {synset.hypernyms()[:5]}")
print("_"*30 + "\n")
newline = "\n"
print(f"Synomns for {term}: {newline.join([' '.join(s) for s in wn.synonyms(term) if s])}")
1. love: a strong positive emotion of regard and affection
Synset: Synset('love.n.01')
Examples: ['his love for his work', 'children need a lot of love']
Hyponyms: [Synset('agape.n.01'), Synset('agape.n.02'), Synset('amorousness.n.01'), Synset('ardor.n.02'), Synset('benevolence.n.01')]
Hypernyms: [Synset('emotion.n.01')]
______________________________

2. love: any object of warm affection or devotion
Synset: Synset('love.n.02')
Examples: ['the theater was her first love', 'he has a passion for cock fighting']
Hyponyms: []
Hypernyms: [Synset('object.n.04')]
______________________________

3. love: a beloved person; used as terms of endearment
Synset: Synset('beloved.n.01')
Examples: []
Hyponyms: []
Hypernyms: [Synset('lover.n.01')]
______________________________

4. love: a deep feeling of sexual desire and attraction
Synset: Synset('love.n.04')
Examples: ['their love left them indifferent to their surroundings', 'she was his first love']
Hyponyms: []
Hypernyms: [Synset('sexual_desire.n.01')]
______________________________

5. love: a score of zero in tennis or squash
Synset: Synset('love.n.05')
Examples: ['it was 40 love']
Hyponyms: []
Hypernyms: [Synset('score.n.03')]
______________________________

6. love: sexual activities (often including sexual intercourse) between two people
Synset: Synset('sexual_love.n.02')
Examples: ['his lovemaking disgusted her', "he hadn't had any love in months", 'he has a very complicated love life']
Hyponyms: []
Hypernyms: [Synset('sexual_activity.n.01')]
______________________________

Synonyms for love: passion beloved dear dearest honey erotic_love sexual_love love_life lovemaking making_love sexual_love enjoy bang be_intimate bed bonk do_it eff fuck get_it_on get_laid have_a_go_at_it have_intercourse have_it_away have_it_off have_sex hump jazz know lie_with make_love make_out roll_in_the_hay screw sleep_together sleep_with
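The other relation types from the table above, as well as the supersense labels, can be queried in much the same way. A small sketch; the chosen synsets and lemma are just examples, and the exact results depend on the installed WordNet version:

```python
from nltk.corpus import wordnet as wn

# Hypernyms / hyponyms
breakfast = wn.synset("breakfast.n.01")
print(breakfast.hypernyms())              # e.g. [Synset('meal.n.01')]
print(wn.synset("meal.n.01").hyponyms()[:3])

# Meronyms: parts of a whole
print(wn.synset("car.n.01").part_meronyms()[:3])

# Antonyms are stored on lemmas, not on synsets
print(wn.lemma("cold.a.01.cold").antonyms())

# Supersense (lexicographer category) of a synset
print(wn.synset("dog.n.01").lexname())    # e.g. 'noun.animal'
```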
# Get the similarity between synsets to measure their semantic relatedness.
# The similarity is based on the shortest path between synsets.
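# In NLTK, path_similarity = 1 / (shortest_path_length + 1), so scores lie in (0, 1].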
synset_a = wn.synsets("game", "n")[0]
synset_b = wn.synsets("win", "n")[0]
synset_c = wn.synsets("newspaper", "n")[0]
print("Definitions:")
print(*map(lambda s: f"{s.lemmas()[0].name()}: {s.definition()}", (synset_a, synset_b, synset_c)), sep="\n")
print()
print("Similarities:")
print(f"game -> win {synset_a.path_similarity(synset_b)}")
print(f"game -> newspaper {synset_a.path_similarity(synset_c)}")
Definitions:
game: a contest with rules to determine a winner
win: a victory (as in a race or other competition)
newspaper: a daily or weekly publication on folded sheets; contains news and articles and advertisements

Similarities:
game -> win 0.125
game -> newspaper 0.0625
# ONLY FOR EDUCATIONAL PURPOSES; DO NOT USE IT AS IS FOR ANYTHING!
# A naive function that maps each word in a sentence to its hypernym,
# to obtain a more abstract (and less precise) version of the sentence.
def get_hypernyms(sentence, n):
    """
    Gets the nth hypernym for each noun or verb in the sentence
    (adjectives and untagged tokens are kept as-is).
    """
    hypernym_sent = []
    for token, pos in sentence:
        # WordNet does not contain hypernyms for adjectives
        if pos is None or pos == wn.ADJ:
            hypernym_sent.append((token, pos))
        else:
            synset = wn.synsets(token, pos)[0]
            hypernyms = list(synset.closure(lambda s: s.hypernyms()))
            if hypernyms:
                try:
                    hypernym = hypernyms[n]
                except IndexError:
                    # Fall back to the direct hypernym if the chain is shorter than n.
                    hypernym = hypernyms[0]
                token = hypernym.lemmas()[0].name()
            hypernym_sent.append((token, pos))
    return hypernym_sent
sentence_a = [("I", None), ("adore", wn.VERB), ("my", None), ("dogs", wn.NOUN), ("and", None), ("cats", wn.NOUN)]
sentence_b = [("He", None), ("was", wn.VERB), ("attacked", wn.VERB), ("by", None), ("a", None), ("lion", wn.NOUN)]
print(get_hypernyms(sentence_a, n=-13))
print(get_hypernyms(sentence_b, n=-13))
[('I', None), ('love', 'v'), ('my', None), ('domestic_animal', 'n'), ('and', None), ('feline', 'n')]
[('He', None), ('was', 'v'), ('contend', 'v'), ('by', None), ('a', None), ('feline', 'n')]