Lexical semantics


In this guide we explore how to represent the meaning of texts using semantic vector representations of words (i.e., word embeddings).

# Text and Language

We will see how to:
- (1) model/capture the meaning of words
- (2) find the syntactic/linguistic structure of sentences
- (3) extract/structure relevant information from text (for example, named entities)

## Data Loading
We will be loading a corpus of Amazon reviews for products in the category "Electronics".

In [None]:
# the codecs library helps with reading from (or writing to) files with different encodings (e.g., UTF-8, ANSI, etc.)
import codecs

# specifying the location of the file containing our reviews
filepath = "unlabeled_reviews.txt"

# loading the content of the file and creating a list of texts from its lines (one review per line);
# the with-block makes sure the file is closed again after reading
with codecs.open(filepath, "r", encoding="utf8", errors="replace") as f:
    dataset = [l.strip() for l in f]

## Data Exploration
... let's just look at some statistics and examples.

In [None]:
# let's see how many reviews we have in our dataset
print("We have: " + str(len(dataset)) + " reviews in the dataset.")

In [None]:
# The first row is ..
print(dataset[0])

In [None]:
# let's do some tokenization
from spacy.lang.en import English
nlp = English()

# create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.tokenizer

# tokenize each of the rows
dataset_tokenized = [tokenizer(d) for d in dataset]
print(dataset_tokenized[0][0])

In [None]:
type(dataset_tokenized[0])

In [None]:
length_characters = [len(d) for d in dataset]
length_words = [len(d) for d in dataset_tokenized]

print(max(length_characters))
print(max(length_words))


In [None]:
import matplotlib.pyplot as plt

plt.hist(length_words, bins = 10)
plt.show()

## Latent Semantic Analysis

We will compute a term-document matrix: a matrix whose rows correspond to terms, whose columns correspond to documents, and whose element at position (t, d) is 1 if the document in column d contains the term in row t, and 0 otherwise. The matrix will be displayed as a pandas data frame so that the term and document labels of rows and columns are easy to read.

In [None]:
# For a faster demonstration (and because Colab would run out of memory otherwise), let's only look at the first 100 documents
dataset_tokenized = dataset_tokenized[:100]

In [None]:
import pandas as pd
from scipy.sparse import lil_matrix
from gensim.parsing.preprocessing import STOPWORDS

remove_list = list(STOPWORDS) + [",", ".", ":", "*", ";", "?", "!", "-"]

# let's map each term to the list of documents (contexts) it occurs in
d = {}
for j, tokens in enumerate(dataset_tokenized):
  for t in tokens:
    t = str(t).lower()
    if t not in remove_list:
      # setdefault() returns the value stored under the key t;
      # if the key does not exist yet, it first inserts it with the given default (an empty list)
      d.setdefault(t, []).append(j)
A = lil_matrix((len(d.keys()), len(dataset_tokenized)), dtype=int)
for i, t in enumerate(d.keys()):
    for j in d[t]:
        A[i, j] = 1   
A_df = pd.DataFrame(A.toarray(), index=d.keys(), columns=range(len(dataset_tokenized))) 
A_df
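
As a toy illustration of the construction above, here is a minimal sketch in plain Python. The three tiny "documents" are made up for the example, and a set is used instead of a list so that repeated words in one document are recorded only once.

```python
# Toy corpus (hypothetical), already tokenized and lowercased
docs = [["great", "camera"], ["great", "battery", "great"], ["camera", "bag"]]

# term -> set of indices of the documents that contain the term
postings = {}
for j, tokens in enumerate(docs):
    for t in tokens:
        # setdefault returns the existing set, or inserts and returns a new empty one
        postings.setdefault(t, set()).add(j)

# dicts keep insertion order, so terms are listed by first occurrence
terms = list(postings)

# binary term-document matrix as a list of rows (one row per term)
matrix = [[1 if j in postings[t] else 0 for j in range(len(docs))] for t in terms]

for t, row in zip(terms, matrix):
    print(f"{t:8s} {row}")
```

Each row shows in which documents a term occurs; "great" occurs twice in the second document but still contributes a single 1.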


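Now we compute the singular value decomposition (SVD) of the term-document matrix. For demonstration purposes we compute the full SVD, which is only feasible because the matrix is small.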

In [None]:
from scipy.linalg import svd
u, s, vt = svd(A.toarray())
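
To get a feeling for what the three returned objects are, the following sketch decomposes a small made-up matrix. It uses `numpy.linalg.svd` instead of `scipy.linalg.svd`; both return the same kind of triple. The matrix `M` is an assumption for illustration, not part of our data.

```python
import numpy as np

# Toy binary term-document matrix (hypothetical): 4 terms x 3 documents
M = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)

# Full SVD: u is (terms x terms), s holds the singular values, vt is (documents x documents)
u, s, vt = np.linalg.svd(M)
print(u.shape, s.shape, vt.shape)

# The singular values come back sorted in descending order
print(s)

# Only the first len(s) columns of u take part in the reconstruction of M
M_rebuilt = u[:, :len(s)] @ np.diag(s) @ vt
print(np.allclose(M, M_rebuilt))
```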

Let's look at the obtained matrices ...

In [None]:
u.shape

The index of our term-document data frame contains the words in our vocabulary. Their index number corresponds to the position (row) in the matrix u.

In [None]:
ind = 40
A_df.index[ind]

In [None]:
u[ind]

We reduce the dimensionality by keeping only the first k columns of u, which correspond to the k largest singular values.

In [None]:
k = 10
u_k = u[:,:k]

In [None]:
u_k[ind]
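
The effect of this truncation can be sketched on a small made-up matrix. The sketch below builds the rank-k approximation and checks that its error is determined by the discarded singular values. It also shows one common variant, scaling the term vectors by the singular values, which the cells above do not do; treat that variant as an option, not as the method used in this guide.

```python
import numpy as np

# Toy binary term-document matrix (hypothetical): 4 terms x 3 documents
M = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)

u, s, vt = np.linalg.svd(M, full_matrices=False)

k = 2
u_k, s_k, vt_k = u[:, :k], s[:k], vt[:k, :]

# Rank-k approximation of M built from the k largest singular values
M_k = u_k @ np.diag(s_k) @ vt_k

# Frobenius norm of the approximation error ...
err = np.linalg.norm(M - M_k)
# ... equals the root of the sum of the squared discarded singular values
expected_err = np.sqrt(np.sum(s[k:] ** 2))
print(err, expected_err)

# Optional variant: weight each latent dimension by its singular value
term_vectors = u_k * s_k
print(term_vectors.shape)
```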

We can use the obtained dense vectors for similarity comparisons.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(u_k, u_k)
similarity_matrix

In [None]:
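import numpy as np
# indices of all terms, sorted by ascending similarity to the term at position ind:
# the most similar terms are at the end of the array (the very last one is usually the term itself)
np.argsort(similarity_matrix[ind])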

In [None]:
similarity_matrix[ind][4]

In [None]:
A_df.index[4]
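
Looking up neighbours by hand gets tedious quickly. A minimal sketch of a helper that returns the n most similar terms by name is shown below; the four terms and their two-dimensional vectors are invented for the example, and `most_similar` is a hypothetical helper, not a library function.

```python
import numpy as np

# Hypothetical vocabulary with made-up 2-dimensional term vectors
terms = ["camera", "lens", "battery", "charger"]
vectors = np.array([[1.0, 0.1],
                    [0.9, 0.2],
                    [0.1, 1.0],
                    [0.2, 0.9]])

# L2-normalize the rows so that dot products are cosine similarities
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim = unit @ unit.T

def most_similar(index, n=2):
    """Return the n terms most similar to terms[index], best first."""
    order = np.argsort(-sim[index])            # negate to sort in descending order
    order = [i for i in order if i != index]   # drop the term itself
    return [(terms[i], float(sim[index, i])) for i in order[:n]]

print(most_similar(terms.index("camera")))
```

With the real data, `terms` would correspond to `A_df.index` and `sim` to `similarity_matrix`.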

## Word Embeddings and Semantic Similarity

SpaCy also lets you easily load and use pretrained word embeddings (static word vectors). Being trained on far larger corpora, such vectors typically capture the meaning of words better than the LSA vectors we computed above from 100 reviews. To load the embeddings, you need one of the medium-sized or large SpaCy models (names ending in "md" or "lg"); the small models (ending in "sm") do not ship with word embeddings.

In [None]:
import spacy

# the SpaCy model "en_core_web_md" has to be downloaded first, e.g. with: python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')

# Let's analyse word similarities based on word embeddings
word1 = "dog"
word2 = "puppy"

d1 = nlp(word1)
d2 = nlp(word2)

print("Vector of " + word1 + ":")
print(d1.vector)

print("Vector of " + word2 + ":")
print(d2.vector)

sim = d1.similarity(d2)

print()
print("Semantic similarity between " + word1 + " and " + word2 + ": " + str(sim))

#### Document-level similarity: 

SpaCy computes the embedding of a document as the average of the embeddings of its tokens. Let us compute the vectors for the first 1000 reviews in our collection.

In [None]:
import numpy as np

def mat_norm(mat, norm_order=2, axis=1):
    return mat / np.transpose([np.linalg.norm(mat, norm_order, axis)])


review_vectors = []

for doc in nlp.pipe(dataset[:1000], disable=["tagger", "parser", "ner"]):
    review_vectors.append(doc.vector)
    
# let us stack these vectors into a matrix, so that we can later easily compute semantic similarities for all pairs of reviews
reviews_matrix = np.array(review_vectors)
print(reviews_matrix.shape)

# fast way of computing cosines for all pairs of vectors

# each vector (row) in "reviews_matrix" is divided by its norm (L2-normalized): function mat_norm
rev_mat_norm = mat_norm(reviews_matrix)

# the cosines are then simply the dot products between the normalized vectors:
# the normalized matrix multiplied with its own transpose
similarities = np.matmul(rev_mat_norm, np.transpose(rev_mat_norm))

print(similarities.shape)
print(similarities[:50, :50])
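
The two ideas used in this cell, averaging word vectors into a document vector and computing all cosines as one matrix product of L2-normalized rows, can be checked on a toy example. The word vectors below are invented two-dimensional values, not real embeddings, and `doc_vector` and `cosine` are hypothetical helpers.

```python
import numpy as np

# Hypothetical 2-dimensional "word embeddings"
word_vectors = {
    "good": np.array([1.0, 0.0]),
    "great": np.array([0.8, 0.6]),
    "bad": np.array([-1.0, 0.0]),
}

def doc_vector(tokens):
    """Document embedding as the average of its word embeddings."""
    return np.mean([word_vectors[t] for t in tokens], axis=0)

def cosine(a, b):
    """Textbook cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = [["good", "great"], ["great"], ["bad"]]
D = np.array([doc_vector(d) for d in docs])

# L2-normalize every row, then one matrix product yields all pairwise cosines
D_unit = D / np.linalg.norm(D, axis=1, keepdims=True)
sims = D_unit @ D_unit.T

print(sims)
print(cosine(D[0], D[1]))
```

Note that a document without any tokens would have a zero vector, and normalizing it would divide by zero; the sketch ignores that case.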

Let's find the pairs of documents with the largest similarities (i.e., the largest scores in the "similarities" matrix).

In [None]:
def sort_values_by_indices(mat):
    # flat (1-D) indices of all matrix entries, ordered by ascending value
    flat_order = np.argsort(mat, axis=None)
    print(np.sort(mat, axis=None))
    print(flat_order)
    # unravel_index converts flat indices into a tuple of coordinate arrays (rows, columns)
    coords = np.unravel_index(flat_order, mat.shape)
    print(coords)
    # dstack pairs the row and column indices up: one (row, column) pair per matrix entry
    pairs = np.dstack(coords)
    print(pairs)
    return pairs[0]
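
What `argsort` with `axis=None` and `unravel_index` do is easiest to see on a tiny matrix. The values below are made up; the sketch only uses NumPy calls that also appear in the function above.

```python
import numpy as np

mat = np.array([[0.1, 0.9, 0.4],
                [0.7, 0.2, 0.5]])

# Positions in the flattened matrix, ordered by ascending value
flat_order = np.argsort(mat, axis=None)
print(flat_order)

# Convert every flat position back into a (row, column) coordinate
rows, cols = np.unravel_index(flat_order, mat.shape)
pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)

# The last pair points at the largest entry of the matrix
print(pairs[-1], mat[pairs[-1]])
```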

In [None]:
sorted_similarities = sort_values_by_indices(similarities)

In [None]:
sorted_similarities

In [None]:
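# The similarity matrix is symmetric, so every pair shows up twice, as (i, j) and (j, i),
# and each of the 1000 documents is trivially most similar to itself (the diagonal).
# We therefore keep only the pairs with i < j, ordered from most to least similar.
most_similar = [(i, j) for i, j in reversed(sorted_similarities) if i < j]
most_similar[:20]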

Word/document embeddings allow us to detect similarities between documents even if they don't use the same words. This is semantic similarity (as opposed to mere word overlap).

In [None]:
print(dataset[527]) #596,  79
print()
print(dataset[454])


## Linguistic/Syntactic Analysis of Text

Parsing and part-of-speech-tagging


In [None]:
# wordcloud is a Python library for generating word clouds from text
from wordcloud import WordCloud

# matplotlib is a Python library for plotting / rendering data and graphs of all kinds
import matplotlib.pyplot as plt

# concatenating all our reviews into one long text
big_text = " ".join(dataset)

# creating an object of WordCloud which we will then depict: defining white background and feeding our big text
wcloud = WordCloud(collocations = False, background_color = "white", stopwords = remove_list).generate_from_text(big_text)

# setting the size of the figure for the word cloud plot
plt.figure(figsize=(8,7))

# plotting the created wordcloud
plt.imshow(wcloud, interpolation = "bilinear")
# we don't want to plot the axis
plt.axis("off")
# showing the plot
plt.show()

### Linguistic annotations: POS-tagging and dependency parsing
SpaCy ships pre-trained models for POS-tagging and dependency parsing.


In [None]:
# parsing and pos-tagging may take longer on large datasets, so let's only demonstrate the functionality on first N reviews in our collection
N = 50
small_dataset = dataset[:N]

for doc in nlp.pipe(small_dataset, disable=["ner"]):
    # POS-tagging and parses are performed on a sentence-level (not document level, like tokenization)
    for sent in doc.sents:
        print(sent.text)
        print("------------------------------------")
        for tok in sent:
            # tok.i index of the token in the document
            # sent.start index of the starting token of the sentence in the document
            # tok.tag_ is the fine-grained POS-tag; 
            # tok.pos_ is the coarse-grained POS-tag; 
            # tok.dep_ is the dependency relation from the governing token to this token
            # tok.head is the token which is the syntactic head of the current token (.text is its text, .i its index in the document)
            print(tok.i - sent.start, tok.text, tok.tag_, tok.pos_, tok.dep_, tok.head.text, "(" + str(tok.head.i - sent.start) + ")")
        print()    

In [None]:
# displacy, the visualization sub-library of SpaCy, lets us visualize the dependency parse of a sentence
from spacy import displacy

sentence = "First of all, why do you want a power UPS?"
sent = nlp(sentence)
displacy.serve(sent, style = "dep")

## Named Entity Recognition

### Data Loading

We will be working with messages from the "20 Newsgroups" dataset, which comes with scikit-learn and can be loaded easily.

In [1]:
# loading the texts from an existing dataset, the famous "20 news groups" 
from sklearn.datasets import fetch_20newsgroups

# filtering news from the categories "talk.politics.misc" and "talk.politics.mideast"
ng_data = fetch_20newsgroups(subset='train', categories = ["talk.politics.misc", "talk.politics.mideast"])

print(len(ng_data.data))

num_docs = 100
texts = ng_data.data[: num_docs]

1029


### NER with SpaCy

In [2]:
# Let's go back to our friend SpaCy (and its visualization sub-library *displacy*)
import spacy

# the default nlp pipeline will perform tokenization, POS-tagging, dependency parsing, and named entity recognition
nlp = spacy.load("en_core_web_sm")



In [3]:
ind = 9
text = texts[ind]
# running the full SpaCy pipeline (tokenization, POS-tagging, parsing, and NER) on the selected newsgroup text
doc = nlp(text)
print(text)

print()
print("Entities: \n--------------------------------")

# you can see the meaning of NER labels from SpaCy here: https://spacy.io/api/annotation#named-entities
for ent in doc.ents:
    print(ent.text + " (" + ent.label_ + "); start: character " + str(ent.start_char) + "; end: character " + str(ent.end_char))

From: hernlem@chess.ncsu.edu (Brad Hernlem)
Subject: Re: was:Go Hezbollah!
Reply-To: hernlem@chess.ncsu.edu (Brad Hernlem)
Organization: NCSU Chem Eng
Lines: 36


In article <93y04m18d459@witsend.uucp>, "D. C. Sessions" <dcs@witsend.tnet.com> writes:

|>   Please clarify your standards for rules of engagement.  As I
|>   understand it, Israelis are at all times and under all
|>   circumstances fair targets.  Their opponents are legitimate
|>   targets only when Mirandized, or some such?
|> 
|>   I'm sure that this makes perfect sense if you grant *a*priori*
|>   that Israelis are the Black Hats, and that therefore killing
|>   them is automatically a Good Thing (Go Hezbollah!).  The
|>   corollary is that the Hezbollah are the White Hats, and that
|>   whatever they do is a Good Thing, and the Israelis only prove
|>   themselves to be Bad Guys by attacking them.
|> 
|>   This sounds suspiciously like a hockey fan I know, who cheers
|>   when one of the players on His Team uses his stic

In [4]:
from spacy import displacy

# you can also nicely visualize the named entities
displacy.serve(doc, style="ent")




Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


## Rule-based relation extraction

First, we'll examine some **rule-based matching**. Let's do one simple pattern that extracts is-a/type-of relations between entities: "X such as Y"


In [21]:
import spacy
from spacy.matcher import Matcher 

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

pattern_suchas1 = [[{'POS':'NOUN'}, {'TEXT': 'such'}, {'TEXT': 'as'}, {'POS': 'PROPN'}]]
pattern_suchas2 = [[{'POS':'NOUN'}, {'TEXT': 'such'}, {'TEXT': 'as'}, {'ENT_TYPE': "ORG"}]] # 383 is the integer code for "ORG", 384 for "GPE"
                       
text = "EU plan to copy Australia and make big tech companies such as Google pay for news. In countries such as Australia, the tech giants such as Google already have to pay more."

In [22]:
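def find_print_matches(pattern, text):
    # use a fresh Matcher on every call, so that patterns added in earlier calls
    # under the same key cannot leak into this one
    matcher = Matcher(nlp.vocab)
    matcher.add("m", pattern)
    doc = nlp(text)

    # each match is a tuple (match_id, start, end) with token offsets into doc
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)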

In [23]:
find_print_matches(pattern_suchas2, text)

companies such as Google
giants such as Google
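
The matches above are still plain text spans. To arrive at an actual is-a relation, each match has to be split into the instance (Y) and its type (X). The sketch below illustrates that step with a regular expression as a rough, model-free stand-in for the token pattern; it is not SpaCy's matcher, it approximates "noun" by a lowercase word and "proper noun" by a capitalized word, and `extract_is_a` is a hypothetical helper.

```python
import re

# Rough approximation of the token pattern: lowercase word, "such as", capitalized word
SUCH_AS = re.compile(r"\b([a-z]+) such as ([A-Z]\w*)")

def extract_is_a(text):
    """Return (instance, type) pairs for every 'X such as Y' occurrence."""
    return [(m.group(2), m.group(1)) for m in SUCH_AS.finditer(text)]

text = ("EU plan to copy Australia and make big tech companies such as Google pay for news. "
        "In countries such as Australia, the tech giants such as Google already have to pay more.")

for instance, type_ in extract_is_a(text):
    print(instance, "is-a", type_)
```

Unlike `pattern_suchas2`, this approximation has no notion of entity types, so it also returns the pair for "countries such as Australia".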
