In this notebook, we will see how to load and tokenize a text corpus, build a term-document matrix and reduce it with SVD (LSA), work with pretrained word and document embeddings, generate word clouds, run POS-tagging, dependency parsing, and named entity recognition, and write rule-based matching patterns.
We will be loading a corpus of Amazon reviews for the products in the category "Electronics".
# the codecs library helps with reading from (or writing to) files with different encodings (e.g., UTF-8, ASCII, etc.)
import codecs
# specifying the location of the file containing our reviews
filepath = "unlabeled_reviews.txt"
# loading the content of the file and creating a list of texts from its lines (one review per line)
with codecs.open(filepath, "r", encoding = 'utf8', errors = 'replace') as f:
    rows = f.readlines()
dataset = [l.strip() for l in rows]
Before doing anything else, let's just look at some statistics and examples.
# let's see how many reviews we have in our dataset
print("We have: " + str(len(dataset)) + " reviews in the dataset.")
# the first review in the dataset
print(dataset[0])
# let's do some tokenization
from spacy.lang.en import English
nlp = English()
# create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.tokenizer
# tokenize each of the rows
dataset_tokenized = [tokenizer(d) for d in dataset]
# the third token of the first tokenized review
print(dataset_tokenized[0][2])
# all tokens of the first review
token_list = list(dataset_tokenized[0])
token_list
# review lengths in characters and in tokens
length_characters = [len(d) for d in dataset]
length_words = [len(d) for d in dataset_tokenized]
print(max(length_characters))
print(max(length_words))
import matplotlib.pyplot as plt
# histogram of review lengths (in tokens)
plt.hist(length_words, bins = 10)
plt.show()
We will compute a term-document matrix: a matrix whose rows correspond to terms, whose columns correspond to documents, and whose element at position (t, d) is 1 if the document in column d contains the term in row t, and 0 otherwise. The matrix will be displayed as a pandas data frame so that the term and document labels of the rows and columns are easy to inspect.
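For intuition, here is a tiny hand-built illustration (a hypothetical two-document corpus, not part of our review data):
# toy illustration: two documents and their binary term-document matrix
toy_docs = ["the cat sat", "the dog sat down"]
toy_terms = ["cat", "dog", "sat", "down"]
# rows = terms, columns = documents; 1 if the term occurs in the document
toy_matrix = [[1 if t in doc.split() else 0 for doc in toy_docs] for t in toy_terms]
print(toy_matrix)  # [[1, 0], [0, 1], [1, 1], [0, 1]]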
# for a faster demonstration (and to avoid running out of memory on Colab), let's only look at the first 100 documents
dataset_tokenized = dataset_tokenized[:100]
import pandas as pd
from scipy.sparse import lil_matrix
from gensim.parsing.preprocessing import STOPWORDS
remove_list = list(STOPWORDS) + [",", ".", ":", "*", ";", "?", "!", "-"]
# let's map each term to the list of documents (their indices) in which it appears
d = {}
for j, tokens in enumerate(dataset_tokenized):
    for t in tokens:
        t = str(t).lower()
        if t not in remove_list:
            # setdefault() returns the value stored under the given key;
            # if the key does not exist, it first inserts the key with the
            # specified default value (see the short example after this loop)
            d.setdefault(t, []).append(j)
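A quick toy illustration of setdefault (unrelated to our data):
example = {}
example.setdefault("a", []).append(1)  # "a" is missing: it is first inserted with the default []
example.setdefault("a", []).append(2)  # "a" exists: the stored list is returned and extended
print(example)  # {'a': [1, 2]}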
# build a sparse binary term-document matrix from the dictionary
A = lil_matrix((len(d.keys()), len(dataset_tokenized)), dtype=int)
for i, t in enumerate(d.keys()):
    for j in d[t]:
        A[i, j] = 1
A_df = pd.DataFrame(A.toarray(), index=d.keys(), columns=range(len(dataset_tokenized)))
A_df
Now we compute the SVD of the term-document matrix. This is for demonstration only: full_matrices=False gives the "thin" SVD, but for realistically large matrices you would compute a truncated SVD instead (see the sketch further below).
from scipy.linalg import svd
u, s, vt = svd(A.toarray(), full_matrices = False)
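As a quick sanity check (a minimal sketch), the three factors should reconstruct the original matrix:
import numpy as np
# u @ diag(s) @ vt should reproduce A up to floating-point error
print(np.allclose(u @ np.diag(s) @ vt, A.toarray()))  # True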
Let's look at the obtained matrices ...
u.shape
The index of our term-document data frame contains the words in our vocabulary; a word's position in this index corresponds to its row in the matrix u.
ind = 40
A_df.index[ind]
u[ind]
We reduce the dimensionality by keeping only the first k columns of u, which correspond to the k largest singular values.
k = 10
u_k = u[:,:k]
u_k[ind]
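For larger matrices, computing the full SVD is wasteful; a truncated SVD computes only the top k components directly. A minimal sketch using scipy's svds (assuming the same matrix A as above):
from scipy.sparse.linalg import svds
# svds works directly on sparse matrices and returns only the k largest singular values/vectors
# (note: it returns the singular values in ascending order)
u_trunc, s_trunc, vt_trunc = svds(A.tocsc().astype(float), k=10)
print(u_trunc.shape)  # (vocabulary size, 10)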
We can use the obtained dense vectors for similarity comparisons.
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(u_k, u_k)
similarity_matrix
import numpy as np
# argsort returns indices in ascending order of similarity, so the most similar terms come last
np.argsort(similarity_matrix[ind])
similarity_matrix[ind][4]
A_df.index[4]
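To make such lookups more convenient, here is a small hypothetical helper (a sketch, assuming the similarity_matrix and A_df defined above) that returns the k terms most similar to the term in a given row:
def most_similar_terms(term_index, k=5):
    # argsort is ascending; the last entry is the term itself (similarity 1.0), so we skip it
    top = np.argsort(similarity_matrix[term_index])[-(k + 1):-1]
    return [A_df.index[i] for i in reversed(top)]
most_similar_terms(ind)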
SpaCy also allows you to easily load and use pretrained Word2Vec-style embeddings. These vectors capture the meaning of words better than LSA. In order to load the embeddings, you need to use either the medium-sized or large SpaCy models (ending with "md" or "lg"); the small models (ending with "sm") do not include word embeddings.
import spacy
# we're going to have to first download the spacy model "en_core_web_md"
# python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')
# Let's analyse word similarities based on word embeddings
word1 = "black"
word2 = "white"
d1 = nlp(word1)
d2 = nlp(word2)
print("Vector of " + word1 + ":")
#print(d1.vector.shape)
#print(d1.vector)
print("Vector of " + word2 + ":")
#print(d2.vector)
sim = d1.similarity(d2)
print()
print("Semantic similarity between " + word1 + " and " + word2 + ": " + str(sim))
SpaCy computes document embeddings as averages of the embeddings of the document's words. Let us compute the vectors for all reviews in our collection.
import numpy as np
# divide each row of the matrix by its L2 norm
def mat_norm(mat, norm_order=2, axis=1):
    return mat / np.transpose([np.linalg.norm(mat, norm_order, axis)])
review_vectors = []
print(len(dataset))
for doc in nlp.pipe(dataset[:1000], disable=["tagger", "parser", "ner"]):
    review_vectors.append(doc.vector)
print(len(review_vectors))
print(review_vectors[0].shape)
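As a quick sanity check (a minimal sketch), a document vector should be (approximately) the mean of its token vectors:
test_doc = nlp("a small test")
token_mean = np.mean([t.vector for t in test_doc], axis=0)
print(np.allclose(test_doc.vector, token_mean))  # expected: True, up to floating-point error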
# let us stack these vectors into a matrix, so that we can later easily compute semantic similarities for all pairs of reviews
reviews_matrix = np.array(review_vectors)
print(reviews_matrix.shape)
# fast way of computing cosines for all pairs of vectors:
# each vector in "reviews_matrix" is first divided by its L2 norm (function mat_norm)
rev_mat_norm = mat_norm(reviews_matrix)
# the cosines are then simply dot products between the normalized vectors:
# the matrix multiplied with its own transpose
similarities = np.matmul(rev_mat_norm, np.transpose(rev_mat_norm))
print(similarities.shape)
print(similarities[:50, :50])
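As a sanity check (a minimal sketch), the same matrix can be obtained with sklearn's cosine_similarity, which we already imported above:
print(np.allclose(similarities, cosine_similarity(reviews_matrix)))  # expected: True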
Let's find the pairs of documents with the largest similarities (i.e., the largest scores in the "similarities" matrix).
def sort_values_by_indices(mat):
    # the values of the matrix, sorted ascending (flattened)
    print(np.sort(mat, axis=None))
    # the flat indices that would sort the matrix
    print(np.argsort(mat, axis=None))
    # unravel_index converts flat indices into a tuple of coordinate arrays
    print(np.unravel_index(np.argsort(mat, axis=None), mat.shape))
    # dstack pairs up the row and column coordinates
    print(np.dstack(np.unravel_index(np.argsort(mat, axis=None), mat.shape)))
    return np.dstack(np.unravel_index(np.argsort(mat, axis=None), mat.shape))[0]
sorted_similarities = sort_values_by_indices(similarities)
sorted_similarities
# the similarity matrix is symmetric: its 1000 diagonal entries (each document with itself,
# similarity 1.0) top the ranking, and every other pair appears twice, as (i, j) and (j, i);
# we skip the self-pairs at the top of the descending list
most_similar = list(reversed(sorted_similarities))[1000:]
most_similar
Word/document embeddings allow us to detect similarities between documents even if they do not use the same words. This is semantic similarity (as opposed to mere word overlap).
print(similarities[46][711])
print(dataset[46])
print()
print(dataset[711])
Parsing and part-of-speech-tagging
a = ["data", "science", "digital", "humanities"]
"###".join(a)
# wordcloud is a Python library for generating word clouds from text
from wordcloud import WordCloud
# matplotlib is a Python library for plotting / rendering data and graphs of all kinds
import matplotlib.pyplot as plt
# concatenating all our reviews into one long text
big_text = " ".join(dataset)
# creating a WordCloud object with a white background and generating the cloud from our big text
wcloud = WordCloud(collocations = False, background_color = "white").generate_from_text(big_text)
# setting the size of the figure for the wordcloud plot
plt.figure(figsize=(8,7))
# plotting the created wordcloud
plt.imshow(wcloud, interpolation = "bilinear")
# we don't want to plot the axis
plt.axis("off")
# showing the plot
plt.show()
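If the cloud is dominated by very frequent function words, WordCloud's stopwords parameter can filter them out. A minimal sketch, reusing the gensim stopword list imported earlier:
wcloud_filtered = WordCloud(collocations = False, background_color = "white",
                            stopwords = set(STOPWORDS)).generate_from_text(big_text)
plt.figure(figsize=(8,7))
plt.imshow(wcloud_filtered, interpolation = "bilinear")
plt.axis("off")
plt.show()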
SpaCy has pre-trained models for POS-tagging and dependency parsing
# parsing and pos-tagging may take longer on large datasets, so let's only demonstrate the functionality on first N reviews in our collection
N = 50
small_dataset = dataset[:N]
for doc in nlp.pipe(small_dataset, disable=["ner"]):
    # POS-tagging and parsing are performed on the sentence level (not the document level, like tokenization)
    for sent in doc.sents:
        print(sent.text)
        print("------------------------------------")
        for tok in sent:
            # tok.i is the index of the token in the document
            # sent.start is the index of the sentence's first token in the document
            # tok.tag_ is the fine-grained POS-tag
            # tok.pos_ is the coarse-grained POS-tag
            # tok.dep_ is the dependency relation from the governing token to this token
            # tok.head is the token that is the syntactic head of the current token (.text is its text, .i its index in the document)
            print(tok.i - sent.start, tok.text, tok.tag_, tok.pos_, tok.dep_, tok.head.text, "(" + str(tok.head.i - sent.start) + ")")
        print()
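As a small follow-up (a sketch over the same small_dataset), we can count how often each coarse-grained POS-tag occurs:
from collections import Counter
pos_counts = Counter()
for doc in nlp.pipe(small_dataset, disable=["ner"]):
    pos_counts.update(tok.pos_ for tok in doc)
print(pos_counts.most_common(5))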
# displacy, the visualization sub-module of SpaCy, lets us visualize the dependency parse of a sentence
from spacy import displacy
sentence = "First of all, why do you want a power UPS?"
sent = nlp(sentence)
displacy.serve(sent, style = "dep")
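Note that displacy.serve starts a local web server; inside a Jupyter notebook, displacy.render draws the parse inline instead:
# inline alternative for notebooks
displacy.render(sent, style = "dep")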
# loading the texts from an existing dataset, the famous "20 news groups"
from sklearn.datasets import fetch_20newsgroups
# filtering news from the categories "talk.politics.misc" and "talk.politics.mideast"
ng_data = fetch_20newsgroups(subset='train', categories = ["talk.politics.misc", "talk.politics.mideast"])
print(len(ng_data.data))
num_docs = 100
texts = ng_data.data[: num_docs]
1029
# Let's go back to our friend SpaCy (and its visualization sub-library *displacy*)
import spacy
# the default nlp pipeline will perform tokenization, POS-tagging, dependency parsing, and named entity recognition
nlp = spacy.load("en_core_web_sm")
ind = 7
text = texts[ind]
# running the full spacy pipeline (tokenization, tagging, parsing, and NER) on one of the newsgroup texts
doc = nlp(text)
print(text)
print()
print("Entities: \n--------------------------------")
# you can see the meaning of NER labels from SpaCy here: https://spacy.io/api/annotation#named-entities
for ent in doc.ents:
print(ent.text + " (" + ent.label_ + "); start: character " + str(ent.start_char) + "; end: character " + str(ent.end_char))
From: helfman@aero.org (Robert S. Helfman)
Subject: Re: Clinton's Wiretapping Initiative
Organization: The Aerospace Corporation, El Segundo, CA
Lines: 22
NNTP-Posting-Host: aerospace.aero.org
In article <9304161803.AA23713@inet-gw-2.pa.dec.com> blh@uiboise.idbsu.edu (Broward L. Horne) writes:
>
> If you look through this newsgroup, you should be
> able to find Clinton's proposed "Wiretapping" Initiative
^^^^^^^^^
> for our computer networks and telephone systems.
>
> This 'initiative" has been up before Congress for at least
> the past 6 months, in the guise of the "FBI Wiretapping"
^^^^^^^^^^^^^^^^^
> bill.
What kind of brainless clod posted the above garbage? Would they be
so kind as to explain how this is "Clinton's" initiative, when it
has been before Congress for "at least the past 6 months"?
Jeez, the next thing you know, they'll be blaming the weather on the
poor guy. They'll be blaming World War II on him. They'll be blaming
the Civil War on him. Maybe the Thirty Years War?
Entities:
--------------------------------
Robert S. Helfman (PERSON); start: character 24; end: character 41
Clinton (PERSON); start: character 56; end: character 63
Wiretapping Initiative
Organization: (ORG); start: character 66; end: character 102
The Aerospace Corporation (ORG); start: character 103; end: character 128
El Segundo (GPE); start: character 130; end: character 140
CA
Lines (ORG); start: character 142; end: character 150
22 (CARDINAL); start: character 152; end: character 154
NNTP-Posting-Host (ORG); start: character 155; end: character 172
Broward L. Horne (PERSON); start: character 270; end: character 286
Clinton (PERSON); start: character 366; end: character 373
Wiretapping" Initiative
^^^^^^^^^
> (WORK_OF_ART); start: character 386; end: character 442
Congress (ORG); start: character 533; end: character 541
the past 6 months (DATE); start: character 557; end: character 574
FBI Wiretapping (WORK_OF_ART); start: character 597; end: character 612
Clinton (PERSON); start: character 752; end: character 759
Congress (ORG); start: character 799; end: character 807
the past 6 months (DATE); start: character 822; end: character 839
Jeez (PERSON); start: character 843; end: character 847
World War II (EVENT); start: character 941; end: character 953
the Civil War (EVENT); start: character 981; end: character 994
the Thirty Years War (EVENT); start: character 1009; end: character 1029
from spacy import displacy
# you can also nicely visualize the named entities
displacy.render(doc, style="ent")
First, we'll examine some rule-based matching. Let's do one simple pattern that extracts is-a/type-of relations between entities: "X such as Y"
import spacy
from spacy.matcher import Matcher
#nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern_suchas1 = [[{'POS':'ADJ'}, {'POS':'NOUN'}, {'POS':'NOUN'}, {'TEXT': 'such'}, {'TEXT': 'as'}, {'POS': 'PROPN'}]]
pattern_suchas2 = [[{'POS':'NOUN'}, {'TEXT': 'such'}, {'TEXT': 'as'}, {'ENT_TYPE': "ORG"}]]
text = "EU plan to copy Australia and make big tech companies such as Google pay for news. In countries such as Australia, the tech giants such as Google already have to pay more."
def find_print_matches(pattern, text):
    matcher.add("m", pattern)
    doc = nlp(text)
    matches = matcher(doc)
    for m in matches:
        # each match is a triple (match_id, start_token, end_token)
        span = doc[m[1]:m[2]]
        print(span.text)
find_print_matches(pattern_suchas1, text)
big tech companies such as Google
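The second pattern can be tried the same way. Note that find_print_matches keeps adding patterns under the same key "m", so the patterns accumulate in the matcher; this one should match spans like "companies such as Google":
find_print_matches(pattern_suchas2, text)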