After this session you'll know about:
Scikit-Learn
For this session, we'll assemble a dataset based on Wikipedia articles using our scraper from session 4.
We download some articles about Biology and Computer Science like so:
import pandas as pd
from wiki import build_wiki_dataset
articles_bio = build_wiki_dataset("biology", max_results=50)
articles_bio["topic"] = "biology" # Add a label describing the topic
articles_cs = build_wiki_dataset("computer science", max_results=50)
articles_cs["topic"] = "cs"
articles_dataset = pd.concat((articles_bio, articles_cs))
articles_dataset.to_csv("bio-cs-wiki-dataset.csv", index=False)
Now that we have saved our dataset to disk, there is no need to repeat the time-consuming scraping each time we execute the notebook.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# In this notebook we use some methods that rely on random initialization,
# so we fix the random seed, to get reproducible results.
from random import seed
np.random.seed(42)
seed(42)
articles_dataset = pd.read_csv("bio-cs-wiki-dataset.csv")
articles_dataset
 | url | title | summary | text | num_views | num_edits | categories | revision_id | topic |
---|---|---|---|---|---|---|---|---|---|
0 | https://en.wikipedia.org/wiki/Geology | Geology | Geology (from Ancient Greek γῆ (gê) 'earth', ... | Geology (from Ancient Greek γῆ (gê) 'earth', ... | 4234884 | 3285 | All articles with failed verification, Article... | 1127436018 | biology |
1 | https://en.wikipedia.org/wiki/Molecular_biology | Molecular biology | Molecular biology is the branch of biology th... | Molecular biology is the branch of biology th... | 2143110 | 1453 | Articles with BNE identifiers, Articles with B... | 1124222343 | biology |
2 | https://en.wikipedia.org/wiki/Family_(biology) | Family (biology) | Family (Latin: familia, plural familiae) is on... | Family (Latin: familia, plural familiae) is on... | 2773053 | 921 | Articles containing French-language text, Arti... | 1133677852 | biology |
3 | https://en.wikipedia.org/wiki/Domain_(biology) | Domain (biology) | In biological taxonomy, a domain ( or ) (Latin... | In biological taxonomy, a domain ( or ) (Latin... | 2315088 | 783 | Articles with short description, Domains (biol... | 1125372452 | biology |
4 | https://en.wikipedia.org/wiki/Taxonomy_(biology) | Taxonomy (biology) | In biology, taxonomy (from Ancient Greek τάξι... | In biology, taxonomy (from Ancient Greek τάξι... | 9881050 | 3868 | All articles lacking reliable references, All ... | 1133363688 | biology |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
88 | https://en.wikipedia.org/wiki/Session_(compute... | Session (computer science) | In computer science and networking in particul... | In computer science and networking in particul... | 688210 | 294 | All articles needing additional references, Ar... | 1132147435 | cs |
89 | https://en.wikipedia.org/wiki/Assignment_(comp... | Assignment (computer science) | In computer programming, an assignment stateme... | In computer programming, an assignment stateme... | 475319 | 527 | Articles with example C code, Articles with sh... | 1132355460 | cs |
90 | https://en.wikipedia.org/wiki/Variable_(comput... | Variable (computer science) | In computer programming, a variable is an abst... | In computer programming, a variable is an abst... | 936344 | 587 | All articles needing additional references, Ar... | 1125300855 | cs |
91 | https://en.wikipedia.org/wiki/Side_effect_(com... | Side effect (computer science) | In computer science, an operation, function or... | In computer science, an operation, function or... | 584658 | 259 | Articles with example code, Articles with shor... | 1104967033 | cs |
92 | https://en.wikipedia.org/wiki/Philosophy_of_co... | Philosophy of computer science | The philosophy of computer science is concerne... | The philosophy of computer science is concerne... | 203337 | 150 | Articles with short description, Philosophy of... | 1114652614 | cs |
93 rows × 9 columns
To encode our texts into numbers, we use the CountVectorizer class offered by Scikit-Learn (more on that library later).
It works similarly to the BagOfWord-encoder class you already implemented as homework, but comes with additional useful features.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(
max_features=500, # We only want to use the 500 most frequent terms
stop_words="english", # Filter out English stopwords (i.e., non-content function words like "and", "the", ...)
lowercase=True, # Convert all characters to lowercase
ngram_range=(1, 3) # We not only want to count single words but also word-level n-grams up to a length of 3
)
freqs = cv.fit_transform(articles_dataset.text)
freqs = freqs.todense() # For educational purposes, we convert the matrix of word counts from sparse into dense format.
freqs
matrix([[ 0, 4, 0, ..., 3, 0, 10], [ 2, 0, 0, ..., 0, 0, 0], [ 0, 0, 0, ..., 0, 0, 0], ..., [ 0, 1, 3, ..., 0, 0, 0], [ 0, 0, 0, ..., 0, 0, 0], [ 1, 1, 1, ..., 1, 0, 0]], dtype=int64)
#print(cv.vocabulary_)
#articles_dataset.text[0]
freqs.shape
(93, 500)
To make the texts comparable regardless of their individual length, we compute the relative word frequencies by dividing each row by its row sum:
freqs = freqs / freqs.sum(axis=1).reshape(-1, 1)
print(freqs)
np.allclose(freqs.sum(axis=1), 1)
#print(np.allclose(freqs.sum(axis=1), 1))
[[0. 0.00343643 0. ... 0.00257732 0. 0.00859107] [0.00242718 0. 0. ... 0. 0. 0. ] [0. 0. 0. ... 0. 0. 0. ] ... [0. 0.00125313 0.0037594 ... 0. 0. 0. ] [0. 0. 0. ... 0. 0. 0. ] [0.00564972 0.00564972 0.00564972 ... 0.00564972 0. 0. ]]
True
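The row-wise normalization can be checked by hand on a small example. The sketch below uses invented counts (`toy_counts` is a hypothetical array, not our real matrix) for a short text and a text that is ten times longer but uses the terms in the same proportions.

```python
import numpy as np

# Hypothetical raw counts: two "texts" of different length, three terms each.
toy_counts = np.array([[2, 1, 1],
                       [20, 10, 10]], dtype=float)

# Sum each row; keepdims=True keeps the shape (2, 1) so that
# broadcasting divides every row by its own total.
row_totals = toy_counts.sum(axis=1, keepdims=True)
toy_rel = toy_counts / row_totals

print(toy_rel)                              # both rows become [0.5, 0.25, 0.25]
print(np.allclose(toy_rel.sum(axis=1), 1))  # True
```

After normalization both rows are identical, which is exactly the length independence we are after. One caveat: a text that contains none of the selected terms has a row total of 0, and the division would then produce NaN values.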
As we've already learned, our document-term matrix now consists of 93 rows because our dataset contains 93 texts, and each row has 500 entries (=> columns in the matrix), each one representing the frequency of one of the 500 most frequent words (or n-grams) of the corpus in this text.
If we slice a row out of the matrix, we'll get a vector:
freqs.shape
(93, 500)
In essence, you can think of a vector as a list of numbers where each entry in the list is called a dimension.
The total number of entries in the vector determines its dimensionality.
So text at index 23 (like all the other texts) is now described using a 500-dimensional vector.
As you might have learned in school, a great way to imagine vectors is to think of them as points in space.
Because the space our text vectors reside in is 500-dimensional, it's impossible to picture them visually, but luckily the same things that apply to points in a two- or three-dimensional space also apply to any other n-dimensional space.
ind = 23
print(freqs[ind])
print()
print(freqs[0])
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.00680272 0. 0. 0. 0. 0. 0. 0.01360544 0.00680272 0. 0. 0. 0.00680272 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.01360544 0.02721088 0. 0. 0.00680272 0. 0.02040816 0. 0. 0. 0. 0. 0.02040816 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.01360544 0. 0. 0. 0. 0. 0. 0.00680272 0.00680272 0. 0. 0.00680272 0. 0. 0. 0. 0. 0. 0. 0. 0.00680272 0. 0.00680272 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.00680272 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.00680272 0.00680272 0. 0. 0. 0. 0.00680272 0. 0.00680272 0. 0. 0.00680272 0. 0. 0. 0.00680272 0.00680272 0. 0. 0. 0. 0.00680272 0. 0. 0. 0. 0. 0. 0. 0.02040816 0. 0. 0. 0. 0. 0. 0.02040816 0.00680272 0. 0. 0.00680272 0. 0.04081633 0.00680272 0. 0. 0. 0. 0.03401361 0. 0.00680272 0.00680272 0. 0.02721088 0.00680272 0. 0. 0. 0. 0. 0.00680272 0. 0. 0.00680272 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.00680272 0. 0.00680272 0. 0. 0. 0. 0. 0. 0. 0.02040816 0. 0. 0.01360544 0. 0.01360544 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.01360544 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.00680272 0. 0. 0. 0. 0. 0. 0. 0. 0.00680272 0. 0.00680272 0.00680272 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.00680272 0. 0. 0. 0. 0. 0. 0. 0.01360544 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.00680272 0. 0.00680272 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.02721088 0.02040816 0.00680272 0. 0. 0. 0. 0. 0. 0.01360544 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.00680272 0. 0.00680272 0.00680272 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.00680272 0. 0.00680272 0.02040816 0. 0. 0. 0. 0. 0.00680272 0. 0. 0. 0. 0.00680272 0. 0. 0. 0. 0.00680272 0. 0. 0.00680272 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.01360544 0. 0. 0.00680272 0.00680272 0. 0. 0. 0. 0. 0. 0. 0. 0.05442177 0.00680272 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.00680272 0.06802721 0.02040816 0. 0.03401361 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.01360544 0.00680272 0. 0. 0. 
0.00680272 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.00680272 0. 0. 0. 0. 0. 0.00680272 0.00680272 0. 0. 0. 0. 0.00680272 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.01360544 0. 0. 0. 0. ]] [[0. 0.00343643 0. 0. 0. 0.00085911 0. 0. 0. 0.00085911 0.00085911 0.00085911 0.00343643 0. 0. 0. 0.00085911 0.00171821 0. 0.00429553 0.00085911 0.00171821 0.00085911 0.00171821 0.00257732 0. 0. 0. 0. 0.00343643 0.00257732 0. 0. 0. 0.00257732 0. 0. 0. 0.00429553 0. 0. 0. 0. 0. 0. 0.00085911 0. 0. 0. 0.00085911 0. 0.00171821 0.00515464 0. 0. 0. 0.00085911 0. 0. 0. 0.00601375 0. 0.01030928 0.00687285 0. 0. 0.00515464 0. 0. 0.00085911 0. 0. 0. 0. 0. 0.00085911 0.00171821 0. 0. 0. 0.00171821 0. 0.00085911 0. 0. 0.00171821 0. 0. 0. 0. 0.00171821 0.00171821 0. 0. 0. 0.00257732 0.00171821 0. 0. 0.00171821 0. 0.00085911 0.00085911 0.00773196 0. 0. 0. 0. 0.00085911 0.00085911 0. 0.00085911 0. 0.00687285 0.00343643 0.00429553 0.00429553 0. 0.00085911 0. 0.00171821 0. 0. 0. 0. 0.00085911 0.00257732 0.0532646 0. 0. 0. 0. 0.00343643 0.00343643 0. 0.00773196 0.00171821 0.00257732 0.00171821 0. 0.00171821 0. 0.00515464 0.00515464 0.00171821 0.00343643 0.00257732 0. 0.00171821 0. 0.00085911 0.00085911 0.00171821 0. 0.00085911 0. 0.00515464 0.00773196 0.00171821 0. 0.00171821 0. 0.00343643 0. 0.01030928 0. 0. 0. 0. 0. 0.00085911 0. 0. 0. 0.00085911 0. 0. 0. 0. 0. 0. 0. 0.04295533 0.04037801 0.00171821 0.00171821 0. 0. 0.00171821 0. 0. 0.00343643 0.00171821 0.01116838 0. 0. 0. 0. 0. 0. 0. 0.00515464 0.00515464 0.00171821 0.00085911 0.00429553 0.00171821 0.00515464 0. 0. 0.00085911 0. 0. 0. 0.00085911 0. 0.00171821 0.00171821 0.00085911 0. 0.00171821 0. 0. 0. 0. 0.00171821 0. 0.00085911 0.00687285 0.00515464 0.00085911 0. 0.00257732 0. 0.00343643 0.00515464 0. 0. 0.00085911 0.00171821 0. 0. 0.00773196 0.00171821 0.00085911 0.00085911 0. 0. 0. 0. 0.00257732 0. 0. 0. 0.00171821 0. 0. 0. 0. 0. 0.00515464 0. 0.00085911 0. 0.00171821 0.00945017 0. 0. 0.00171821 0.00085911 0. 0. 0. 0. 0. 0. 
0.01546392 0.00257732 0.00257732 0.00171821 0.00687285 0. 0. 0. 0. 0. 0.00085911 0.01030928 0. 0. 0. 0. 0. 0.00773196 0. 0. 0.00085911 0. 0.00171821 0.00085911 0. 0. 0.00085911 0.00773196 0.00429553 0.00257732 0. 0. 0. 0. 0.00171821 0. 0.00257732 0. 0. 0.00171821 0. 0.00085911 0.00343643 0.00429553 0. 0.00085911 0.00085911 0. 0. 0.00257732 0. 0. 0.00515464 0. 0. 0.00171821 0. 0. 0. 0.00085911 0.00085911 0.00687285 0.00601375 0.00859107 0. 0.00171821 0.00171821 0.01632302 0. 0.00085911 0.00085911 0. 0. 0. 0. 0. 0.00085911 0.00085911 0.00429553 0.00085911 0. 0. 0. 0. 0.00601375 0.00257732 0.00343643 0. 0.00085911 0.00257732 0. 0.00171821 0. 0. 0. 0. 0.00085911 0.00085911 0.00085911 0.00343643 0.00429553 0. 0. 0. 0. 0.00343643 0. 0. 0.00429553 0. 0.00601375 0.00257732 0. 0.05154639 0.03350515 0.00085911 0. 0.00343643 0.00687285 0.00171821 0.00085911 0.00171821 0. 0.00257732 0.00085911 0.00171821 0. 0. 0. 0.00085911 0.00343643 0.00257732 0. 0. 0. 0. 0. 0. 0.00171821 0.00171821 0. 0.00085911 0.00343643 0. 0.00085911 0. 0.00429553 0. 0. 0.00343643 0.00343643 0. 0.00085911 0.00171821 0. 0. 0.00085911 0. 0. 0.00945017 0. 0. 0. 0.00085911 0. 0. 0.00515464 0.00773196 0.00429553 0.00601375 0.01546392 0. 0.01116838 0. 0. 0. 0.00085911 0. 0. 0.00257732 0. 0.00085911 0.00257732 0. 0. 0.01116838 0.01804124 0.00257732 0. 0. 0. 0. 0. 0. 0. 0.00515464 0.00171821 0.00687285 0.00343643 0.00171821 0.02319588 0. 0.01460481 0.01546392 0.00171821 0. 0.00171821 0.00429553 0. 0. 0. 0. 0. 0.00257732 0.00343643 0.00343643 0.00343643 0.00171821 0. 0.00343643 0.00601375 0.00257732 0. 0.00859107]]
Now that we've converted our texts into vector representations we can leverage a huge set of vector-based operations to investigate them.
The most fundamental operation to compare vectors (aka points in space) is to measure the distance between them.
There are several measures to quantify the distance between two points, with the most intuitive being the Euclidean distance:
$$ d_E(q, p) = \sqrt{\sum_{i=1}^{N} (q_i - p_i)^2} $$
The Euclidean distance measures the length of a straight line drawn from point $q$ to point $p$.
But there are additional metrics that, depending on the vector space they operate on, might bear other notions of distance:
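As a quick sanity check of the formula, here is a small sketch with two made-up three-dimensional points (`q` and `p` are illustrative values, not rows of our matrix). It computes the distance once term by term and once with NumPy's built-in norm.

```python
import numpy as np

# Two hypothetical points in three-dimensional space.
q = np.array([1.0, 2.0, 3.0])
p = np.array([4.0, 6.0, 3.0])

# Direct translation of the formula: square the differences, sum, take the root.
d_manual = np.sqrt(np.sum((q - p) ** 2))

# The same value as the length (L2 norm) of the difference vector.
d_norm = np.linalg.norm(q - p)

print(d_manual, d_norm)  # 5.0 5.0
```

The differences are (-3, -4, 0), so the distance is the familiar 3-4-5 triangle hypotenuse.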
https://miro.medium.com/max/1400/1*FTVRr_Wqz-3_k6Mk6G4kew.png
When operating on texts, the cosine similarity is useful because it is based on the angle between two vectors and is thus not affected by their magnitude, making it insensitive to differences in text length. Cosine similarity between two vectors is the cosine of the angle between them and is computed as follows:
$$ \mathit{cos}(q, p) = \frac{\sum_{i=1}^{N}{q_i \cdot p_i}}{\sqrt{\sum_{i=1}^{N}{q_i^2}} \cdot \sqrt{\sum_{i=1}^{N}{p_i^2}}} $$
Cosine distance is then computed as 1 minus the cosine similarity:
$$ d_C(q, p) = 1 - \mathit{cos}(q, p) $$
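The following sketch illustrates the insensitivity to magnitude with invented vectors. The helper `cosine_similarity` and the vectors `short_text`, `long_text` and `other_text` are hypothetical names introduced only for this example.

```python
import numpy as np

def cosine_similarity(q, p):
    """Dot product divided by the product of the two vector lengths."""
    return np.dot(q, p) / (np.linalg.norm(q) * np.linalg.norm(p))

# Hypothetical count vectors: long_text uses the same terms as short_text
# in the same proportions, just ten times as often.
short_text = np.array([1.0, 2.0, 0.0])
long_text = 10 * short_text
other_text = np.array([0.0, 0.0, 3.0])  # shares no terms with the others

# Same direction: cosine distance is (approximately) 0 ...
d_cos_same = 1 - cosine_similarity(short_text, long_text)
# ... although the Euclidean distance between the two is large.
d_euc_same = np.linalg.norm(short_text - long_text)
# No shared terms: the vectors are orthogonal, cosine distance is 1.
d_cos_other = 1 - cosine_similarity(short_text, other_text)

print(d_cos_same, d_euc_same, d_cos_other)
```

For vectors without negative entries, such as word counts or relative frequencies, the cosine distance always lies between 0 (same direction) and 1 (no shared terms).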
Let's use the cosine distance to get a more global overview of our texts.
To do this, we create an (n_texts, n_texts) distance matrix containing the distance from each text to every other text.
Note that computing any kind of each-to-each relation or measure is generally expensive, because the number of pairs grows quadratically with the number of texts, and might require large amounts of time, memory, or computing power.
But because our dataset is fairly small, we won't run into any of those issues.
First, we leverage some scipy features to compute the raw distance matrix:
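from scipy.spatial.distance import pdist, squareform
# pdist returns the pairwise distances in condensed form; squareform expands them into a full (n_texts, n_texts) matrix.
# Passing the metric by name uses scipy's built-in cosine implementation instead of calling a Python function per pair.
distance_matrix = squareform(pdist(np.asarray(freqs), metric="cosine"))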
distance_matrix.shape
distance_matrix
array([[0. , 0.86718687, 0.86397228, ..., 0.9368274 , 0.94750318, 0.93840333], [0.86718687, 0. , 0.85416964, ..., 0.94105335, 0.93885318, 0.96700857], [0.86397228, 0.85416964, 0. , ..., 0.90603486, 0.93232172, 0.98615574], ..., [0.9368274 , 0.94105335, 0.90603486, ..., 0. , 0.76761092, 0.96092689], [0.94750318, 0.93885318, 0.93232172, ..., 0.76761092, 0. , 0.93418835], [0.93840333, 0.96700857, 0.98615574, ..., 0.96092689, 0.93418835, 0. ]])
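To see what `pdist` and `squareform` each contribute, here is a minimal sketch on four made-up two-dimensional vectors (`toy_vectors` is a hypothetical array used only for illustration).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Four hypothetical two-dimensional vectors.
toy_vectors = np.array([
    [1.0, 0.0],
    [2.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
n = len(toy_vectors)

# pdist computes each unordered pair once and returns a flat
# ("condensed") array with n * (n - 1) / 2 entries.
condensed = pdist(toy_vectors, metric="cosine")

# squareform expands this into a symmetric (n, n) matrix with zeros
# on the diagonal.
square = squareform(condensed)

print(condensed.shape)  # (6,)
print(square.shape)     # (4, 4)
print(square.round(3))
```

The condensed form is why the cost grows quadratically: 4 vectors yield 6 pairs, while our 93 texts already yield 93 * 92 / 2 = 4278 pairs.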
Additionally, we create a new DataFrame for the matrix and use the topic labels (biology, cs) as its row and column labels:
distance_matrix = pd.DataFrame(data=distance_matrix, columns=articles_dataset["topic"], index=articles_dataset["topic"])
fig, ax = plt.subplots(1, 1, figsize=(12, 12))
sns.heatmap(data=distance_matrix, ax=ax, xticklabels=True, yticklabels=True)
[Figure: heatmap of the pairwise cosine distance matrix; x-axis: topic, y-axis: topic]
Darker colors indicate a smaller distance between two points, so it becomes obvious that texts from within each category are closer to one another than to texts from the other category.
Beyond visualizing the distances graphically, we can also apply our descriptive statistics toolset to them:
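The visual impression can also be put into numbers by comparing the average distance within a topic to the average distance between topics. The sketch below does this on an invented 4 x 4 distance matrix; `toy_labels` and `toy_dist` are hypothetical stand-ins, so the printed values say nothing about our real dataset.

```python
import numpy as np

# Hypothetical labels and a hypothetical symmetric distance matrix.
toy_labels = np.array(["biology", "biology", "cs", "cs"])
toy_dist = np.array([
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.3],
    [0.8, 0.9, 0.3, 0.0],
])

# True where row text and column text carry the same topic label.
same_topic = toy_labels[:, None] == toy_labels[None, :]

# Only look above the diagonal, so each pair is counted once and the
# zero self-distances are left out.
upper = np.triu(np.ones_like(toy_dist, dtype=bool), k=1)

within = toy_dist[upper & same_topic]
between = toy_dist[upper & ~same_topic]

print(within.mean())   # mean distance of pairs with the same topic
print(between.mean())  # mean distance of pairs with different topics
```

If the within-topic mean is clearly smaller than the between-topic mean, the numbers support what the heatmap suggests.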
print(articles_dataset.text[0])
print("======================================")
print(articles_dataset.text[26])
Geology (from Ancient Greek γῆ (gê) 'earth', and λoγία (-logía) 'study of, discourse') is a branch of natural science concerned with Earth and other astronomical objects, the features or rocks of which it is composed, and the processes by which they change over time. Modern geology significantly overlaps all other Earth sciences, including hydrology, and so is treated as one major aspect of integrated Earth system science and planetary science. Geology describes the structure of the Earth on and beneath its surface, and the processes that have shaped that structure. It also provides tools to determine the relative and absolute ages of rocks found in a given location, and also to describe the histories of those rocks. By combining these tools, geologists are able to chronicle the geological history of the Earth as a whole, and also to demonstrate the age of the Earth. Geology provides the primary evidence for plate tectonics, the evolutionary history of life, and the Earth's past climates. Geologists broadly study the properties and processes of Earth and other terrestrial planets and predominantly solid planetary bodies. Geologists use a wide variety of methods to understand the Earth's structure and evolution, including field work, rock description, geophysical techniques, chemical analysis, physical experiments, and numerical modelling. In practical terms, geology is important for mineral and hydrocarbon exploration and exploitation, evaluating water resources, understanding natural hazards, the remediation of environmental problems, and providing insights into past climate change. Geology is a major academic discipline, and it is central to geological engineering and plays an important role in geotechnical engineering. == Geological material == The majority of geological data comes from research on solid Earth materials. Meteorites and other extraterrestrial natural materials are also studied by geological methods. 
=== Mineral === Minerals are natural occurring elements and compounds with a definite homogeneous chemical composition and ordered atomic composition. Each mineral has distinct physical properties, and there are many tests to determine each of them. The specimens can be tested for: Luster: Quality of light reflected from the surface of a mineral. Examples are metallic, pearly, waxy, dull. Color: Minerals are grouped by their color. Mostly diagnostic but impurities can change a mineral's color. Streak: Performed by scratching the sample on a porcelain plate. The color of the streak can help name the mineral. Hardness: The resistance of a mineral to scratching. Breakage pattern: A mineral can either show fracture or cleavage, the former being breakage of uneven surfaces, and the latter a breakage along closely spaced parallel planes. Specific gravity: the weight of a specific volume of a mineral. Effervescence: Involves dripping hydrochloric acid on the mineral to test for fizzing. Magnetism: Involves using a magnet to test for magnetism. Taste: Minerals can have a distinctive taste, such as halite (which tastes like table salt). === Rock === A rock is any naturally occurring solid mass or aggregate of minerals or mineraloids. Most research in geology is associated with the study of rocks, as they provide the primary record of the majority of the geological history of the Earth. There are three major types of rock: igneous, sedimentary, and metamorphic. The rock cycle illustrates the relationships among them (see diagram). When a rock solidifies or crystallizes from melt (magma or lava), it is an igneous rock. This rock can be weathered and eroded, then redeposited and lithified into a sedimentary rock. It can then be turned into a metamorphic rock by heat and pressure that change its mineral content, resulting in a characteristic fabric. All three types may melt again, and when this happens, new magma is formed, from which an igneous rock may once more solidify. 
Organic matter, such as coal, bitumen, oil and natural gas, is linked mainly to organic-rich sedimentary rocks. To study all three types of rock, geologists evaluate the minerals of which they are composed and their other physical properties, such as texture and fabric. === Unlithified material === Geologists also study unlithified materials (referred to as superficial deposits) that lie above the bedrock. This study is often known as Quaternary geology, after the Quaternary period of geologic history, which is the most recent period of geologic time. ==== Magma ==== Magma is the original unlithified source of all igneous rocks. The active flow of molten rock is closely studied in volcanology, and igneous petrology aims to determine the history of igneous rocks from their original molten source to their final crystallization. == Whole-Earth structure == === Plate tectonics === In the 1960s, it was discovered that the Earth's lithosphere, which includes the crust and rigid uppermost portion of the upper mantle, is separated into tectonic plates that move across the plastically deforming, solid, upper mantle, which is called the asthenosphere. This theory is supported by several types of observations, including seafloor spreading and the global distribution of mountain terrain and seismicity. There is an intimate coupling between the movement of the plates on the surface and the convection of the mantle (that is, the heat transfer caused by the slow movement of ductile mantle rock). Thus, oceanic plates and the adjoining mantle convection currents always move in the same direction – because the oceanic lithosphere is actually the rigid upper thermal boundary layer of the convecting mantle. This coupling between rigid plates moving on the surface of the Earth and the convecting mantle is called plate tectonics. The development of plate tectonics has provided a physical basis for many observations of the solid Earth. 
Long linear regions of geological features are explained as plate boundaries. For example: Mid-ocean ridges, high regions on the seafloor where hydrothermal vents and volcanoes exist, are seen as divergent boundaries, where two plates move apart. Arcs of volcanoes and earthquakes are theorized as convergent boundaries, where one plate subducts, or moves, under another.Transform boundaries, such as the San Andreas Fault system, resulted in widespread powerful earthquakes. Plate tectonics also has provided a mechanism for Alfred Wegener's theory of continental drift, in which the continents move across the surface of the Earth over geological time. They also provided a driving force for crustal deformation, and a new setting for the observations of structural geology. The power of the theory of plate tectonics lies in its ability to combine all of these observations into a single theory of how the lithosphere moves over the convecting mantle. === Earth structure === Advances in seismology, computer modeling, and mineralogy and crystallography at high temperatures and pressures give insights into the internal composition and structure of the Earth. Seismologists can use the arrival times of seismic waves to image the interior of the Earth. Early advances in this field showed the existence of a liquid outer core (where shear waves were not able to propagate) and a dense solid inner core. These advances led to the development of a layered model of the Earth, with a crust and lithosphere on top, the mantle below (separated within itself by seismic discontinuities at 410 and 660 kilometers), and the outer core and inner core below that. More recently, seismologists have been able to create detailed images of wave speeds inside the earth in the same way a doctor images a body in a CT scan. These images have led to a much more detailed view of the interior of the Earth, and have replaced the simplified layered model with a much more dynamic model. 
Mineralogists have been able to use the pressure and temperature data from the seismic and modeling studies alongside knowledge of the elemental composition of the Earth to reproduce these conditions in experimental settings and measure changes in crystal structure. These studies explain the chemical changes associated with the major seismic discontinuities in the mantle and show the crystallographic structures expected in the inner core of the Earth. == Geological time == The geological time scale encompasses the history of the Earth. It is bracketed at the earliest by the dates of the first Solar System material at 4.567 Ga (or 4.567 billion years ago) and the formation of the Earth at 4.54 Ga (4.54 billion years), which is the beginning of the informally recognized Hadean eon – a division of geological time. At the later end of the scale, it is marked by the present day (in the Holocene epoch). === Timescale of the Earth === The following five timelines show the geologic time scale to scale. The first shows the entire time from the formation of the Earth to the present, but this gives little space for the most recent eon. The second timeline shows an expanded view of the most recent eon. In a similar way, the most recent era is expanded in the third timeline, the most recent period is expanded in the fourth timeline, and the most recent epoch is expanded in the fifth timeline. === Important milestones on Earth === 4.567 Ga (gigaannum: billion years ago): Solar system formation 4.54 Ga: Accretion, or formation, of Earth c. 4 Ga: End of Late Heavy Bombardment, the first life c. 3.5 Ga: Start of photosynthesis c. 2.3 Ga: Oxygenated atmosphere, first snowball Earth 730–635 Ma (megaannum: million years ago): second snowball Earth 541 ± 0.3 Ma: Cambrian explosion – vast multiplication of hard-bodied life; first abundant fossils; start of the Paleozoic c. 
380 Ma: First vertebrate land animals 250 Ma: Permian-Triassic extinction – 90% of all land animals die; end of Paleozoic and beginning of Mesozoic 66 Ma: Cretaceous–Paleogene extinction – Dinosaurs die; end of Mesozoic and beginning of Cenozoic c. 7 Ma: First hominins appear 3.9 Ma: First Australopithecus, direct ancestor to modern Homo sapiens, appear 200 ka (kiloannum: thousand years ago): First modern Homo sapiens appear in East Africa === Timescale of the Moon === === Timescale of Mars === == Dating methods == === Relative dating === Methods for relative dating were developed when geology first emerged as a natural science. Geologists still use the following principles today as a means to provide information about geological history and the timing of geological events. The principle of uniformitarianism states that the geological processes observed in operation that modify the Earth's crust at present have worked in much the same way over geological time. A fundamental principle of geology advanced by the 18th-century Scottish physician and geologist James Hutton is that "the present is the key to the past." In Hutton's words: "the past history of our globe must be explained by what can be seen to be happening now."The principle of intrusive relationships concerns crosscutting intrusions. In geology, when an igneous intrusion cuts across a formation of sedimentary rock, it can be determined that the igneous intrusion is younger than the sedimentary rock. Different types of intrusions include stocks, laccoliths, batholiths, sills and dikes. The principle of cross-cutting relationships pertains to the formation of faults and the age of the sequences through which they cut. Faults are younger than the rocks they cut; accordingly, if a fault is found that penetrates some formations but not those on top of it, then the formations that were cut are older than the fault, and the ones that are not cut must be younger than the fault. 
Finding the key bed in these situations may help determine whether the fault is a normal fault or a thrust fault.The principle of inclusions and components states that, with sedimentary rocks, if inclusions (or clasts) are found in a formation, then the inclusions must be older than the formation that contains them. For example, in sedimentary rocks, it is common for gravel from an older formation to be ripped up and included in a newer layer. A similar situation with igneous rocks occurs when xenoliths are found. These foreign bodies are picked up as magma or lava flows, and are incorporated, later to cool in the matrix. As a result, xenoliths are older than the rock that contains them. The principle of original horizontality states that the deposition of sediments occurs as essentially horizontal beds. Observation of modern marine and non-marine sediments in a wide variety of environments supports this generalization (although cross-bedding is inclined, the overall orientation of cross-bedded units is horizontal).The principle of superposition states that a sedimentary rock layer in a tectonically undisturbed sequence is younger than the one beneath it and older than the one above it. Logically a younger layer cannot slip beneath a layer previously deposited. This principle allows sedimentary layers to be viewed as a form of the vertical timeline, a partial or complete record of the time elapsed from deposition of the lowest layer to deposition of the highest bed.The principle of faunal succession is based on the appearance of fossils in sedimentary rocks. As organisms exist during the same period throughout the world, their presence or (sometimes) absence provides a relative age of the formations where they appear. Based on principles that William Smith laid out almost a hundred years before the publication of Charles Darwin's theory of evolution, the principles of succession developed independently of evolutionary thought. 
The principle becomes quite complex, however, given the uncertainties of fossilization, localization of fossil types due to lateral changes in habitat (facies change in sedimentary strata), and that not all fossils formed globally at the same time. === Absolute dating === Geologists also use methods to determine the absolute age of rock samples and geological events. These dates are useful on their own and may also be used in conjunction with relative dating methods or to calibrate relative methods.At the beginning of the 20th century, advancement in geological science was facilitated by the ability to obtain accurate absolute dates to geological events using radioactive isotopes and other methods. This changed the understanding of geological time. Previously, geologists could only use fossils and stratigraphic correlation to date sections of rock relative to one another. With isotopic dates, it became possible to assign absolute ages to rock units, and these absolute dates could be applied to fossil sequences in which there was datable material, converting the old relative ages into new absolute ages. For many geological applications, isotope ratios of radioactive elements are measured in minerals that give the amount of time that has passed since a rock passed through its particular closure temperature, the point at which different radiometric isotopes stop diffusing into and out of the crystal lattice. These are used in geochronologic and thermochronologic studies. Common methods include uranium–lead dating, potassium–argon dating, argon–argon dating and uranium–thorium dating. These methods are used for a variety of applications. Dating of lava and volcanic ash layers found within a stratigraphic sequence can provide absolute age data for sedimentary rock units that do not contain radioactive isotopes and calibrate relative dating techniques. These methods can also be used to determine ages of pluton emplacement. 
Thermochemical techniques can be used to determine temperature profiles within the crust, the uplift of mountain ranges, and paleo-topography. Fractionation of the lanthanide series elements is used to compute ages since rocks were removed from the mantle. Other methods are used for more recent events. Optically stimulated luminescence and cosmogenic radionuclide dating are used to date surfaces and/or erosion rates. Dendrochronology can also be used for the dating of landscapes. Radiocarbon dating is used for geologically young materials containing organic carbon. == Geological development of an area == The geology of an area changes through time as rock units are deposited and inserted, and deformational processes change their shapes and locations. Rock units are first emplaced either by deposition onto the surface or intrusion into the overlying rock. Deposition can occur when sediments settle onto the surface of the Earth and later lithify into sedimentary rock, or when as volcanic material such as volcanic ash or lava flows blanket the surface. Igneous intrusions such as batholiths, laccoliths, dikes, and sills, push upwards into the overlying rock, and crystallize as they intrude. After the initial sequence of rocks has been deposited, the rock units can be deformed and/or metamorphosed. Deformation typically occurs as a result of horizontal shortening, horizontal extension, or side-to-side (strike-slip) motion. These structural regimes broadly relate to convergent boundaries, divergent boundaries, and transform boundaries, respectively, between tectonic plates. When rock units are placed under horizontal compression, they shorten and become thicker. Because rock units, other than muds, do not significantly change in volume, this is accomplished in two primary ways: through faulting and folding. In the shallow crust, where brittle deformation can occur, thrust faults form, which causes the deeper rock to move on top of the shallower rock. 
Because deeper rock is often older, as noted by the principle of superposition, this can result in older rocks moving on top of younger ones. Movement along faults can result in folding, either because the faults are not planar or because rock layers are dragged along, forming drag folds as slip occurs along the fault. Deeper in the Earth, rocks behave plastically and fold instead of faulting. These folds can either be those where the material in the center of the fold buckles upwards, creating "antiforms", or where it buckles downwards, creating "synforms". If the tops of the rock units within the folds remain pointing upwards, they are called anticlines and synclines, respectively. If some of the units in the fold are facing downward, the structure is called an overturned anticline or syncline, and if all of the rock units are overturned or the correct up-direction is unknown, they are simply called by the most general terms, antiforms and synforms. Even higher pressures and temperatures during horizontal shortening can cause both folding and metamorphism of the rocks. This metamorphism changes the mineral composition of the rocks and creates a foliation, or planar surface, that is related to mineral growth under stress. This can remove signs of the original textures of the rocks, such as bedding in sedimentary rocks, flow features of lavas, and crystal patterns in crystalline rocks. Extension causes the rock units as a whole to become longer and thinner. This is primarily accomplished through normal faulting and through ductile stretching and thinning. Normal faults drop rock units that are higher below those that are lower. This typically results in younger units ending up below older units. Stretching of units can result in their thinning. In fact, at one location within the Maria Fold and Thrust Belt, the entire sedimentary sequence of the Grand Canyon appears over a length of less than a meter. 
Rocks at the depth to be ductilely stretched are often also metamorphosed. These stretched rocks can also pinch into lenses, known as boudins, after the French word for "sausage" because of their visual similarity. Where rock units slide past one another, strike-slip faults develop in shallow regions, and become shear zones at deeper depths where the rocks deform ductilely. The addition of new rock units, both depositionally and intrusively, often occurs during deformation. Faulting and other deformational processes result in the creation of topographic gradients, causing material on the rock unit that is increasing in elevation to be eroded by hillslopes and channels. These sediments are deposited on the rock unit that is going down. Continual motion along the fault maintains the topographic gradient in spite of the movement of sediment and continues to create accommodation space for the material to deposit. Deformational events are often also associated with volcanism and igneous activity. Volcanic ashes and lavas accumulate on the surface, and igneous intrusions enter from below. Dikes, long, planar igneous intrusions, enter along cracks, and therefore often form in large numbers in areas that are being actively deformed. This can result in the emplacement of dike swarms, such as those that are observable across the Canadian shield, or rings of dikes around the lava tube of a volcano. All of these processes do not necessarily occur in a single environment and do not necessarily occur in a single order. The Hawaiian Islands, for example, consist almost entirely of layered basaltic lava flows. The sedimentary sequences of the mid-continental United States and the Grand Canyon in the southwestern United States contain almost-undeformed stacks of sedimentary rocks that have remained in place since Cambrian time. Other areas are much more geologically complex. 
In the southwestern United States, sedimentary, volcanic, and intrusive rocks have been metamorphosed, faulted, foliated, and folded. Even older rocks, such as the Acasta gneiss of the Slave craton in northwestern Canada, the oldest known rock in the world, have been metamorphosed to the point where their origin is indiscernible without laboratory analysis. In addition, these processes can occur in stages. In many places, the Grand Canyon in the southwestern United States being a very visible example, the lower rock units were metamorphosed and deformed, and then deformation ended and the upper, undeformed units were deposited. Although any amount of rock emplacement and rock deformation can occur, and they can occur any number of times, these concepts provide a guide to understanding the geological history of an area. == Methods of geology == Geologists use a number of field, laboratory, and numerical modeling methods to decipher Earth history and to understand the processes that occur on and inside the Earth. In typical geological investigations, geologists use primary information related to petrology (the study of rocks), stratigraphy (the study of sedimentary layers), and structural geology (the study of positions of rock units and their deformation). In many cases, geologists also study modern soils, rivers, landscapes, and glaciers; investigate past and current life and biogeochemical pathways; and use geophysical methods to investigate the subsurface. Sub-specialities of geology may distinguish endogenous and exogenous geology. === Field methods === Geological field work varies depending on the task at hand. Typical fieldwork could consist of the following. Geological mapping: structural mapping, identifying the locations of major rock units and the faults and folds that led to their placement there; stratigraphic mapping, pinpointing the locations of sedimentary facies (lithofacies and biofacies) or the mapping of isopachs of equal thickness of sedimentary rock; and surficial mapping, recording the locations of soils and surficial deposits. 
Surveying of topographic features: compilation of topographic maps, and work to understand change across landscapes, including patterns of erosion and deposition, river-channel change through migration and avulsion, and hillslope processes. Subsurface mapping through geophysical methods: these methods include shallow seismic surveys, ground-penetrating radar, aeromagnetic surveys, and electrical resistivity tomography; they aid in hydrocarbon exploration, finding groundwater, and locating buried archaeological artifacts. High-resolution stratigraphy: measuring and describing stratigraphic sections on the surface, and well drilling and logging. Biogeochemistry and geomicrobiology: collecting samples to determine biochemical pathways, identify new species of organisms, and identify new chemical compounds, and using these discoveries to understand early life on Earth and how it functioned and metabolized, and to find important compounds for use in pharmaceuticals. Paleontology: excavation of fossil material for research into past life and evolution, and for museums and education. Collection of samples for geochronology and thermochronology. Glaciology: measurement of characteristics of glaciers and their motion. === Petrology === In addition to identifying rocks in the field (lithology), petrologists identify rock samples in the laboratory. Two of the primary methods for identifying rocks in the laboratory are through optical microscopy and by using an electron microprobe. 
In an optical mineralogy analysis, petrologists analyze thin sections of rock samples using a petrographic microscope, where the minerals can be identified through their different properties in plane-polarized and cross-polarized light, including their birefringence, pleochroism, twinning, and interference properties with a conoscopic lens. In the electron microprobe, individual locations are analyzed for their exact chemical compositions and variation in composition within individual crystals. Stable and radioactive isotope studies provide insight into the geochemical evolution of rock units. Petrologists can also use fluid inclusion data and perform high temperature and pressure physical experiments to understand the temperatures and pressures at which different mineral phases appear, and how they change through igneous and metamorphic processes. This research can be extrapolated to the field to understand metamorphic processes and the conditions of crystallization of igneous rocks. This work can also help to explain processes that occur within the Earth, such as subduction and magma chamber evolution. === Structural geology === Structural geologists use microscopic analysis of oriented thin sections of geological samples to observe the fabric within the rocks, which gives information about strain within the crystalline structure of the rocks. They also plot and combine measurements of geological structures to better understand the orientations of faults and folds to reconstruct the history of rock deformation in the area. In addition, they perform analog and numerical experiments of rock deformation in large and small settings. The analysis of structures is often accomplished by plotting the orientations of various features onto stereonets. A stereonet is a stereographic projection of a sphere onto a plane, in which planes are projected as lines and lines are projected as points. 
These can be used to find the locations of fold axes, relationships between faults, and relationships between other geological structures. Among the most well-known experiments in structural geology are those involving orogenic wedges, which are zones in which mountains are built along convergent tectonic plate boundaries. In the analog versions of these experiments, horizontal layers of sand are pulled along a lower surface into a back stop, which results in realistic-looking patterns of faulting and the growth of a critically tapered (all angles remain the same) orogenic wedge. Numerical models work in the same way as these analog models, though they are often more sophisticated and can include patterns of erosion and uplift in the mountain belt. This helps to show the relationship between erosion and the shape of a mountain range. These studies can also give useful information about pathways for metamorphism through pressure, temperature, space, and time. === Stratigraphy === In the laboratory, stratigraphers analyze samples of stratigraphic sections that can be returned from the field, such as those from drill cores. Stratigraphers also analyze data from geophysical surveys that show the locations of stratigraphic units in the subsurface. Geophysical data and well logs can be combined to produce a better view of the subsurface, and stratigraphers often use computer programs to do this in three dimensions. Stratigraphers can then use these data to reconstruct ancient processes occurring on the surface of the Earth, interpret past environments, and locate areas for water, coal, and hydrocarbon extraction. In the laboratory, biostratigraphers analyze rock samples from outcrop and drill cores for the fossils found in them. These fossils help scientists to date the core and to understand the depositional environment in which the rock units formed. 
Geochronologists precisely date rocks within the stratigraphic section to provide better absolute bounds on the timing and rates of deposition. Magnetic stratigraphers look for signs of magnetic reversals in igneous rock units within the drill cores. Other scientists perform stable-isotope studies on the rocks to gain information about past climate. == Planetary geology == With the advent of space exploration in the twentieth century, geologists have begun to look at other planetary bodies in the same ways that have been developed to study the Earth. This new field of study is called planetary geology (sometimes known as astrogeology) and relies on known geological principles to study other bodies of the solar system. This is a major aspect of planetary science, and largely focuses on the terrestrial planets, icy moons, asteroids, comets, and meteorites. However, some planetary geophysicists study the giant planets and exoplanets. Although the Greek-language-origin prefix geo refers to Earth, "geology" is often used in conjunction with the names of other planetary bodies when describing their composition and internal processes: examples are "the geology of Mars" and "Lunar geology". Specialized terms such as selenology (studies of the Moon), areology (of Mars), etc., are also in use. Although planetary geologists are interested in studying all aspects of other planets, a significant focus is to search for evidence of past or present life on other worlds. This has led to many missions whose primary or ancillary purpose is to examine planetary bodies for evidence of life. One of these is the Phoenix lander, which analyzed Martian polar soil for water, chemical, and mineralogical constituents related to biological processes. == Applied geology == === Economic geology === Economic geology is a branch of geology that deals with aspects of economic minerals that humankind uses to fulfill various needs. 
Economic minerals are those extracted profitably for various practical uses. Economic geologists help locate and manage the Earth's natural resources, such as petroleum and coal, as well as mineral resources, which include metals such as iron, copper, and uranium. ==== Mining geology ==== Mining geology consists of the extractions of mineral resources from the Earth. Some resources of economic interests include gemstones, metals such as gold and copper, and many minerals such as asbestos, perlite, mica, phosphates, zeolites, clay, pumice, quartz, and silica, as well as elements such as sulfur, chlorine, and helium. ==== Petroleum geology ==== Petroleum geologists study the locations of the subsurface of the Earth that can contain extractable hydrocarbons, especially petroleum and natural gas. Because many of these reservoirs are found in sedimentary basins, they study the formation of these basins, as well as their sedimentary and tectonic evolution and the present-day positions of the rock units. === Engineering geology === Engineering geology is the application of geological principles to engineering practice for the purpose of assuring that the geological factors affecting the location, design, construction, operation, and maintenance of engineering works are properly addressed. Engineering geology is distinct from geological engineering, particularly in North America. In the field of civil engineering, geological principles and analyses are used in order to ascertain the mechanical principles of the material on which structures are built. This allows tunnels to be built without collapsing, bridges and skyscrapers to be built with sturdy foundations, and buildings to be built that will not settle in clay and mud. 
=== Hydrology === Geology and geological principles can be applied to various environmental problems such as stream restoration, the restoration of brownfields, and the understanding of the interaction between natural habitat and the geological environment. Groundwater hydrology, or hydrogeology, is used to locate groundwater, which can often provide a ready supply of uncontaminated water and is especially important in arid regions, and to monitor the spread of contaminants in groundwater wells. === Paleoclimatology === Geologists also obtain data through stratigraphy, boreholes, core samples, and ice cores. Ice cores and sediment cores are used for paleoclimate reconstructions, which tell geologists about past and present temperature, precipitation, and sea level across the globe. These datasets are our primary source of information on global climate change outside of instrumental data. === Natural hazards === Geologists and geophysicists study natural hazards in order to enact safe building codes and warning systems that are used to prevent loss of property and life. Examples of important natural hazards that are pertinent to geology (as opposed to those that are mainly or only pertinent to meteorology) are: == History == The study of the physical material of the Earth dates back at least to ancient Greece when Theophrastus (372–287 BCE) wrote the work Peri Lithon (On Stones). During the Roman period, Pliny the Elder wrote in detail of the many minerals and metals, then in practical use – even correctly noting the origin of amber. Additionally, in the 4th century BCE Aristotle made critical observations of the slow rate of geological change. He observed the composition of the land and formulated a theory where the Earth changes at a slow rate and that these changes cannot be observed during one person's lifetime. 
Aristotle developed one of the first evidence-based concepts connected to the geological realm regarding the rate at which the Earth physically changes. Abu al-Rayhan al-Biruni (973–1048 CE) was one of the earliest Persian geologists, whose works included the earliest writings on the geology of India, hypothesizing that the Indian subcontinent was once a sea. Drawing from Greek and Indian scientific literature that were not destroyed by the Muslim conquests, the Persian scholar Ibn Sina (Avicenna, 981–1037) proposed detailed explanations for the formation of mountains, the origin of earthquakes, and other topics central to modern geology, which provided an essential foundation for the later development of the science. In China, the polymath Shen Kuo (1031–1095) formulated a hypothesis for the process of land formation: based on his observation of fossil animal shells in a geological stratum in a mountain hundreds of miles from the ocean, he inferred that the land was formed by the erosion of the mountains and by deposition of silt. Nicolas Steno (1638–1686) is credited with the law of superposition, the principle of original horizontality, and the principle of lateral continuity: three defining principles of stratigraphy. The word geology was first used by Ulisse Aldrovandi in 1603, then by Jean-André Deluc in 1778 and introduced as a fixed term by Horace-Bénédict de Saussure in 1779. The word is derived from the Greek γῆ, gê, meaning "earth" and λόγος, logos, meaning "speech". But according to another source, the word "geology" comes from a Norwegian, Mikkel Pedersøn Escholt (1600–1699), who was a priest and scholar. Escholt first used the definition in his book titled, Geologia Norvegica (1657). William Smith (1769–1839) drew some of the first geological maps and began the process of ordering rock strata (layers) by examining the fossils contained in them. In 1763, Mikhail Lomonosov published his treatise On the Strata of Earth. 
His work was the first narrative of modern geology, based on the unity of processes in time and explanation of the Earth's past from the present. James Hutton (1726-1797) is often viewed as the first modern geologist. In 1785 he presented a paper entitled Theory of the Earth to the Royal Society of Edinburgh. In his paper, he explained his theory that the Earth must be much older than had previously been supposed to allow enough time for mountains to be eroded and for sediments to form new rocks at the bottom of the sea, which in turn were raised up to become dry land. Hutton published a two-volume version of his ideas in 1795. Followers of Hutton were known as Plutonists because they believed that some rocks were formed by vulcanism, which is the deposition of lava from volcanoes, as opposed to the Neptunists, led by Abraham Werner, who believed that all rocks had settled out of a large ocean whose level gradually dropped over time. The first geological map of the U.S. was produced in 1809 by William Maclure. In 1807, Maclure commenced the self-imposed task of making a geological survey of the United States. Almost every state in the Union was traversed and mapped by him, the Allegheny Mountains being crossed and recrossed some 50 times. The results of his unaided labours were submitted to the American Philosophical Society in a memoir entitled Observations on the Geology of the United States explanatory of a Geological Map, and published in the Society's Transactions, together with the nation's first geological map. This antedates William Smith's geological map of England by six years, although it was constructed using a different classification of rocks. Sir Charles Lyell (1797-1875) first published his famous book, Principles of Geology, in 1830. This book, which influenced the thought of Charles Darwin, successfully promoted the doctrine of uniformitarianism. 
This theory states that slow geological processes have occurred throughout the Earth's history and are still occurring today. In contrast, catastrophism is the theory that Earth's features formed in single, catastrophic events and remained unchanged thereafter. Though Hutton believed in uniformitarianism, the idea was not widely accepted at the time. Much of 19th-century geology revolved around the question of the Earth's exact age. Estimates varied from a few hundred thousand to billions of years. By the early 20th century, radiometric dating allowed the Earth's age to be estimated at two billion years. The awareness of this vast amount of time opened the door to new theories about the processes that shaped the planet. Some of the most significant advances in 20th-century geology have been the development of the theory of plate tectonics in the 1960s and the refinement of estimates of the planet's age. Plate tectonics theory arose from two separate geological observations: seafloor spreading and continental drift. The theory revolutionized the Earth sciences. Today the Earth is known to be approximately 4.5 billion years old. == Fields or related disciplines == == See also == == References == == External links == One Geology: This interactive geological map of the world is an international initiative of the geological surveys around the globe. This groundbreaking project was launched in 2007 and contributed to the 'International Year of Planet Earth', becoming one of their flagship projects. 
Earth Science News, Maps, Dictionary, Articles, Jobs American Geophysical Union American Geosciences Institute European Geosciences Union Geological Society of America Geological Society of London Video-interviews with famous geologists Geology OpenTextbook Chronostratigraphy benchmarks ====================================== Geology (from Ancient Greek γῆ (gê) 'earth', and λoγία (-logía) 'study of, discourse') is a branch of natural science concerned with Earth and other astronomical objects, the features or rocks of which it is composed, and the processes by which they change over time. Modern geology significantly overlaps all other Earth sciences, including hydrology, and so is treated as one major aspect of integrated Earth system science and planetary science. Geology describes the structure of the Earth on and beneath its surface, and the processes that have shaped that structure. It also provides tools to determine the relative and absolute ages of rocks found in a given location, and also to describe the histories of those rocks. By combining these tools, geologists are able to chronicle the geological history of the Earth as a whole, and also to demonstrate the age of the Earth. Geology provides the primary evidence for plate tectonics, the evolutionary history of life, and the Earth's past climates. Geologists broadly study the properties and processes of Earth and other terrestrial planets and predominantly solid planetary bodies. Geologists use a wide variety of methods to understand the Earth's structure and evolution, including field work, rock description, geophysical techniques, chemical analysis, physical experiments, and numerical modelling. In practical terms, geology is important for mineral and hydrocarbon exploration and exploitation, evaluating water resources, understanding natural hazards, the remediation of environmental problems, and providing insights into past climate change. 
Geology is a major academic discipline, and it is central to geological engineering and plays an important role in geotechnical engineering. == Geological material == The majority of geological data comes from research on solid Earth materials. Meteorites and other extraterrestrial natural materials are also studied by geological methods. === Mineral === Minerals are naturally occurring elements and compounds with a definite homogeneous chemical composition and ordered atomic composition. Each mineral has distinct physical properties, and there are many tests to determine each of them. The specimens can be tested for: Luster: Quality of light reflected from the surface of a mineral. Examples are metallic, pearly, waxy, dull. Color: Minerals are grouped by their color. Mostly diagnostic but impurities can change a mineral's color. Streak: Performed by scratching the sample on a porcelain plate. The color of the streak can help name the mineral. Hardness: The resistance of a mineral to scratching. Breakage pattern: A mineral can either show fracture or cleavage, the former being breakage of uneven surfaces, and the latter a breakage along closely spaced parallel planes. Specific gravity: the weight of a specific volume of a mineral. Effervescence: Involves dripping hydrochloric acid on the mineral to test for fizzing. Magnetism: Involves using a magnet to test for magnetism. Taste: Minerals can have a distinctive taste, such as halite (which tastes like table salt). === Rock === A rock is any naturally occurring solid mass or aggregate of minerals or mineraloids. Most research in geology is associated with the study of rocks, as they provide the primary record of the majority of the geological history of the Earth. There are three major types of rock: igneous, sedimentary, and metamorphic. The rock cycle illustrates the relationships among them (see diagram). When a rock solidifies or crystallizes from melt (magma or lava), it is an igneous rock. 
This rock can be weathered and eroded, then redeposited and lithified into a sedimentary rock. It can then be turned into a metamorphic rock by heat and pressure that change its mineral content, resulting in a characteristic fabric. All three types may melt again, and when this happens, new magma is formed, from which an igneous rock may once more solidify. Organic matter, such as coal, bitumen, oil and natural gas, is linked mainly to organic-rich sedimentary rocks. To study all three types of rock, geologists evaluate the minerals of which they are composed and their other physical properties, such as texture and fabric. === Unlithified material === Geologists also study unlithified materials (referred to as superficial deposits) that lie above the bedrock. This study is often known as Quaternary geology, after the Quaternary period of geologic history, which is the most recent period of geologic time. ==== Magma ==== Magma is the original unlithified source of all igneous rocks. The active flow of molten rock is closely studied in volcanology, and igneous petrology aims to determine the history of igneous rocks from their original molten source to their final crystallization. == Whole-Earth structure == === Plate tectonics === In the 1960s, it was discovered that the Earth's lithosphere, which includes the crust and rigid uppermost portion of the upper mantle, is separated into tectonic plates that move across the plastically deforming, solid, upper mantle, which is called the asthenosphere. This theory is supported by several types of observations, including seafloor spreading and the global distribution of mountain terrain and seismicity. There is an intimate coupling between the movement of the plates on the surface and the convection of the mantle (that is, the heat transfer caused by the slow movement of ductile mantle rock). 
Thus, oceanic plates and the adjoining mantle convection currents always move in the same direction – because the oceanic lithosphere is actually the rigid upper thermal boundary layer of the convecting mantle. This coupling between rigid plates moving on the surface of the Earth and the convecting mantle is called plate tectonics. The development of plate tectonics has provided a physical basis for many observations of the solid Earth. Long linear regions of geological features are explained as plate boundaries. For example: Mid-ocean ridges, high regions on the seafloor where hydrothermal vents and volcanoes exist, are seen as divergent boundaries, where two plates move apart. Arcs of volcanoes and earthquakes are theorized as convergent boundaries, where one plate subducts, or moves, under another. Transform boundaries, such as the San Andreas Fault system, resulted in widespread powerful earthquakes. Plate tectonics also has provided a mechanism for Alfred Wegener's theory of continental drift, in which the continents move across the surface of the Earth over geological time. It also provided a driving force for crustal deformation, and a new setting for the observations of structural geology. The power of the theory of plate tectonics lies in its ability to combine all of these observations into a single theory of how the lithosphere moves over the convecting mantle. === Earth structure === Advances in seismology, computer modeling, and mineralogy and crystallography at high temperatures and pressures give insights into the internal composition and structure of the Earth. Seismologists can use the arrival times of seismic waves to image the interior of the Earth. Early advances in this field showed the existence of a liquid outer core (where shear waves were not able to propagate) and a dense solid inner core. 
These advances led to the development of a layered model of the Earth, with a crust and lithosphere on top, the mantle below (separated within itself by seismic discontinuities at 410 and 660 kilometers), and the outer core and inner core below that. More recently, seismologists have been able to create detailed images of wave speeds inside the earth in the same way a doctor images a body in a CT scan. These images have led to a much more detailed view of the interior of the Earth, and have replaced the simplified layered model with a much more dynamic model. Mineralogists have been able to use the pressure and temperature data from the seismic and modeling studies alongside knowledge of the elemental composition of the Earth to reproduce these conditions in experimental settings and measure changes in crystal structure. These studies explain the chemical changes associated with the major seismic discontinuities in the mantle and show the crystallographic structures expected in the inner core of the Earth. == Geological time == The geological time scale encompasses the history of the Earth. It is bracketed at the earliest by the dates of the first Solar System material at 4.567 Ga (or 4.567 billion years ago) and the formation of the Earth at 4.54 Ga (4.54 billion years), which is the beginning of the informally recognized Hadean eon – a division of geological time. At the later end of the scale, it is marked by the present day (in the Holocene epoch). === Timescale of the Earth === The following five timelines show the geologic time scale to scale. The first shows the entire time from the formation of the Earth to the present, but this gives little space for the most recent eon. The second timeline shows an expanded view of the most recent eon. In a similar way, the most recent era is expanded in the third timeline, the most recent period is expanded in the fourth timeline, and the most recent epoch is expanded in the fifth timeline. 
=== Important milestones on Earth === 4.567 Ga (gigaannum: billion years ago): Solar system formation 4.54 Ga: Accretion, or formation, of Earth c. 4 Ga: End of Late Heavy Bombardment, the first life c. 3.5 Ga: Start of photosynthesis c. 2.3 Ga: Oxygenated atmosphere, first snowball Earth 730–635 Ma (megaannum: million years ago): second snowball Earth 541 ± 0.3 Ma: Cambrian explosion – vast multiplication of hard-bodied life; first abundant fossils; start of the Paleozoic c. 380 Ma: First vertebrate land animals 250 Ma: Permian-Triassic extinction – 90% of all land animals die; end of Paleozoic and beginning of Mesozoic 66 Ma: Cretaceous–Paleogene extinction – Dinosaurs die; end of Mesozoic and beginning of Cenozoic c. 7 Ma: First hominins appear 3.9 Ma: First Australopithecus, direct ancestor to modern Homo sapiens, appear 200 ka (kiloannum: thousand years ago): First modern Homo sapiens appear in East Africa === Timescale of the Moon === === Timescale of Mars === == Dating methods == === Relative dating === Methods for relative dating were developed when geology first emerged as a natural science. Geologists still use the following principles today as a means to provide information about geological history and the timing of geological events. The principle of uniformitarianism states that the geological processes observed in operation that modify the Earth's crust at present have worked in much the same way over geological time. A fundamental principle of geology advanced by the 18th-century Scottish physician and geologist James Hutton is that "the present is the key to the past." In Hutton's words: "the past history of our globe must be explained by what can be seen to be happening now." The principle of intrusive relationships concerns crosscutting intrusions. In geology, when an igneous intrusion cuts across a formation of sedimentary rock, it can be determined that the igneous intrusion is younger than the sedimentary rock. 
Different types of intrusions include stocks, laccoliths, batholiths, sills and dikes. The principle of cross-cutting relationships pertains to the formation of faults and the age of the sequences through which they cut. Faults are younger than the rocks they cut; accordingly, if a fault is found that penetrates some formations but not those on top of it, then the formations that were cut are older than the fault, and the ones that are not cut must be younger than the fault. Finding the key bed in these situations may help determine whether the fault is a normal fault or a thrust fault.The principle of inclusions and components states that, with sedimentary rocks, if inclusions (or clasts) are found in a formation, then the inclusions must be older than the formation that contains them. For example, in sedimentary rocks, it is common for gravel from an older formation to be ripped up and included in a newer layer. A similar situation with igneous rocks occurs when xenoliths are found. These foreign bodies are picked up as magma or lava flows, and are incorporated, later to cool in the matrix. As a result, xenoliths are older than the rock that contains them. The principle of original horizontality states that the deposition of sediments occurs as essentially horizontal beds. Observation of modern marine and non-marine sediments in a wide variety of environments supports this generalization (although cross-bedding is inclined, the overall orientation of cross-bedded units is horizontal).The principle of superposition states that a sedimentary rock layer in a tectonically undisturbed sequence is younger than the one beneath it and older than the one above it. Logically a younger layer cannot slip beneath a layer previously deposited. 
This principle allows sedimentary layers to be viewed as a form of the vertical timeline, a partial or complete record of the time elapsed from deposition of the lowest layer to deposition of the highest bed.The principle of faunal succession is based on the appearance of fossils in sedimentary rocks. As organisms exist during the same period throughout the world, their presence or (sometimes) absence provides a relative age of the formations where they appear. Based on principles that William Smith laid out almost a hundred years before the publication of Charles Darwin's theory of evolution, the principles of succession developed independently of evolutionary thought. The principle becomes quite complex, however, given the uncertainties of fossilization, localization of fossil types due to lateral changes in habitat (facies change in sedimentary strata), and that not all fossils formed globally at the same time. === Absolute dating === Geologists also use methods to determine the absolute age of rock samples and geological events. These dates are useful on their own and may also be used in conjunction with relative dating methods or to calibrate relative methods.At the beginning of the 20th century, advancement in geological science was facilitated by the ability to obtain accurate absolute dates to geological events using radioactive isotopes and other methods. This changed the understanding of geological time. Previously, geologists could only use fossils and stratigraphic correlation to date sections of rock relative to one another. With isotopic dates, it became possible to assign absolute ages to rock units, and these absolute dates could be applied to fossil sequences in which there was datable material, converting the old relative ages into new absolute ages. 
For many geological applications, isotope ratios of radioactive elements are measured in minerals that give the amount of time that has passed since a rock passed through its particular closure temperature, the point at which different radiometric isotopes stop diffusing into and out of the crystal lattice. These are used in geochronologic and thermochronologic studies. Common methods include uranium–lead dating, potassium–argon dating, argon–argon dating and uranium–thorium dating. These methods are used for a variety of applications. Dating of lava and volcanic ash layers found within a stratigraphic sequence can provide absolute age data for sedimentary rock units that do not contain radioactive isotopes and calibrate relative dating techniques. These methods can also be used to determine ages of pluton emplacement. Thermochemical techniques can be used to determine temperature profiles within the crust, the uplift of mountain ranges, and paleo-topography. Fractionation of the lanthanide series elements is used to compute ages since rocks were removed from the mantle. Other methods are used for more recent events. Optically stimulated luminescence and cosmogenic radionuclide dating are used to date surfaces and/or erosion rates. Dendrochronology can also be used for the dating of landscapes. Radiocarbon dating is used for geologically young materials containing organic carbon. == Geological development of an area == The geology of an area changes through time as rock units are deposited and inserted, and deformational processes change their shapes and locations. Rock units are first emplaced either by deposition onto the surface or intrusion into the overlying rock. Deposition can occur when sediments settle onto the surface of the Earth and later lithify into sedimentary rock, or when as volcanic material such as volcanic ash or lava flows blanket the surface. 
Igneous intrusions such as batholiths, laccoliths, dikes, and sills, push upwards into the overlying rock, and crystallize as they intrude. After the initial sequence of rocks has been deposited, the rock units can be deformed and/or metamorphosed. Deformation typically occurs as a result of horizontal shortening, horizontal extension, or side-to-side (strike-slip) motion. These structural regimes broadly relate to convergent boundaries, divergent boundaries, and transform boundaries, respectively, between tectonic plates. When rock units are placed under horizontal compression, they shorten and become thicker. Because rock units, other than muds, do not significantly change in volume, this is accomplished in two primary ways: through faulting and folding. In the shallow crust, where brittle deformation can occur, thrust faults form, which causes the deeper rock to move on top of the shallower rock. Because deeper rock is often older, as noted by the principle of superposition, this can result in older rocks moving on top of younger ones. Movement along faults can result in folding, either because the faults are not planar or because rock layers are dragged along, forming drag folds as slip occurs along the fault. Deeper in the Earth, rocks behave plastically and fold instead of faulting. These folds can either be those where the material in the center of the fold buckles upwards, creating "antiforms", or where it buckles downwards, creating "synforms". If the tops of the rock units within the folds remain pointing upwards, they are called anticlines and synclines, respectively. If some of the units in the fold are facing downward, the structure is called an overturned anticline or syncline, and if all of the rock units are overturned or the correct up-direction is unknown, they are simply called by the most general terms, antiforms, and synforms. 
Even higher pressures and temperatures during horizontal shortening can cause both folding and metamorphism of the rocks. This metamorphism causes changes in the mineral composition of the rocks; creates a foliation, or planar surface, that is related to mineral growth under stress. This can remove signs of the original textures of the rocks, such as bedding in sedimentary rocks, flow features of lavas, and crystal patterns in crystalline rocks. Extension causes the rock units as a whole to become longer and thinner. This is primarily accomplished through normal faulting and through the ductile stretching and thinning. Normal faults drop rock units that are higher below those that are lower. This typically results in younger units ending up below older units. Stretching of units can result in their thinning. In fact, at one location within the Maria Fold and Thrust Belt, the entire sedimentary sequence of the Grand Canyon appears over a length of less than a meter. Rocks at the depth to be ductilely stretched are often also metamorphosed. These stretched rocks can also pinch into lenses, known as boudins, after the French word for "sausage" because of their visual similarity. Where rock units slide past one another, strike-slip faults develop in shallow regions, and become shear zones at deeper depths where the rocks deform ductilely. The addition of new rock units, both depositionally and intrusively, often occurs during deformation. Faulting and other deformational processes result in the creation of topographic gradients, causing material on the rock unit that is increasing in elevation to be eroded by hillslopes and channels. These sediments are deposited on the rock unit that is going down. Continual motion along the fault maintains the topographic gradient in spite of the movement of sediment and continues to create accommodation space for the material to deposit. Deformational events are often also associated with volcanism and igneous activity. 
Volcanic ashes and lavas accumulate on the surface, and igneous intrusions enter from below. Dikes, long, planar igneous intrusions, enter along cracks, and therefore often form in large numbers in areas that are being actively deformed. This can result in the emplacement of dike swarms, such as those that are observable across the Canadian shield, or rings of dikes around the lava tube of a volcano. All of these processes do not necessarily occur in a single environment and do not necessarily occur in a single order. The Hawaiian Islands, for example, consist almost entirely of layered basaltic lava flows. The sedimentary sequences of the mid-continental United States and the Grand Canyon in the southwestern United States contain almost-undeformed stacks of sedimentary rocks that have remained in place since Cambrian time. Other areas are much more geologically complex. In the southwestern United States, sedimentary, volcanic, and intrusive rocks have been metamorphosed, faulted, foliated, and folded. Even older rocks, such as the Acasta gneiss of the Slave craton in northwestern Canada, the oldest known rock in the world have been metamorphosed to the point where their origin is indiscernible without laboratory analysis. In addition, these processes can occur in stages. In many places, the Grand Canyon in the southwestern United States being a very visible example, the lower rock units were metamorphosed and deformed, and then deformation ended and the upper, undeformed units were deposited. Although any amount of rock emplacement and rock deformation can occur, and they can occur any number of times, these concepts provide a guide to understanding the geological history of an area. == Methods of geology == Geologists use a number of fields, laboratory, and numerical modeling methods to decipher Earth history and to understand the processes that occur on and inside the Earth. 
In typical geological investigations, geologists use primary information related to petrology (the study of rocks), stratigraphy (the study of sedimentary layers), and structural geology (the study of positions of rock units and their deformation). In many cases, geologists also study modern soils, rivers, landscapes, and glaciers; investigate past and current life and biogeochemical pathways, and use geophysical methods to investigate the subsurface. Sub-specialities of geology may distinguish endogenous and exogenous geology. === Field methods === Geological field work varies depending on the task at hand. Typical fieldwork could consist of: Geological mappingStructural mapping: identifying the locations of major rock units and the faults and folds that led to their placement there. Stratigraphic mapping: pinpointing the locations of sedimentary facies (lithofacies and biofacies) or the mapping of isopachs of equal thickness of sedimentary rock Surficial mapping: recording the locations of soils and surficial deposits Surveying of topographic features compilation of topographic maps Work to understand change across landscapes, including: Patterns of erosion and deposition River-channel change through migration and avulsion Hillslope processes Subsurface mapping through geophysical methodsThese methods include: Shallow seismic surveys Ground-penetrating radar Aeromagnetic surveys Electrical resistivity tomography They aid in: Hydrocarbon exploration Finding groundwater Locating buried archaeological artifacts High-resolution stratigraphy Measuring and describing stratigraphic sections on the surface Well drilling and logging Biogeochemistry and geomicrobiologyCollecting samples to: determine biochemical pathways identify new species of organisms identify new chemical compounds and to use these discoveries to: understand early life on Earth and how it functioned and metabolized find important compounds for use in pharmaceuticals Paleontology: excavation of fossil 
material For research into past life and evolution For museums and education Collection of samples for geochronology and thermochronology Glaciology: measurement of characteristics of glaciers and their motion === Petrology === In addition to identifying rocks in the field (lithology), petrologists identify rock samples in the laboratory. Two of the primary methods for identifying rocks in the laboratory are through optical microscopy and by using an electron microprobe. In an optical mineralogy analysis, petrologists analyze thin sections of rock samples using a petrographic microscope, where the minerals can be identified through their different properties in plane-polarized and cross-polarized light, including their birefringence, pleochroism, twinning, and interference properties with a conoscopic lens. In the electron microprobe, individual locations are analyzed for their exact chemical compositions and variation in composition within individual crystals. Stable and radioactive isotope studies provide insight into the geochemical evolution of rock units. Petrologists can also use fluid inclusion data and perform high temperature and pressure physical experiments to understand the temperatures and pressures at which different mineral phases appear, and how they change through igneous and metamorphic processes. This research can be extrapolated to the field to understand metamorphic processes and the conditions of crystallization of igneous rocks. This work can also help to explain processes that occur within the Earth, such as subduction and magma chamber evolution. === Structural geology === Structural geologists use microscopic analysis of oriented thin sections of geological samples to observe the fabric within the rocks, which gives information about strain within the crystalline structure of the rocks. 
They also plot and combine measurements of geological structures to better understand the orientations of faults and folds to reconstruct the history of rock deformation in the area. In addition, they perform analog and numerical experiments of rock deformation in large and small settings. The analysis of structures is often accomplished by plotting the orientations of various features onto stereonets. A stereonet is a stereographic projection of a sphere onto a plane, in which planes are projected as lines and lines are projected as points. These can be used to find the locations of fold axes, relationships between faults, and relationships between other geological structures. Among the most well-known experiments in structural geology are those involving orogenic wedges, which are zones in which mountains are built along convergent tectonic plate boundaries. In the analog versions of these experiments, horizontal layers of sand are pulled along a lower surface into a back stop, which results in realistic-looking patterns of faulting and the growth of a critically tapered (all angles remain the same) orogenic wedge. Numerical models work in the same way as these analog models, though they are often more sophisticated and can include patterns of erosion and uplift in the mountain belt. This helps to show the relationship between erosion and the shape of a mountain range. These studies can also give useful information about pathways for metamorphism through pressure, temperature, space, and time. === Stratigraphy === In the laboratory, stratigraphers analyze samples of stratigraphic sections that can be returned from the field, such as those from drill cores. Stratigraphers also analyze data from geophysical surveys that show the locations of stratigraphic units in the subsurface. Geophysical data and well logs can be combined to produce a better view of the subsurface, and stratigraphers often use computer programs to do this in three dimensions. 
Stratigraphers can then use these data to reconstruct ancient processes occurring on the surface of the Earth, interpret past environments, and locate areas for water, coal, and hydrocarbon extraction. In the laboratory, biostratigraphers analyze rock samples from outcrop and drill cores for the fossils found in them. These fossils help scientists to date the core and to understand the depositional environment in which the rock units formed. Geochronologists precisely date rocks within the stratigraphic section to provide better absolute bounds on the timing and rates of deposition. Magnetic stratigraphers look for signs of magnetic reversals in igneous rock units within the drill cores. Other scientists perform stable-isotope studies on the rocks to gain information about past climate. == Planetary geology == With the advent of space exploration in the twentieth century, geologists have begun to look at other planetary bodies in the same ways that have been developed to study the Earth. This new field of study is called planetary geology (sometimes known as astrogeology) and relies on known geological principles to study other bodies of the solar system. This is a major aspect of planetary science, and largely focuses on the terrestrial planets, icy moons, asteroids, comets, and meteorites. However, some planetary geophysicists study the giant planets and exoplanets.Although the Greek-language-origin prefix geo refers to Earth, "geology" is often used in conjunction with the names of other planetary bodies when describing their composition and internal processes: examples are "the geology of Mars" and "Lunar geology". Specialized terms such as selenology (studies of the Moon), areology (of Mars), etc., are also in use. Although planetary geologists are interested in studying all aspects of other planets, a significant focus is to search for evidence of past or present life on other worlds. 
This has led to many missions whose primary or ancillary purpose is to examine planetary bodies for evidence of life. One of these is the Phoenix lander, which analyzed Martian polar soil for water, chemical, and mineralogical constituents related to biological processes. == Applied geology == === Economic geology === Economic geology is a branch of geology that deals with aspects of economic minerals that humankind uses to fulfill various needs. Economic minerals are those extracted profitably for various practical uses. Economic geologists help locate and manage the Earth's natural resources, such as petroleum and coal, as well as mineral resources, which include metals such as iron, copper, and uranium. ==== Mining geology ==== Mining geology consists of the extractions of mineral resources from the Earth. Some resources of economic interests include gemstones, metals such as gold and copper, and many minerals such as asbestos, perlite, mica, phosphates, zeolites, clay, pumice, quartz, and silica, as well as elements such as sulfur, chlorine, and helium. ==== Petroleum geology ==== Petroleum geologists study the locations of the subsurface of the Earth that can contain extractable hydrocarbons, especially petroleum and natural gas. Because many of these reservoirs are found in sedimentary basins, they study the formation of these basins, as well as their sedimentary and tectonic evolution and the present-day positions of the rock units. === Engineering geology === Engineering geology is the application of geological principles to engineering practice for the purpose of assuring that the geological factors affecting the location, design, construction, operation, and maintenance of engineering works are properly addressed. Engineering geology is distinct from geological engineering, particularly in North America. 
In the field of civil engineering, geological principles and analyses are used in order to ascertain the mechanical principles of the material on which structures are built. This allows tunnels to be built without collapsing, bridges and skyscrapers to be built with sturdy foundations, and buildings to be built that will not settle in clay and mud. === Hydrology === Geology and geological principles can be applied to various environmental problems such as stream restoration, the restoration of brownfields, and the understanding of the interaction between natural habitat and the geological environment. Groundwater hydrology, or hydrogeology, is used to locate groundwater, which can often provide a ready supply of uncontaminated water and is especially important in arid regions, and to monitor the spread of contaminants in groundwater wells. === Paleoclimatology === Geologists also obtain data through stratigraphy, boreholes, core samples, and ice cores. Ice cores and sediment cores are used for paleoclimate reconstructions, which tell geologists about past and present temperature, precipitation, and sea level across the globe. These datasets are our primary source of information on global climate change outside of instrumental data. === Natural hazards === Geologists and geophysicists study natural hazards in order to enact safe building codes and warning systems that are used to prevent loss of property and life. Examples of important natural hazards that are pertinent to geology (as opposed those that are mainly or only pertinent to meteorology) are: == History == The study of the physical material of the Earth dates back at least to ancient Greece when Theophrastus (372–287 BCE) wrote the work Peri Lithon (On Stones). During the Roman period, Pliny the Elder wrote in detail of the many minerals and metals, then in practical use – even correctly noting the origin of amber. 
Additionally, in the 4th century BCE Aristotle made critical observations of the slow rate of geological change. He observed the composition of the land and formulated a theory where the Earth changes at a slow rate and that these changes cannot be observed during one person's lifetime. Aristotle developed one of the first evidence-based concepts connected to the geological realm regarding the rate at which the Earth physically changes.Abu al-Rayhan al-Biruni (973–1048 CE) was one of the earliest Persian geologists, whose works included the earliest writings on the geology of India, hypothesizing that the Indian subcontinent was once a sea. Drawing from Greek and Indian scientific literature that were not destroyed by the Muslim conquests, the Persian scholar Ibn Sina (Avicenna, 981–1037) proposed detailed explanations for the formation of mountains, the origin of earthquakes, and other topics central to modern geology, which provided an essential foundation for the later development of the science. In China, the polymath Shen Kuo (1031–1095) formulated a hypothesis for the process of land formation: based on his observation of fossil animal shells in a geological stratum in a mountain hundreds of miles from the ocean, he inferred that the land was formed by the erosion of the mountains and by deposition of silt.Nicolas Steno (1638–1686) is credited with the law of superposition, the principle of original horizontality, and the principle of lateral continuity: three defining principles of stratigraphy. The word geology was first used by Ulisse Aldrovandi in 1603, then by Jean-André Deluc in 1778 and introduced as a fixed term by Horace-Bénédict de Saussure in 1779. The word is derived from the Greek γῆ, gê, meaning "earth" and λόγος, logos, meaning "speech". But according to another source, the word "geology" comes from a Norwegian, Mikkel Pedersøn Escholt (1600–1699), who was a priest and scholar. 
Escholt first used the definition in his book titled, Geologia Norvegica (1657).William Smith (1769–1839) drew some of the first geological maps and began the process of ordering rock strata (layers) by examining the fossils contained in them.In 1763, Mikhail Lomonosov published his treatise On the Strata of Earth. His work was the first narrative of modern geology, based on the unity of processes in time and explanation of the Earth's past from the present.James Hutton (1726-1797) is often viewed as the first modern geologist. In 1785 he presented a paper entitled Theory of the Earth to the Royal Society of Edinburgh. In his paper, he explained his theory that the Earth must be much older than had previously been supposed to allow enough time for mountains to be eroded and for sediments to form new rocks at the bottom of the sea, which in turn were raised up to become dry land. Hutton published a two-volume version of his ideas in 1795.Followers of Hutton were known as Plutonists because they believed that some rocks were formed by vulcanism, which is the deposition of lava from volcanoes, as opposed to the Neptunists, led by Abraham Werner, who believed that all rocks had settled out of a large ocean whose level gradually dropped over time. The first geological map of the U.S. was produced in 1809 by William Maclure. In 1807, Maclure commenced the self-imposed task of making a geological survey of the United States. Almost every state in the Union was traversed and mapped by him, the Allegheny Mountains being crossed and recrossed some 50 times. The results of his unaided labours were submitted to the American Philosophical Society in a memoir entitled Observations on the Geology of the United States explanatory of a Geological Map, and published in the Society's Transactions, together with the nation's first geological map. 
This antedates William Smith's geological map of England by six years, although it was constructed using a different classification of rocks. Sir Charles Lyell (1797-1875) first published his famous book, Principles of Geology, in 1830. This book, which influenced the thought of Charles Darwin, successfully promoted the doctrine of uniformitarianism. This theory states that slow geological processes have occurred throughout the Earth's history and are still occurring today. In contrast, catastrophism is the theory that Earth's features formed in single, catastrophic events and remained unchanged thereafter. Though Hutton believed in uniformitarianism, the idea was not widely accepted at the time. Much of 19th-century geology revolved around the question of the Earth's exact age. Estimates varied from a few hundred thousand to billions of years. By the early 20th century, radiometric dating allowed the Earth's age to be estimated at two billion years. The awareness of this vast amount of time opened the door to new theories about the processes that shaped the planet. Some of the most significant advances in 20th-century geology have been the development of the theory of plate tectonics in the 1960s and the refinement of estimates of the planet's age. Plate tectonics theory arose from two separate geological observations: seafloor spreading and continental drift. The theory revolutionized the Earth sciences. Today the Earth is known to be approximately 4.5 billion years old. == Fields or related disciplines == == See also == == References == == External links == One Geology: This interactive geological map of the world is an international initiative of the geological surveys around the globe. This groundbreaking project was launched in 2007 and contributed to the 'International Year of Planet Earth', becoming one of their flagship projects. 
Earth Science News, Maps, Dictionary, Articles, Jobs American Geophysical Union American Geosciences Institute European Geosciences Union Geological Society of America Geological Society of London Video-interviews with famous geologists Geology OpenTextbook Chronostratigraphy benchmarks
distance_matrix.query("topic == 'cs'")[[c for c in distance_matrix.columns if c == "cs"]].melt().describe()
value | |
---|---|
count | 97336.000000 |
mean | 0.780865 |
std | 0.191023 |
min | 0.000000 |
25% | 0.735456 |
50% | 0.837289 |
75% | 0.902842 |
max | 0.989425 |
distance_matrix.query("topic == 'biology'")[[c for c in distance_matrix.columns if c == "biology"]].melt().describe()
value | |
---|---|
count | 103823.000000 |
mean | 0.816193 |
std | 0.165813 |
min | 0.000000 |
25% | 0.773717 |
50% | 0.860893 |
75% | 0.914119 |
max | 0.989218 |
distance_matrix.query("topic == 'biology'")[[c for c in distance_matrix.columns if c == "cs"]].melt().describe()
value | |
---|---|
count | 99452.000000 |
mean | 0.923513 |
std | 0.050810 |
min | 0.454301 |
25% | 0.904013 |
50% | 0.934138 |
75% | 0.957650 |
max | 0.997616 |
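A note on the `count` rows above: they are larger than the number of article pairs. When column labels repeat, selecting with a list of labels returns every matching column once per requested label, so each distance is counted several times. The mean and the quartiles are not affected by this repetition. The following minimal sketch reproduces the effect on a made-up 3x3 matrix; the names `toy`, `repeated` and `masked` and all values are assumptions for illustration, not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy distance matrix (made-up values): two "cs" articles, one "biology" article.
labels = ["cs", "cs", "biology"]
toy = pd.DataFrame(
    [[0.0, 0.2, 0.9],
     [0.2, 0.0, 0.8],
     [0.9, 0.8, 0.0]],
    index=pd.Index(labels, name="topic"),
    columns=labels,
)

# Keep only the rows that belong to "cs" articles.
cs_rows = toy[toy.index == "cs"]

# Selecting with a list of repeated labels returns each matching column
# once per requested label: 2 labels x 2 matching columns = 4 columns.
repeated = cs_rows[[c for c in toy.columns if c == "cs"]]

# A boolean mask selects each matching column exactly once.
masked = cs_rows.loc[:, toy.columns == "cs"]

print(repeated.shape, masked.shape)
print(repeated.to_numpy().mean(), masked.to_numpy().mean())
```

Both selections give the same mean, but only the masked version has one value per article pair.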
Looking only at the intra-topic distances, we see that both groups have a similar internal structure.
Texts about computer science also seem to be slightly more similar to each other than those about biology.
The distance matrix also shows a small group of (around 3) texts that are relatively close to the other category.
Let's find out what they are about:
articles_dataset.iloc[
distance_matrix.query("topic == 'biology'")[[c for c in distance_matrix.columns if c == "cs"]].mean(axis=1).argsort()
].head(5)
url | title | summary | text | num_views | num_edits | categories | revision_id | topic | |
---|---|---|---|---|---|---|---|---|---|
9 | https://en.wikipedia.org/wiki/Computational_bi... | Computational biology | Computational biology refers to the use of dat... | Computational biology refers to the use of dat... | 925265 | 480 | All articles with unsourced statements, Articl... | 1131620059 | biology |
14 | https://en.wikipedia.org/wiki/Systems_biology | Systems biology | Systems biology is the computational and mathe... | Systems biology is the computational and mathe... | 786795 | 1253 | All articles with style issues, All articles w... | 1129297987 | biology |
35 | https://en.wikipedia.org/wiki/Mathematical_and... | Mathematical and theoretical biology | Mathematical and theoretical biology, or bioma... | Mathematical and theoretical biology, or bioma... | 651088 | 897 | All articles lacking reliable references, All ... | 1123208062 | biology |
46 | https://en.wikipedia.org/wiki/Antenna_(biology) | Antenna (biology) | Antennae (sg. antenna), sometimes referred to ... | Antennae (sg. antenna), sometimes referred to ... | 678988 | 310 | All articles with unsourced statements, Arthro... | 1132658967 | biology |
19 | https://en.wikipedia.org/wiki/Homology_(biology) | Homology (biology) | In biology, homology is similarity due to shar... | In biology, homology is similarity due to shar... | 1844028 | 1229 | All articles with unsourced statements, Articl... | 1133813026 | biology |
Clustering is the task of automatically finding groups (aka clusters) of similar objects in your data.
Suppose, for example, that we had no labels for our wiki dataset: we would expect clustering to reveal that there are two types of articles.
In more detail, the objective of clustering is two-fold:
Clustering is the task of finding a finite number of clusters in a dataset, such that:
Clustering is the most widely used technique of unsupervised learning.
Unsupervised learning is a category of techniques that find patterns in your data without requiring manual supervision (aka labelling of objects as examples).
There are many different approaches to clustering. To categorize them broadly, we can differentiate them by their notion of similarity.
Additionally, there is one important practical difference between clustering algorithms: How they find the number of clusters.
Most clustering techniques treat the number of clusters as a hyperparameter, and the user has to decide upfront how many clusters there are (or they want to find).
A small set of clustering techniques can infer the number of clusters themselves. In those cases, however, the number of clusters found depends heavily on other hyperparameters, so it should also be treated with caution.
Scikit-Learn
Scikit-Learn is Python's most popular library for all sorts of traditional (i.e., non-deep-learning) machine learning.
It bundles a wide variety of fundamental and robust models and algorithms into a package with a coherent API, which makes it easy to quickly replace or update parts of an ML pipeline.
Additionally, it provides tools for feature extraction and evaluation.
Like pandas, numpy or seaborn, you have to install it manually:
pip install scikit-learn  # or: conda install scikit-learn
In contrast to the other packages we already used, you rarely need everything that Scikit-Learn offers at once.
So instead of importing the whole library, you'll import only those parts you need:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
...
As you can see, the library is split up into multiple submodules.
Most of Scikit-Learn's models and feature extraction routines come as classes.
These classes follow the same API.
For example, if you want to train a classifier to distinguish between texts about computer science and biology, you simply import the type of classifier you want to use and train it with one convenient method call:
Note: For now, don't worry about what we are doing in this example; it only serves to show how the API works!
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(np.asarray(freqs), articles_dataset["topic"])
LogisticRegression()
By calling the fit method and passing a set of articles in DTM format and their categories (i.e., labels) as arguments, we train the model to differentiate texts about CS and biology.
After the training is finished, the model may expose new attributes that end with an underscore (_).
Those are the parameters that have been learned during training:
classifier.coef_.shape
classifier.coef_
array([[-1.11633237e-02, -1.47143594e-03, 3.66969873e-02, 4.46922925e-02, 1.90997402e-02, 1.64049128e-02, 1.71918438e-02, 2.18204172e-02, 6.37273684e-02, -3.29250823e-03, -4.98043455e-02, -1.40451101e-01, -1.80122642e-02, 4.06017421e-02, 6.23755334e-02, 8.89441988e-02, 2.83597139e-02, -2.97716049e-03, -4.65948763e-02, 2.27337654e-02, -9.94350672e-02, -1.78192234e-01, 5.91906095e-02, 3.59999471e-02, -1.69525628e-02, -1.66512209e-02, -3.82389736e-03, -7.20959575e-02, 3.22806470e-02, 4.44450537e-03, -3.23058673e-02, 1.25253669e-02, 1.30899792e-02, 8.93481185e-02, 2.86762463e-03, 1.10700314e-02, -1.22144878e-01, -5.71595941e-03, 3.66282658e-04, 1.56995261e-02, -1.22357373e-02, 1.59408582e-02, 4.40903116e-02, -3.60543985e-03, -2.48614426e-02, -1.68921206e-01, -5.58646691e-01, 6.58017319e-02, 1.76604564e-02, -1.15903113e-01, -3.36370801e-02, -1.85469503e-02, -3.75705567e-02, 2.83317276e-02, -3.10387422e-02, 3.08129710e-02, 2.18607458e-02, -3.19521642e-01, -2.52779233e-01, -4.82589013e-02, -6.03900877e-02, 1.53042621e-02, -2.47277980e-02, -6.22708433e-03, 2.12911274e-02, 9.59108531e-03, -3.86259146e-02, 1.50187482e-03, -7.31890680e-03, -1.45703128e-01, 4.12161065e-02, 1.03176959e-01, -3.42082789e-02, -5.67937582e-02, -2.61103605e-02, 3.19469475e-02, -3.38260514e-02, -6.53638132e-02, 1.87109182e-02, 9.84160694e-02, -1.91896698e-02, 3.21464980e-02, 3.30259652e-03, 8.73666164e-02, 2.98608536e-02, 7.91984593e-01, 3.94995315e-01, 6.15712251e-02, 1.61218589e-01, 4.90423344e-03, 7.93206107e-03, -3.57386815e-02, 3.57547248e-02, -5.83095773e-02, -5.52726203e-03, 1.51861306e-02, -8.06160322e-03, 4.00984908e-02, -1.28021078e-02, -2.76722739e-03, -3.75358054e-03, -2.84465596e-02, -2.77620613e-02, 4.26837901e-01, 6.26844814e-02, 2.54898044e-02, 3.78441271e-02, 2.60777435e-02, -1.54149552e-02, 3.12747119e-04, -1.56816249e-02, 1.02199763e-01, 2.99451250e-02, -7.76510394e-03, -2.41259588e-02, -8.79856454e-02, -3.08557666e-02, 5.57078216e-02, 4.09424937e-03, 3.77930060e-02, 
-4.16172413e-02, 3.06651098e-02, -1.81801698e-01, 3.10753576e-03, -2.90793185e-02, 3.00639765e-02, -3.39551916e-02, -6.97539640e-02, -5.12769431e-02, 9.84049356e-03, 1.48562859e-02, 5.42737258e-02, 2.64912843e-02, -1.56071181e-02, -3.85628846e-02, 1.09073354e-01, 1.77568748e-02, -1.51665501e-02, -6.59558165e-02, 3.71680149e-02, -2.71636271e-02, -3.57579030e-02, -2.71691545e-02, -1.11886925e-01, -1.06504161e-01, 5.44206902e-02, -5.29888653e-03, 4.71328512e-02, 2.83829612e-03, 3.98387522e-02, -5.23891952e-02, -4.85792532e-02, -3.34252642e-02, -5.25727541e-02, -4.93095858e-02, -8.54501962e-02, -3.05661070e-02, 2.71456794e-02, 2.26488770e-02, 7.12734254e-02, 2.35900281e-03, -4.08279719e-02, -9.81265901e-02, 5.32019644e-02, -2.97856635e-02, -4.43270426e-02, -4.59389642e-03, 9.05909366e-02, 3.05820727e-02, 1.12216926e-02, 1.27520537e-02, -4.78654036e-02, 5.23364016e-02, -8.34303024e-02, 1.11162466e-02, 1.36551952e-02, -7.87588691e-02, -1.22695072e-01, -6.34126309e-02, -6.38278373e-02, -6.58404707e-02, -5.73449261e-02, -4.18624670e-02, -3.97320758e-02, 2.06972600e-02, 2.59053129e-03, -7.79654819e-02, -1.04203264e-01, -5.75346073e-02, 4.73933307e-02, -1.49645760e-02, 1.87818478e-02, -1.08451366e-02, -3.57126500e-02, -9.45262781e-02, -9.23691544e-02, -4.04544314e-02, -4.33165829e-02, 2.19591668e-02, 3.74459993e-02, 3.15834153e-02, -5.84065401e-02, -1.24504978e-02, -2.46501146e-02, -3.27867421e-02, -5.31501466e-02, -4.34086567e-02, 1.10753409e-01, 5.37251645e-02, 2.15755205e-02, -7.03657040e-04, 4.09387451e-02, 5.05764103e-02, 3.45464094e-02, 2.59528062e-03, -3.75720431e-02, -1.78566223e-03, -3.18314851e-02, -2.87080117e-02, -2.10695033e-02, -9.10627803e-03, -1.18166049e-02, 7.19441808e-02, -1.58139554e-01, -1.00094007e-02, 5.58276968e-03, -5.77554722e-02, 1.84361132e-02, -3.97105393e-03, -4.68604709e-02, 1.89925638e-01, 2.15231726e-01, -4.48387804e-02, -1.70773150e-02, -2.99474745e-02, 1.74642764e-03, 2.04813621e-02, 2.83048136e-02, 2.82142019e-02, 3.32761831e-02, 
-3.87784281e-02, 1.86949258e-02, -1.86039715e-01, -1.00603067e-02, -2.60913280e-02, -4.28919862e-02, 2.58133240e-02, -1.55749580e-02, -1.13553614e-01, 2.21047330e-02, 1.54242557e-02, 6.32439049e-02, 1.06379931e-01, 4.42994532e-02, -3.79505950e-02, 2.87563793e-02, 9.40401445e-02, 4.76861860e-02, 4.94747110e-02, -1.18643700e-02, -3.67323666e-02, 6.36656901e-03, -1.33198437e-03, 3.48374004e-02, -6.90567593e-02, -2.92914612e-02, 5.23554212e-02, 6.04427730e-02, -8.46483873e-03, -1.06184358e-02, -2.22457916e-02, -5.68078997e-02, 2.00774278e-01, 5.00052980e-02, 3.03185686e-02, 1.21883404e-02, 4.46500860e-02, 8.37086397e-03, -2.57761879e-03, -1.84247546e-02, -1.13954224e-02, -9.69173297e-02, -4.46002544e-02, -3.99730927e-02, -4.00321266e-02, 2.13972634e-02, -6.36083713e-03, -8.48165004e-02, -2.32184181e-02, 3.97805364e-03, -3.07890527e-02, 3.27493711e-02, 6.71551214e-03, -6.75034250e-02, 1.23895303e-01, 5.50681908e-02, 8.34323167e-03, 2.44280278e-02, 4.35127316e-02, 1.81199063e-02, 1.78055454e-01, 5.33632097e-02, 8.87309847e-02, -2.59350476e-02, -6.42714621e-02, -3.33936119e-02, 3.31167971e-02, 7.01395465e-02, -1.92706510e-03, 5.82422401e-02, 4.12351371e-02, 4.91822319e-02, -5.70302515e-02, -7.10676886e-02, -1.37647156e-01, -2.23362700e-01, -9.90566120e-02, 5.68773934e-02, -4.71164130e-02, -1.28439976e-03, 3.34128247e-02, 9.27288256e-03, -1.47646693e-04, 1.11975981e-02, 4.91782172e-02, 9.06417190e-03, -4.73515577e-05, 7.86769100e-02, -2.01075330e-02, -1.60249165e-01, -2.53632352e-01, 2.26769071e-02, -2.49450824e-02, 2.97273785e-02, -9.72528418e-02, 2.03690113e-02, 1.76293498e-02, -2.54357268e-02, -2.44910568e-02, -1.28244578e-03, 7.95478306e-02, 4.25161743e-02, 1.93800964e-02, -4.79279532e-02, 4.66500242e-02, -3.49915055e-02, -5.46582723e-02, 2.18499116e-01, 4.08249344e-02, 2.82663320e-01, 5.51587156e-02, 6.85553431e-02, 1.05417769e-01, -5.08228345e-03, -2.18968230e-03, -2.54587163e-02, 1.95061305e-02, -8.02862593e-02, -5.69804609e-02, 4.48280896e-02, -8.97242271e-03, 
1.08022182e-02, -4.15291461e-02, 3.80409299e-02, -1.91058669e-02, -2.96402289e-02, 2.87456188e-02, 5.22061978e-02, 3.88108315e-02, 3.64426654e-02, 4.12939317e-02, 8.69343551e-02, -5.34332605e-02, -1.34055463e-03, -1.01729363e-02, -4.64127603e-02, -3.86917666e-02, 7.94390207e-03, 4.59189993e-02, 2.07411648e-02, 3.18059989e-02, -6.13984090e-02, -1.09557077e-02, -1.63439934e-03, 1.05997832e-02, -2.04909815e-02, -3.24829837e-02, 9.10222504e-03, -1.06302195e-01, -5.30796802e-02, -4.03394345e-02, -2.37499390e-02, 1.04739875e-03, -2.02637502e-02, 3.53381352e-01, -4.51995034e-02, -6.41102389e-02, -1.61092804e-02, 7.13384960e-02, -4.25974373e-02, 2.77472433e-02, -1.65934765e-02, -3.36618265e-02, 1.00560733e-01, 5.00861113e-04, -6.40364439e-03, -1.87719288e-02, -3.84959445e-02, -3.28433489e-02, 7.95073944e-02, 7.76962470e-02, 2.26594807e-02, -7.79505362e-03, 7.03243542e-03, 6.84015626e-03, -3.43551383e-02, 9.50835460e-03, 1.46782491e-02, -2.89826120e-02, -8.33580119e-03, -4.23878028e-02, -2.36776010e-02, -1.79628983e-02, 1.39185176e-01, 4.43681590e-02, 1.21827031e-02, 2.75494084e-02, 9.69089624e-03, -3.57070026e-01, -3.24759247e-02, 1.94945514e-02, 2.68389805e-02, 1.43115272e-01, 6.34077589e-02, 5.10130365e-02, 2.36331564e-02, 3.03747553e-02, 6.16634909e-02, 4.72311115e-02, -8.81636056e-02, 7.42585514e-02, 3.96574476e-02, -2.49728646e-02, -7.19698783e-02, 2.03987604e-02, -3.02826391e-02, -1.13429332e-01, 1.78812191e-02, -1.49933475e-02, -2.88767784e-02, -6.18932609e-02, -2.05672188e-02, 9.72277593e-02, -8.53036711e-02, -7.90872886e-02, -4.03117312e-03, 5.71215798e-02, -1.85923306e-01, -1.87668079e-03, 2.94923185e-02, 1.59184494e-02, 7.29433702e-02, 6.83186937e-02, -1.05632830e-02, -1.29535728e-01, -8.76563503e-02, -7.13052287e-02, -3.22303271e-02, 6.11832439e-03, 3.68352574e-04, 1.08667635e-01, 1.39365914e-02, 3.20827033e-02, -6.13275281e-03, -1.31942014e-02, 3.39933524e-03, -3.10787086e-02, 5.57963080e-02, 2.96821697e-02, -7.43772399e-02, -6.17040949e-03, 6.93893648e-02, 
2.53435672e-02, 6.72528994e-02, 1.00931530e-02, 1.76761255e-01, 6.25978900e-02, 1.64706823e-01, 8.70921953e-02, -2.52153075e-02, -3.04149830e-02, -3.80093744e-02, 2.75739016e-02, -9.32310520e-03, 4.19253050e-02, -2.68932525e-02, 1.03856566e-02, 1.11958929e-02, 2.96166671e-02, -3.03517309e-02]])
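To make the underscore convention more tangible, here is a minimal sketch on a tiny toy dataset (the data and the names `X_toy`, `y_toy` and `clf` are made up for illustration and are not part of our wiki dataset). It checks that `coef_` does not exist before `fit` is called and appears afterwards:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 4 samples, 2 features, 2 classes
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]])
y_toy = np.array(["biology", "cs", "biology", "cs"])

clf = LogisticRegression()
fitted_before = hasattr(clf, "coef_")  # learned attributes do not exist yet

clf.fit(X_toy, y_toy)
fitted_after = hasattr(clf, "coef_")   # after fitting, they do

print(fitted_before, fitted_after)
print(clf.classes_)     # class labels seen during fit
print(clf.coef_.shape)  # one row of weights (binary case), one column per feature
```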
After a model has been trained on a dataset, we can use the predict method to classify new data:
my_text = "In this essay, I want to prove that computer fungi are chip animals!"
feats = (cv.transform([my_text]))
print(feats.shape)
print(classifier.predict(feats))
print(classifier.predict_proba(feats))
(1, 500)
['cs']
[[0.36993966 0.63006034]]
Because all models follow one coherent API, you can quickly exchange parts without having to rewrite a lot of code!
If you want to use another classifier, the same logic applies:
from sklearn.svm import LinearSVC
classifier = LinearSVC()
classifier.fit(np.asarray(freqs), articles_dataset["topic"])
my_text = "In this essay, I show that P equals NP if the algorithm is written in TurboPascal."
feats = cv.transform([my_text])
classifier.predict(feats)
array(['cs'], dtype=object)
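The shared API also means that the model can be treated as a plain variable. The following sketch illustrates this with a made-up mini corpus (the texts, labels and variable names are hypothetical stand-ins for our wiki data): both classifiers go through exactly the same `fit` and `predict` calls inside one loop.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Hypothetical mini corpus standing in for the wiki dataset
train_texts = [
    "cells and proteins in the organism",
    "the genome of the organism",
    "the algorithm sorts the array",
    "compiler and algorithm design",
]
train_labels = ["biology", "biology", "cs", "cs"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_new = vectorizer.transform(["a fast algorithm for the array"])

# The loop body never changes, no matter which model we plug in
predictions = {}
for model in (LogisticRegression(), LinearSVC()):
    model.fit(X_train, train_labels)
    predictions[type(model).__name__] = model.predict(X_new)[0]

print(predictions)
```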
Scikit-Learn offers a variety of routines to convert your data into features.
We already used the CountVectorizer, which takes a collection of texts and builds a document-term matrix.
Like the actual models, these preprocessing steps are implemented as classes and also follow a coherent API (which is even similar to the models' API):
from sklearn.feature_extraction.text import CountVectorizer
texts = ["Alpha sagt Beta", "Beta sagt Alpha"]
count_vec = CountVectorizer()
count_vec.fit(texts)
count_vec.vocabulary_
{'alpha': 0, 'sagt': 2, 'beta': 1}
Similar to the models, preprocessing routines often have to be fitted to your data as well.
For example, during fitting, the CountVectorizer builds the vocabulary, which is later used to fill the document-term matrix.
After the CountVectorizer is fitted, we obviously can't use it to predict new data; instead, we use it to transform the texts into a document-term matrix:
X = count_vec.transform(texts)
X.todense()
matrix([[1, 1, 1], [1, 1, 1]], dtype=int64)
Because fitting and transforming the ("training") data are often done back to back, there is also a shortcut that does both steps with just one call:
X = count_vec.fit_transform(texts)
count_vec.vocabulary_, X.shape, X.todense()
({'alpha': 0, 'sagt': 2, 'beta': 1}, (2, 3), matrix([[1, 1, 1], [1, 1, 1]], dtype=int64))
Note: If you want to encode more texts into a document-term matrix of the same format (i.e., same vocabulary), you can't use fit_transform because it'll overwrite the old vocabulary. In that case, just use transform!
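A small sketch of that difference, reusing the two example sentences from above plus one made-up new sentence (the sentence and the names `X_new` and `refit_vec` are only for illustration): `transform` keeps the fitted vocabulary and silently ignores unseen words, while fitting again builds a different vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer()
count_vec.fit(["Alpha sagt Beta", "Beta sagt Alpha"])
vocab_before = dict(count_vec.vocabulary_)

# transform() reuses the fitted vocabulary; the unseen word "gamma" is ignored
X_new = count_vec.transform(["Gamma sagt Alpha Alpha"]).toarray()
vocab_after_transform = dict(count_vec.vocabulary_)
print(X_new)  # columns follow the old vocabulary: alpha, beta, sagt

# Fitting on the new text builds a new vocabulary from scratch
refit_vec = CountVectorizer()
refit_vec.fit_transform(["Gamma sagt Alpha Alpha"])
print(refit_vec.vocabulary_)  # contains "gamma", but no longer "beta"
```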
Due to its conceptual and technical simplicity, KMeans is the most widely applied clustering algorithm.
Originally developed independently by two groups, it's also known under the name of "clustering via variance reduction".
Its outline is simple:
The goal of KMeans is to find an optimal set of $K$ centroids given a dataset.
The general algorithm for that can be expressed neatly in pseudocode:
1. Initialize K centroids randomly
2. Compute initial clustering => Assignment of each point to its nearest centroid
3. Repeat until point-to-cluster assignment does not change anymore:
4.     Compute new set of centroids by taking the average of all points of a cluster
5.     Recompute the assignment by assigning each point to its nearest updated centroid
6. Return final point-to-cluster assignment
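The pseudocode translates almost line by line into NumPy. The following is only a sketch for illustration, not how Scikit-Learn implements KMeans; the function name `kmeans_sketch`, the choice to initialize with randomly picked data points, and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, rng, max_iter=100):
    """Plain-NumPy sketch of the KMeans loop from the pseudocode."""
    # 1. Initialize K centroids by picking K distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    # 2. Initial assignment: index of the nearest centroid for every point
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)
    for _ in range(max_iter):
        # 4. New centroid = mean of all points of a cluster
        #    (an empty cluster keeps its old centroid)
        centroids = np.array([
            X[assignment == j].mean(axis=0) if np.any(assignment == j) else centroids[j]
            for j in range(k)
        ])
        # 5. Reassign every point to its nearest updated centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        # 3. Stop as soon as the assignment does not change anymore
        if np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment
    # 6. Return the final assignment (plus the centroids for convenience)
    return assignment, centroids

rng = np.random.default_rng(0)
# Two well-separated hypothetical blobs in 2D
X_demo = np.vstack([rng.normal(0, 0.3, size=(20, 2)), rng.normal(5, 0.3, size=(20, 2))])
labels, centers = kmeans_sketch(X_demo, k=2, rng=rng)
print(labels)
```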
While the algorithm itself is easy to express and lightweight to compute, it has some constraints you should know about before applying KMeans:
To overcome the problem of only finding a locally optimal solution, we can simply compute the algorithm multiple times with different initializations, compare the resulting clusterings, and choose the best one!
Question: How can we determine the internal quality of a KMeans clustering (without any external supervision)?
Answer: Determine how close the points within a cluster lie together!
$$ \textrm{Cost for a single cluster: }\mathit{TD}(c) = \sum_{p \in c} d_{E}(p, \mathit{centroid}_c)^2 $$
$$ \textrm{Cost for a complete clustering: }\mathit{TD} = \sum_{i = 1}^{K} \mathit{TD}(c_i) $$
But enough theory, let's apply KMeans to synthetic data to see how it works:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# generating synthetic data from isotropic Gaussian distributions
X, y = make_blobs(n_features=2, n_samples=500, centers = 5)
kmeans = KMeans(n_clusters=5)
cluster_pred = kmeans.fit_predict(X)
cluster_df = pd.DataFrame({
    "x0": X[:, 0],
    "x1": X[:, 1],
    "y": y,
    "cluster_pred": cluster_pred
})
sns.scatterplot(
    x="x0",
    y="x1",
    style="y",
    hue="cluster_pred",
    data=cluster_df
)
# Draw centroids
for centroid in kmeans.cluster_centers_:
    plt.scatter(*centroid)
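The cost $\mathit{TD}$ defined earlier is what a fitted Scikit-Learn `KMeans` model reports in its `inertia_` attribute, and the `n_init` parameter controls how many differently initialized runs are compared on that cost. The sketch below recomputes the cost by hand on synthetic data; the data and the names `X_demo`, `km` and `td` are only chosen for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)

# n_init=10 runs the algorithm 10 times with different initializations
# and keeps the run with the lowest cost
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# TD: sum of squared Euclidean distances of each point to its own centroid
own_centroids = km.cluster_centers_[km.labels_]
td = np.sum((X_demo - own_centroids) ** 2)

print(td, km.inertia_)  # both numbers should agree up to rounding
```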
We can also apply KMeans to our dataset to check if we can find our topics without external labels.
Again we'll use the word frequencies as features.
from sklearn.cluster import KMeans
kmeans = KMeans(
    n_clusters=3, # Fortunately, we know how many groups there are
    n_init=50, # We want to compute 50 different clusterings and choose the best one
)
kmeans.fit(np.asarray(freqs))
clusters_prediction = kmeans.predict(np.asarray(freqs))
articles_dataset["cluster_prediction"] = clusters_prediction
articles_dataset["cluster_prediction"].value_counts()
Evaluating clustering can be tricky because, in settings where we apply clustering, we usually have limited to no external data to see if our clusters correspond to meaningful and coherent groups.
Often, you'll need to check them manually and incorporate any useful metadata you have to check if the clustering can be useful.
Depending on the type of clustering algorithm, there might be metrics to internally validate different clusterings of the same data.
For example, to find a suitable $K$ for KMeans clustering, you can use the Silhouette Score.
But all those measures are limited to checking certain general and broad assumptions about what a good clustering looks like, without being able to account for your specific data.
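As a rough sketch of how the Silhouette Score could guide the choice of $K$, we can cluster the same data with several values of $K$ and keep the one with the highest score. The synthetic data (four clearly separated blobs) and the variable names are assumptions for this example; on real data the maximum is usually far less pronounced.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: 4 clearly separated blobs
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Cluster with several candidate values of K and score each result
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```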
If you happen to have labels (even just a small number of them), you can employ external validation metrics to check the quality of your clustering.
These measures quantify the overlap between the ground-truth clustering (=> labels) and the inferred clustering.
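One such measure in Scikit-Learn is `adjusted_rand_score`. The toy labels below are made up for illustration; the sketch shows that the score only cares about which points are grouped together, not about what the clusters are called.

```python
from sklearn.metrics import adjusted_rand_score

truth = ["bio", "bio", "bio", "cs", "cs", "cs"]

same_grouping = [1, 1, 1, 0, 0, 0]  # identical groups, only the cluster names differ
one_mistake = [1, 1, 0, 0, 0, 0]    # one biology article ends up in the cs cluster

score_same = adjusted_rand_score(truth, same_grouping)
score_mistake = adjusted_rand_score(truth, one_mistake)

print(score_same)     # perfect agreement
print(score_mistake)  # lower, because one point is grouped differently
```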
Clustering is about finding interesting groups within your data, so even if you happen to have ground truth data, a clustering that does not follow those categories is not necessarily useless or wrong.
Your clustering might reveal different groups or categories that are simply not reflected by your labels.
For example, if we had used different features on our Wikipedia dataset, we might have found clusters that capture a notion of theoretical vs. empirical subfields, dividing the articles on both biology and computer science based on whether they are concerned with theoretical or empirical questions.
So if you explicitly use clustering for exploring your data, always check your clusters manually and use a battery of different algorithms, hyperparameters and features.
If you see stable patterns emerge across a large number of different clusterings, you can be fairly sure you found something worthy of further investigation.
Leaving any of those considerations aside, and just to check if our methodology works, we use the Adjusted Rand Score to externally evaluate our clustering:
pd.set_option('display.max_rows', None)
articles_dataset[["title", "text", "topic", "cluster_prediction"]]
from sklearn.metrics import adjusted_rand_score
ari = adjusted_rand_score(articles_dataset["topic"], articles_dataset["cluster_prediction"])
ari
articles_dataset.query("(topic == 'cs' and cluster_prediction != 1 and cluster_prediction != 2) or (topic == 'biology' and cluster_prediction != 0)")
print(articles_dataset.query("title == 'Competition'").iloc[0].text[:2000])
We can see that the inferred clustering matches the topics reasonably well, and judging by the title of the mis-clustered article, the error seems understandable, since the article is about a topic that draws inspiration from biology.
Instead of trying to find a fixed set of synthetic centroids representing a cluster, DBScan takes a completely different approach to identify clusters:
It searches the vector space for regions of high density (i.e., regions where many data points lie close together) and defines those regions as clusters.
Compared to KMeans it has several advantages that make it attractive:
Let's compare DBScan and KMeans on synthetic data:
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=100)
kmeans = KMeans(n_clusters=2)
cluster_pred_kmeans = kmeans.fit_predict(X)
cluster_df_kmeans = pd.DataFrame({
    "x0": X[:, 0],
    "x1": X[:, 1],
    "y": y,
    "cluster_pred": cluster_pred_kmeans
})
dbscan = DBSCAN()
cluster_pred_dbscan = dbscan.fit_predict(X)
cluster_df_dbscan = pd.DataFrame({
    "x0": X[:, 0],
    "x1": X[:, 1],
    "y": y,
    "cluster_pred": cluster_pred_dbscan
})
fig, axs = plt.subplots(2, 1, figsize=(10, 10))
sns.scatterplot(
    x="x0",
    y="x1",
    style="y",
    hue="cluster_pred",
    data=cluster_df_kmeans,
    ax=axs[0]
)
axs[0].set_title("KMeans-Clustering")
sns.scatterplot(
    x="x0",
    y="x1",
    style="y",
    hue="cluster_pred",
    data=cluster_df_dbscan,
    ax=axs[1]
)
axs[1].set_title("DBSCAN-Clustering")
plt.tight_layout()
While KMeans was not able to retrieve the original groups, DBScan succeeded, even without us setting any hyperparameter manually.
But how does DBScan work?
While the exact algorithm of DBScan isn't hard to understand, it's relatively tedious to read and execute, so we skip this part and only look at the details that are relevant for choosing suitable hyperparameters.
https://www.researchgate.net/publication/335485895/figure/fig2/AS:797412515909651@1567129367940/A-single-DBSCAN-cluster-with-Core-Border-and-Noise-Points.ppm
DBScan assigns each point in the dataset to one of three classes:
To create a cluster, DBScan picks a random point from the dataset. If it is a core point, DBScan expands the cluster until it has found all core and border points that are reachable through a chain of core points starting from the picked point.
In essence, you need to define the parameters $eps$, $minPts$ and optionally a distance function to control how DBScan finds clusters!
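To see how $eps$ and $minPts$ (called `eps` and `min_samples` in Scikit-Learn) decide whether a point is a core, border or noise point, consider this small sketch. The six points are hand-picked for illustration, and the `point_types` array is only a helper for this example; Scikit-Learn itself exposes the core points via `core_sample_indices_` and marks noise with the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical layout: a dense chain, two points at its fringes, one far outlier
X_demo = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0],
    [0.42, 0.0],  # fringe of the chain
    [5.0, 5.0],   # far away from everything
])

# A point is a core point if at least min_samples points (itself included)
# lie within a radius of eps around it
db = DBSCAN(eps=0.15, min_samples=3).fit(X_demo)

core_mask = np.zeros(len(X_demo), dtype=bool)
core_mask[db.core_sample_indices_] = True
noise_mask = db.labels_ == -1  # noise points get the label -1

# Everything that is neither core nor noise is a border point
point_types = np.where(core_mask, "core", np.where(noise_mask, "noise", "border"))
print(db.labels_)
print(point_types)
```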
Again, let's find out what DBScan finds in our dataset:
dbscan = DBSCAN(
    eps=0.55,
    min_samples=4,
    metric="cosine"
)
cluster_prediction_dbscan = dbscan.fit_predict(np.asarray(freqs))
articles_dataset["cluster_prediction_dbscan"] = cluster_prediction_dbscan
articles_dataset["cluster_prediction_dbscan"].value_counts()
for cluster_idx in articles_dataset["cluster_prediction_dbscan"].unique():
    print(cluster_idx)
    print(*articles_dataset.query("cluster_prediction_dbscan == @cluster_idx").title, sep=" | ", end="\n\n")
The results shed light on the drawbacks of DBScan: even though we do not need to specify the number of clusters explicitly, we have to tune the other parameters carefully to find clusters, and even if we find some clusters, chances are high that a lot of points get labeled as noise.
To overcome these stumbling blocks, we have, broadly speaking, two options:
Although it is in parts closely connected to distance-based clustering methods like KMeans, hierarchical clustering does not aim at returning a single fixed clustering. Rather, it sorts all data points into a similarity-based hierarchy from which many clusterings can be derived.
This is particularly useful when working on smaller datasets because it enables you to quickly get a global overview of your data.
Its algorithm is straightforward:
1. Initially, each data point is a cluster of its own.
2. While not all data points belong to the same cluster, repeat:
3.     Find the two clusters with minimal distance to each other, and merge them into one cluster.
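The three steps can be written down as a deliberately naive NumPy sketch. This is for illustration only and not how Scikit-Learn implements it; the function name `agglomerative_sketch`, the use of complete linkage (distance between the two farthest members) and the toy data are assumptions made for this example.

```python
import numpy as np

def agglomerative_sketch(X):
    """Naive agglomerative clustering (complete linkage); returns the merge history."""
    # 1. Every data point starts as its own cluster (stored as a list of point indices)
    clusters = {i: [i] for i in range(len(X))}
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    merges = []
    next_id = len(X)
    # 2. Repeat until all points belong to the same cluster
    while len(clusters) > 1:
        # 3. Find the two clusters with minimal distance to each other ...
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                d = pairwise[np.ix_(clusters[a], clusters[b])].max()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # ... and merge them into one new cluster
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, float(d)))
        next_id += 1
    return merges

# Hypothetical 1D data: two tight pairs and one far-away point
X_demo = np.array([[0.0], [0.1], [1.0], [1.3], [5.0]])
merges = agglomerative_sketch(X_demo)
for a, b, d in merges:
    print(f"merge cluster {a} with cluster {b} at distance {d:.2f}")
```

New clusters get fresh ids (5, 6, ...), which mirrors how merge histories are usually recorded for plotting.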
As you can guess from the algorithm, this method also does not have many hyperparameters, which makes it quite robust.
The two decisions you have to make are:
There are three strategies to measure the distance between clusters:
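Three strategies that are commonly used for this are single, complete and average linkage (which three are meant here is an assumption on our part). The sketch below computes all of them for two small, made-up clusters so that the difference becomes visible in numbers:

```python
import numpy as np

# Two hypothetical clusters of 1D points
cluster_a = np.array([[0.0], [1.0]])
cluster_b = np.array([[3.0], [6.0]])

# All pairwise Euclidean distances between members of a and members of b
pairwise = np.linalg.norm(cluster_a[:, None, :] - cluster_b[None, :, :], axis=2)

single_link = pairwise.min()    # distance between the two closest members
complete_link = pairwise.max()  # distance between the two farthest members
average_link = pairwise.mean()  # mean over all cross-cluster pairs

print(single_link, complete_link, average_link)  # 2.0 6.0 4.0
```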
Instead of returning a single clustering, hierarchical clustering returns a hierarchy of clusters.
Often, this hierarchy is visualized as a dendrogram.
Let's compute the dendrogram for our wiki dataset:
from sklearn.cluster import AgglomerativeClustering
# Note: `metric` replaces the `affinity` argument, which is deprecated since scikit-learn 1.2
agglom = AgglomerativeClustering(metric="cosine", linkage="complete", n_clusters=1)
agglom.fit(np.asarray(freqs))
from scipy.cluster.hierarchy import dendrogram
def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram
    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1 # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    # The merge order (0, 1, 2, ...) serves as the height of each merge
    distance = np.arange(model.children_.shape[0])
    linkage_matrix = np.column_stack(
        [model.children_, distance, counts]
    ).astype(float)
    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)
    return linkage_matrix
plt.subplots(figsize=(10, 5), dpi=500)
linkage_matrix = plot_dendrogram(agglom, truncate_mode="level", p=94)
plt.show()
The dendrogram visualizes the hierarchy of clusters as a tree: each leaf (i.e., an endpoint on the x-axis) represents a data point, and each vertical line represents a merge of two clusters.
Additionally, the height of a merge (its position on the y-axis) shows the distance between the clusters that were merged; in the plot above, the merge order is used as a stand-in for the actual distance.
These properties make the dendrogram a really powerful tool for exploring your data globally.
Looking at your data that way often reveals interesting findings.
For example, in the dendrogram for our dataset, we see that points 27 and 41 are merged relatively early but then get merged into another cluster pretty late.
Let's look at them in more detail:
articles_dataset.iloc[[27, 41]]
Both are about scientific journals in biology.
There are two simple ways to extract a single fixed clustering from an agglomerative clustering:
Choosing one of these approaches depends on your specific use case.
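Two options that Scikit-Learn's `AgglomerativeClustering` offers for this are fixing the number of clusters (`n_clusters`) or cutting the hierarchy at a distance threshold (`distance_threshold`); whether these are exactly the two ways meant above is an assumption. A minimal sketch on made-up 1D data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical 1D data: two tight pairs and one far-away point
X_demo = np.array([[0.0], [0.1], [1.0], [1.3], [5.0]])

# Option 1: ask for a fixed number of clusters
by_count = AgglomerativeClustering(n_clusters=2, linkage="complete").fit(X_demo)

# Option 2: cut the hierarchy at a distance threshold; clusters whose
# linkage distance is at or above the threshold are not merged
by_threshold = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.5, linkage="complete"
).fit(X_demo)

print(by_count.labels_)
print(by_threshold.labels_, by_threshold.n_clusters_)
```

With a fixed count you control the granularity directly, while a threshold lets the data decide how many clusters remain.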