Data Science for Humanities 1¶

Session: Explorative Analysis¶

Part 1: Descriptive Analysis and Visualization¶

Winter term 22/23¶

Prof. Goran Glavaš, Lennart Keller¶

Goals¶

After this session you'll know about:

  • Basic terms of descriptive statistics
  • Measures of central tendency
  • Kurtosis and Skewness
  • Basics of data visualization with seaborn
  • Bivariate and multivariate data analysis
    • Covariance
    • Correlation

Basic terms¶

Descriptive statistics¶

A collection of statistical tools to summarize the quantitative characteristics of your data.

Inductive statistics / Statistical inference¶

Using your data to draw conclusions, make forecasts, or predict unseen data points.

Population¶

The term population refers to the full (and complete) set of members of the group you want to measure or collect data on.

For example, if we want to analyze the amount of eroticism in German dime novels, the population would be all German dime novels that were ever published.

Sample¶

Since it is seldom possible to collect data on all members of a population, we have to select some of its members.

This selection is called a sample.

Representativeness of samples¶

To be able to generalize from our sample to the population, we want our sample to be representative of the population, so we have to draw the individual samples in a manner that ensures this.

Often, you assume that by randomly sampling from the population, you create a representative sample.

But be aware that random sampling is not always the best or a feasible way to create representative samples.

For example, consider the dime novel example from above:

In Germany, dime novels were published for roughly 150 years.

But only contemporary novels are digitized, so if you just drew random samples from the pool of digitized novels, you would skew your sample towards modern dime novels and would not be able to generalize your findings to earlier works.

Unfortunately, in those cases it can become very tedious to assemble a representative dataset, because it often requires manual effort (e.g., digitizing earlier novels).

Parameter¶

Parameters are values that describe characteristics of the entire population.

=> Note: The term parameter has multiple meanings in statistics and data science, so take this notion as very tailored to descriptive statistics.

Example:

The mean number of terms that indicate physical affection (kiss, caress, ...) across all German dime novels.

Statistic¶

Statistics are values that can be inferred based on a sample.

Example:

The mean number of terms that indicate physical affection (kiss, caress, ...) in a sample of German dime novels.

Distribution¶

The frequency of specific values (or value ranges) within one variable of your data.

Distributions can be divided into certain types (normal, exponential, ...).

The most common type of distribution is the Gaussian or normal distribution.

Data often follows this distribution.

Univariate data analysis¶

Statistical analysis of only one set of measurements

Example: How tall are people?

Bivariate or Multivariate data analysis¶

Joint statistical analysis of two or multiple measurements.

Example: Is there a relationship between people's height and weight (and gender)?

Univariate data analysis: Measures of central tendency¶

Given a simple dataset consisting of a single set of measurements.

How would you start to get a glimpse of the data?

Example dataset: Age of deceased inhabitants in Accrington (Town in northwestern England) in 1830.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("aod.csv")

df
Out[1]:
age_of_death
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 1
8 1
9 1
10 1
11 2
12 3
13 8
14 12
15 13
16 15
17 17
18 20
19 21
20 22
21 22
22 25
23 29
24 32
25 35
26 38
27 39
28 41
29 47
30 48
31 54
32 57
33 63
34 73
35 78
36 80
37 82

Arithmetic Mean¶

$$ \frac{1}{N}\sum_{i=1}^{N} x_i $$

Sum of all values divided by the total number of values.

In [2]:
df["age_of_death"].mean()
Out[2]:
25.789473684210527

Median¶

The "middle value" of a dataset.

It separates the greater and lesser halves of the dataset.

In [3]:
df["age_of_death"].median()
Out[3]:
20.5

Question: The dataset records ages in full years (i.e., no fractions of years), so why is the median $20.5$?

In [4]:
df.shape
Out[4]:
(38, 1)

If a dataset contains an even number of values, there is no single middle value. To compute a valid median, the average of the two "middle values" is taken (the .iloc approach below works because the values in the dataset are already sorted in ascending order).

In [5]:
df.iloc[len(df)//2-1:len(df)//2+1]
Out[5]:
age_of_death
18 20
19 21
In [6]:
df.iloc[len(df)//2-1:len(df)//2+1]["age_of_death"].mean()
Out[6]:
20.5

Mode¶

The most common value in the dataset.

In [7]:
from collections import Counter

Counter(df["age_of_death"]).most_common(1)
Out[7]:
[(0, 7)]
In [8]:
df["age_of_death"].mode()
Out[8]:
0    0
Name: age_of_death, dtype: int64

Strengths and weaknesses of the measures of central tendency¶

Mean¶

Takes all data points into account.

Can produce potentially misleading results because it is affected by outliers.

Median¶

The median is not affected by outliers and should be favored when the data is widespread or contains outliers.

Mode¶

The mode is the - by far - least often used measure of central tendency.

It only reflects the single most frequent value and ignores the rest of the data, so in general it has to be accompanied by other measures to paint the whole picture.
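
To illustrate the outlier sensitivity mentioned above, here is a minimal sketch (the toy values are made up purely for illustration):

ages = pd.Series([20, 21, 22, 23, 24])
ages_with_outlier = pd.Series([20, 21, 22, 23, 90])  # one extreme value added

# The mean shifts noticeably because of the single extreme value, while the median stays put.
ages.mean(), ages.median(), ages_with_outlier.mean(), ages_with_outlier.median()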

Visualization of measures of central tendency¶

Since in most cases none of the above measures alone will suffice to describe a dataset, you should interpret them jointly.

One easy way to do this (non-visually) in pandas is by calling the .describe() method.

In [9]:
df["age_of_death"].describe()
Out[9]:
count    38.000000
mean     25.789474
std      25.987687
min       0.000000
25%       1.000000
50%      20.500000
75%      40.500000
max      82.000000
Name: age_of_death, dtype: float64

It gives you a good overview of the central tendency (and spread) of your data, albeit leaving out the mode.

Another way to describe the central tendency of your data visually is to leverage box-plots.

They summarize key measures into one standardized plot type.

To create box-plots, we'll use seaborn, a plotting library specifically tailored to statistical data visualization with Python and pandas.

Very brief introduction into seaborn¶

In essence, seaborn is a convenience wrapper around matplotlib to create plots of data gathered in pandas DataFrames.

Installation¶
pip|conda install seaborn
Import¶
import seaborn as sns
# It is also helpful to keep matplotlib around when working with seaborn
import matplotlib.pyplot as plt
Usage¶

Instead of offering low-level tools to create plots yourself like matplotlib, seaborn offers you a (wide) variety of prebuilt plot types out-of-the-box.

To get an overview of seaborn's capabilities, look into its documentation.

In general, any plot type can be created by calling the respective function and providing three kinds of arguments (the x column, the y column, and the DataFrame that holds the data):

sns.lineplot(x="x_column_name_in_dataframe", y="y_column_in_dataframe", data=df)

As we did with pandas, we'll introduce seaborn's features along the way.

But if we have a little bit of spare time, we'll also take a short peek behind the curtain of seaborn and look into some of its characteristics in detail :)

And now back to statistics: The box-plot

In [10]:
sns.boxplot(y="age_of_death", data=df)
sns.swarmplot(y="age_of_death", data=df)
Out[10]:
<AxesSubplot:ylabel='age_of_death'>

Interpreting a boxplot¶

The boxplot summarizes several views on our data at once.

As you can see, the horizontal lines indicate certain key values of your dataset.

The 25% quartile is the value that is greater than 25% of all your data points (analogously, the median is greater than 50% of them and the 75% quartile greater than 75%).

The box covers 50% of all your data points; its value range (the values where the box "starts" and "ends") shows the interquartile range (IQR).

The interquartile range is another useful measure to quantify the spread of data. It is defined as the difference between the 75% and 25% quartiles and - in layman's terms - shows the range in which the middle 50% of your data lies.

If your data is very skewed and contains large outliers, the whiskers would have to reach very far to indicate the true minimum and maximum values. In those cases, the whiskers are capped and dots are used to show the extreme outliers.

There are numerous methods to decide which values count as "extreme outliers".

Most commonly, values that lie more than 1.5×IQR below the first quartile or above the third quartile are defined as outliers.
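
As a rough sketch of how these bounds could be computed by hand for our dataset (using pandas' quantile method; the 1.5 factor is the conventional choice):

q1 = df["age_of_death"].quantile(0.25)
q3 = df["age_of_death"].quantile(0.75)
iqr = q3 - q1

# Conventional "fences": values outside these bounds would be drawn as dots in the box-plot
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

df[(df["age_of_death"] < lower_bound) | (df["age_of_death"] > upper_bound)]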

In [11]:
tips = sns.load_dataset("tips")
tips
sns.boxplot(y="tip", data=tips)
Out[11]:
<AxesSubplot:ylabel='tip'>
In [12]:
fig, axs = plt.subplots(3, 1)
sns.boxplot(x="tip", ax=axs[0], data=tips)
sns.histplot(x="tip", ax=axs[1], data=tips)
sns.kdeplot(x="tip", ax=axs[2], data=tips)
plt.tight_layout()

Variability¶

Measures of central tendency show us where the majority of our data points are located.

Measures of variability tell us how far apart they are.

You can already get a glimpse of variability by looking at a boxplot!

But there are also standardized ways of measuring it numerically.

In general, variability tells us how much our measurements fluctuate, and it might help us understand how reliably our dataset can be used to make estimates about the population.

Measures of variability¶

By introducing box-plots, we already learned about a measure of variability:

Interquartile range¶

The interquartile range measures how the middle 50% of the data is spread.

So while it does not paint the full picture, it gives you a rough estimate of how your data is spread.

Standard deviation¶

The standard deviation measures the average amount of variability in your data (thereby taking into account all data points).

It is the square root of the normalized sum of squared differences between each data point and the data's mean.

$$ s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \mu)^2} $$

$\mu :=$ Arithmetic average of the data

$N :=$ Number of data points

Note:

While this formula bears some resemblance to the arithmetic mean, there is one conceptual deviation.

Instead of normalizing the sum of squared deviations from the mean by the number of samples, we normalize by $N-1$.

The reason for that is that we typically estimate the standard deviation based on a sample.

Empirically, those estimates tend to be too low, so by decreasing the denominator we artificially enlarge the result and thereby "de-bias" it (this is known as Bessel's correction).

If you compute the standard deviation on a population, you can drop this correction.

In [13]:
df["age_of_death"].std(), df["age_of_death"].std(ddof=0) # ddf=0 computes the population std
Out[13]:
(25.98768720378925, 25.64346422809774)
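
As a sanity check, here is a minimal sketch (not part of the original notebook) that reproduces both values with plain NumPy:

values = df["age_of_death"].to_numpy()
n = len(values)
squared_deviations = (values - values.mean()) ** 2

# Sample standard deviation (divide by n - 1) vs. population version (divide by n)
np.sqrt(squared_deviations.sum() / (n - 1)), np.sqrt(squared_deviations.sum() / n)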

Variance¶

The variance is conceptually strongly related to the standard deviation:

$$ \sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \mu)^2 $$

or even:

$$ \sigma^2 = s^2 $$

The only difference is that the square root is dropped, which makes the variance harder to interpret since it is no longer expressed in the same units as the data.

But it is used in some statistical procedures and tests.

In [14]:
df["age_of_death"].var(), df["age_of_death"].var(ddof=0), 
Out[14]:
(675.3598862019915, 657.5872576177285)
In [15]:
# By taking the square root of the variance, we obtain the standard deviation
np.sqrt(df["age_of_death"].var())
Out[15]:
25.98768720378925

Univariate data analysis: Skewness and Kurtosis¶

But first: Let's look at distributions¶

Before we look into skewness and kurtosis, we have to broaden our scope and look at distributions.

We already learned a way of displaying (certain statistics of) a distribution graphically: the box-plot.

But there are additional ways to display a distribution of values:

In [16]:
sns.histplot(x="tip", data=tips)
Out[16]:
<AxesSubplot:xlabel='tip', ylabel='Count'>

Histograms sort the values of the dataset into bins that represent certain value ranges, count the number of measurements in each bin, and plot those counts as a bar plot.

They are great for visualizing distributions, but their quality highly depends on the number of bins, which is a hyperparameter.

Depending on the number of bins, the plot might be misleading:

In [17]:
sns.histplot(x="tip", data=tips, bins=3)
Out[17]:
<AxesSubplot:xlabel='tip', ylabel='Count'>

In practice, though, you shouldn't worry about that too much, since most plotting libraries have robust routines to infer a suitable number of bins.

Nonetheless, histograms enable you to create a view of your data that might reveal interesting insights that boxplots can't deliver.

To highlight this, let's create some data artificially using a Gaussian (or normal) distribution.

The Gaussian distribution is the most widely used type of distribution since many phenomena are normally distributed.

For example, the height of humans follows a normal distribution.

In general, the normal distribution can be used to describe phenomena where most of the population is close to the mean, and the probability of encountering samples that deviate from the mean decreases exponentially with the amount of deviation.

In [18]:
from scipy.stats import norm

# Create a normal distribution
dist = norm(scale=1.0, loc=0.0)

# Randomly sample values from the distribution
random_values = dist.rvs(size=10_000)

# Create data to visualize the probability density function (pdf)
idc = np.linspace(random_values.min(), random_values.max(), 10_000)
pdf = norm.pdf(idc)
generated_dataset = pd.DataFrame({"rand": random_values, "idc": idc, "pdf": pdf})

# Plot random values and the pdf
fig, axs = plt.subplots(2, 1)
sns.histplot(x="rand", ax=axs[0], data=generated_dataset)
sns.lineplot(x="idc", y="pdf", color="red", ax=axs[1], data=generated_dataset)
Out[18]:
<AxesSubplot:xlabel='idc', ylabel='pdf'>

Here we sampled some random values from an artificial Gaussian distribution, and also computed the probability density function of the distribution.

This function gives us the relative likelihood (density) of drawing a value when we randomly sample from the distribution (=> the area under the curve adds up to 1).

If we compare this curve with the histogram, we see that bins in regions of high probability density contain more values than those in lower-density regions.

So, histograms can reveal a discrete approximation of the shape of the distribution of the data (not only for Gaussian distributions!).

Knowing about the shape of the distribution is important to pick the right tools, so it is an essential step in exploratory data analysis.

seaborn even comes with a plotting function that tries to infer the continuous shape of the underlying distribution:

In [19]:
fig, axs = plt.subplots(3, 1, figsize=(10, 7.5))
titanic = sns.load_dataset("titanic")

fig.suptitle("Age of the passengers of the Titanic")
sns.boxplot(x="age", ax=axs[0], data=titanic)
axs[0].set_title("Boxplot")
sns.histplot(x="age", ax=axs[1], data=titanic)
axs[1].set_title("Histogram")
sns.kdeplot(x="age", ax=axs[2], data=titanic)
axs[2].set_title("Density-estimation plot")
plt.tight_layout()

As you can see here, all three plots offer a slightly different perspective on the data.

We can also see that the age of the passengers seems to follow a sort of Gaussian distribution, but deviates from the "ideal" shape in some ways:

  • It is not symmetric, which makes sense since age can't take on negative values.
  • The distribution seems to be very wide, indicating a large spread of values.

To quantify those properties, we can compute skewness and kurtosis.

Skewness¶

Skewness measures the asymmetry of a probability distribution.

Skewness can be divided into three types:

  • Positive => Tail is on the right
  • Negative => Tail is to the left
  • Zero => Distribution is symmetric => Tails at both sides are roughly equally long.

There is more than one way to compute skewness; usually, the (adjusted) Fisher-Pearson coefficient of skewness is used:

$$ G_1 = \frac{\sqrt{N(N-1)}}{N-2} \frac{\frac{1}{N}\sum_{i=1}^{N}(X_i - \tilde{X})^3}{s^3} $$

$\tilde{X} :=$ mean of the dataset

$s^3 :=$ "standard deviation" to the power of $3$

Let's compute the skewness of the age of the titanic's passengers:

In [20]:
titanic["age"].skew()
Out[20]:
0.38910778230082704

The value is positive, correctly indicating that the tail on the right is longer than the one on the left.
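
If you want to double-check this value, scipy offers a skewness function as well (a minimal sketch, assuming scipy is installed; bias=False applies the same sample-size adjustment pandas uses, and the missing ages have to be dropped first):

from scipy.stats import skew

skew(titanic["age"].dropna(), bias=False)  # should match the pandas result above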

Kurtosis¶

Kurtosis describes the "tailedness" or "peakedness" of a distribution.

A distribution with a low kurtosis value looks flatter and has lighter tails, i.e., less of its probability mass lies in the tails and its peak is more rounded; a distribution with high kurtosis has heavier tails and a sharper peak.

Illustration: https://www.analyticsvidhya.com/blog/2021/05/shape-of-data-skewness-and-kurtosis/

The formula to compute kurtosis is conceptually strongly related to the formula for skewness:

$$ \textrm{Kurtosis} = \frac{\frac{1}{N}\sum_{i=1}^{N}(X_i - \tilde{X})^4}{s^4} $$

Again, let's compute the kurtosis for the age of the passengers:

In [21]:
titanic["age"].kurt()
Out[21]:
0.17827415364210353
In [39]:
x = np.linspace(-1, 1, 1000)
# Deviations raised to the second, third, and fourth power (the powers used in the variance, skewness, and kurtosis formulas)
y2 = x**2
y3 = x**3
y4 = x**4

plot_df = pd.DataFrame({"x": x, "y2": y2, "y3": y3, "y4": y4})
# The fourth power nearly ignores small deviations and strongly amplifies large ones
sns.lineplot(x=x, y=y4)
Out[39]:
<AxesSubplot:>

Excess Kurtosis¶

In contrast to skewness, which is informative by itself, interpreting kurtosis can be more challenging.

To ground kurtosis values, they can be compared against the normal distribution as a point of reference.

To do this, you simply subtract three from the kurtosis value:

$$ \textrm{Excess Kurtosis} = \textrm{Kurtosis} - 3 $$

If the excess kurtosis is:

  • positive, the distribution has heavier tails and a sharper peak than a normal distribution (leptokurtic).
  • negative, the distribution has lighter tails and is flatter than a normal distribution (platykurtic).
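
Here is a minimal sketch of how both variants can be computed with scipy (assuming scipy is installed). Note that, according to the pandas documentation, .kurt() uses Fisher's definition and therefore already reports excess kurtosis, so the value computed above should not have 3 subtracted from it again:

from scipy.stats import kurtosis

age = titanic["age"].dropna()

# fisher=False gives the "plain" kurtosis, fisher=True the excess kurtosis;
# bias=False applies the same sample-size correction pandas uses
kurtosis(age, fisher=False, bias=False), kurtosis(age, fisher=True, bias=False)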

Bi- and multivariate data analysis¶

While univariate measurements give you information about a single variable, more often you'll want to explore potential relationships between multiple variables.

This type of analysis is called bivariate analysis if you investigate two variables at once, or multivariate analysis if you search for relationships among more than two variables.

If you follow the standard procedure of exploratory data analysis, then you would first analyze each variable on its own, and then proceed to investigate possible relationships between them.

Covariance¶

The most basic way to check if two variables are potentially related is to ask whether a change in one variable implies that the other variable changes too.

This relationship can come in two flavors: if one variable increases, the other one might either also increase, or it might decrease.

If one variable changes while the other one doesn't, then there is possibly no relation at all.

Covariance is a simple way to measure such relations:

$$ \textrm{Cov}(X, Y) = \frac{1}{N}\sum_{i=1}^{N} (x_i - \tilde{X})(y_i - \tilde{Y}) $$

$\tilde{X}, \tilde{Y} :=$ mean of $X$ and $Y$

To see covariance in action let's generate some artificial data:

In [44]:
x = np.linspace(0, 1, 500)
y = -x
w = (np.random.randn(500)*.15) + x
u = np.tanh(np.sin(x**20))
z = np.random.rand(500)
o = np.linspace(3, 4, 500)

cov_df = pd.DataFrame(dict(x=x, y=y, w=w,u=u, z=z, o=o))

Covariance can often also be observed visually.

A great way to compare each pair of variables in a dataset against each other is seaborn's pairplot:

In [45]:
sns.pairplot(data=cov_df)
Out[45]:
<seaborn.axisgrid.PairGrid at 0x16ae54ca0>

We can see that there is mostly a clear relationship between those variables (except for $z$).

Now let's compute the covariance matrix between all variables:

In [46]:
cov_mat = cov_df.cov()
sns.heatmap(data=cov_mat, annot=True)
Out[46]:
<AxesSubplot:>

By looking at the matrix, we can see that "our assumption" holds true for the variables $x$, $w$, $y$, and $z$.

Interestingly, although there is a clear relationship between $u$ and $x, w, y$ the covariance is close to $0$ for those pairs.

This is because covariance is only suitable for measuring linear relationships (i.e., for any amount that $x$ varies, $y$ varies by the same amount times a fixed factor).
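
A minimal sketch (not part of the original data) makes this limitation explicit: a variable can be fully determined by another one and still have a covariance close to zero if the relationship is not linear.

rng = np.random.default_rng(42)
a = rng.uniform(-1, 1, 10_000)
b = a ** 2  # b is completely determined by a, but the relationship is not linear

np.cov(a, b)[0, 1]  # close to 0, despite the perfect (non-linear) dependence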

Correlation¶

If you look at the covariance map from above, you can see that the amount of covariance between the features varies quite a lot.

Especially if you look along the diagonal, which describes the covariance of each feature with itself, you can see that even though each feature is perfectly related to itself, the actual value is determined by the magnitude of the feature, making comparisons hard at times.

To alleviate this issue, you can use an alternative (some might say extended) measure to describe the relation between two variables: Correlation

There are many ways to measure correlation; the most commonly used measure is the Pearson Correlation Coefficient:

$$ Corr_{Pearson}(X, Y) = \frac{Cov(X, Y)}{s_X s_Y} $$

with

$s_X, s_Y :=$ standard deviation of $X$ and $Y$

Strictly speaking, the Pearson Correlation Coefficient is the standardized form of the covariance and therefore inherits most of the covariance's properties, like only being able to account for linear relationships.
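
To make this relationship concrete, here is a minimal sketch (reusing cov_df from above; the choice of columns is arbitrary) that rescales the covariance of a single pair of variables by hand:

cov_xw = cov_df["x"].cov(cov_df["w"])
corr_xw = cov_xw / (cov_df["x"].std() * cov_df["w"].std())

corr_xw, cov_df["x"].corr(cov_df["w"])  # both values should match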

Let's compute the correlation with Pandas:

In [47]:
corr_mat = cov_df.corr(method="pearson")
sns.heatmap(data=corr_mat, annot=True)
Out[47]:
<AxesSubplot:>

A first glance at the diagonal of the correlation matrix reveals that the correlation of a variable with itself is always exactly 1; in general, correlation values are bounded between -1 and 1.

By looking at the correlation coefficients for $x$ and $y$, we also observe that correlation can take on negative values if the linear relationship between two variables is negative.

Note:

As you can see in the code above, we explicitly told pandas to compute the Pearson correlation. This is because the .corr method also supports two types of rank correlation coefficients (Spearman and Kendall).

These measures do not compare the actual value changes within your data; instead, they rank the data and then analyze how the ranks of the data points are correlated.

Nonetheless, by default pandas computes the Pearson correlation, so stating it explicitly isn't strictly necessary.
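
As a small illustrative sketch (reusing cov_df from above), rank correlations can capture monotone but non-linear relationships that the Pearson coefficient underestimates:

# u rises monotonically with x, but not linearly
cov_df["x"].corr(cov_df["u"], method="pearson"), cov_df["x"].corr(cov_df["u"], method="spearman")

The Spearman coefficient should be (almost) exactly 1 here, while the Pearson coefficient is noticeably lower.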

Correlation vs. causation¶

As we've all learned through popular media during the pandemic, just because two variables correlate it doesn't mean that you can safely assume there is a direct causal reason for that.

Strictly speaking, correlation is a necessary but not sufficient condition for causation!

Consider this example taken from Richard McElreath's book "Statistical Rethinking".

It shows the divorce rate in all US states and the number of Waffle Houses in those states:

In [26]:
waffles = pd.read_csv("WaffleDivorce.csv", sep=";")
waffles.head()
Out[26]:
Location Loc Population MedianAgeMarriage Marriage Marriage SE Divorce Divorce SE WaffleHouses South Slaves1860 Population1860 PropSlaves1860
0 Alabama AL 4.78 25.3 20.2 1.27 12.7 0.79 128 1 435080 964201 0.45
1 Alaska AK 0.71 25.2 26.0 2.93 12.5 2.05 0 0 0 0 0.00
2 Arizona AZ 6.33 25.8 20.3 0.98 10.8 0.74 18 0 0 0 0.00
3 Arkansas AR 2.92 24.3 26.4 1.70 13.5 1.22 41 1 111115 435450 0.26
4 California CA 37.25 26.8 19.1 0.39 8.0 0.24 0 0 0 379994 0.00
In [27]:
sns.regplot(x="WaffleHouses", y="Divorce", data=waffles)
Out[27]:
<AxesSubplot:xlabel='WaffleHouses', ylabel='Divorce'>
In [28]:
sns.heatmap(data=waffles[["Divorce", "WaffleHouses"]].corr(), annot=True)
Out[28]:
<AxesSubplot:>

While a Pearson correlation of $0.25$ is no sign of a strong relationship, this value suggests at least a weak correlation.

But, obviously, there is no good direct explanation that can account for this weird relationship.

A more intriguing cause can be found by looking at the median age of marriage:

In [29]:
sns.regplot(x="MedianAgeMarriage", y="Divorce", data=waffles)
Out[29]:
<AxesSubplot:xlabel='MedianAgeMarriage', ylabel='Divorce'>
In [30]:
sns.heatmap(data=waffles[["MedianAgeMarriage", "Divorce"]].corr(), annot=True)
Out[30]:
<AxesSubplot:>

Not only do these values correlate much more strongly, they also offer a better explanation for the divorce rate and can shed light on the weird Waffle House riddle.

If we look at the states and their median age of marriage:

In [49]:
waffles.sort_values(by="MedianAgeMarriage")[["Location", "MedianAgeMarriage"]].head(10)
Out[49]:
Location MedianAgeMarriage
12 Idaho 23.2
43 Utah 23.3
49 Wyoming 24.2
3 Arkansas 24.3
35 Oklahoma 24.4
17 Kentucky 24.8
47 West Virginia 25.0
16 Kansas 25.0
1 Alaska 25.2
42 Texas 25.2

We see that people in the more rural southern states tend to get married at young ages.

Without further data, it would be speculation what drives them to get married so early. (Possibly more conservative religious beliefs and values...)

But it's safe to assume that marriages of older people who might have spent some time in a relationship before are generally more stable.

Furthermore, to explain the correlation between divorce rate and Waffle Houses, we have to know that the Waffle House chain originated in Georgia and is predominantly found in the southern US.

So the correlation between the divorce rate and the number of Waffle Houses per state is just a coincidence driven by external factors (that our data does not completely reflect).

So stay aware of spurious correlations and always be critical of your findings!