After this session you'll know about:
seaborn
A collection of statistical tools to summarize the quantitative characteristics of your data.
Use your data to draw conclusions, make forecasts, or predict unseen data points.
The term population refers to the full and complete set of members of the thing you want to measure or collect data on.
For example, if we want to analyze the amount of eroticism within German dime novels, the population would be all German dime novels that were ever published.
Since it is seldom possible to collect all members of a population, we have to select some members of the population.
This selection is called a sample.
To be able to generalize from our sample to the population, we want our sample to be representative of the population, so we have to draw the individual samples in a manner that ensures this.
Often, you assume that by randomly sampling from the population, you create a representative sample.
But be aware that random sampling is not always the best (or even a feasible) way to create representative samples.
For example, consider the dime novel example from above:
In Germany, dime novels were published for roughly 150 years.
But only contemporary novels are digitized, so if you just drew random samples from the pool of digitized novels, you would skew your sample towards modern dime novels and would not be able to generalize your findings to earlier works.
Unfortunately, in those cases it can get very tedious to assemble a representative dataset, because it often requires manual effort (e.g., digitizing earlier novels).
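If a stratifying variable is available, one way to counteract such a skew is stratified sampling: draw a fixed number of items from each stratum instead of sampling purely at random. Here is a minimal sketch with pandas; the novels DataFrame and its decade column are made up purely for illustration:
import pandas as pd

# Hypothetical toy data: titles with their publication decade
novels = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "decade": [1880, 1880, 1950, 1950, 2010, 2010],
})

# Draw the same number of novels per decade (stratified sampling),
# instead of sampling purely at random across all decades
stratified_sample = novels.groupby("decade").sample(n=1, random_state=42)
stratified_sample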
Parameters are values that are computed from (and describe) a population.
=> Note: The term parameter has multiple meanings in statistics and data science, so take this notion as being tailored specifically to descriptive statistics.
Example:
The mean number of terms that indicate physical affection (kiss, caress, ...) in all German dime novels.
Statistics are values that can be inferred based on a sample.
Example:
The mean number of terms that indicate physical affection (kiss, caress, ...) in a sample of German dime novels.
The quantity of specific values (or value ranges) within one variable of your data.
Distributions can be divided into certain types (normal, exponential, ...).
The most common type of distribution is the Gaussian or normal distribution.
Data often follows this distribution.
Statistical analysis of only one set of measurements
Example: How tall are people?
Joint statistical analysis of two or multiple measurements.
Example: Is there a relationship between people's height and weight (and gender)?
Given a simple dataset consisting of a single set of measurements.
How would you start to get a glimpse of the data?
Example dataset: Age of deceased inhabitants of Accrington (a town in northwestern England) in 1830.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("aod.csv")
df
 | age_of_death |
---|---|
0 | 0 |
1 | 0 |
2 | 0 |
3 | 0 |
4 | 0 |
5 | 0 |
6 | 0 |
7 | 1 |
8 | 1 |
9 | 1 |
10 | 1 |
11 | 2 |
12 | 3 |
13 | 8 |
14 | 12 |
15 | 13 |
16 | 15 |
17 | 17 |
18 | 20 |
19 | 21 |
20 | 22 |
21 | 22 |
22 | 25 |
23 | 29 |
24 | 32 |
25 | 35 |
26 | 38 |
27 | 39 |
28 | 41 |
29 | 47 |
30 | 48 |
31 | 54 |
32 | 57 |
33 | 63 |
34 | 73 |
35 | 78 |
36 | 80 |
37 | 82 |
Sum of all values divided by the total number of values.
df["age_of_death"].mean()
25.789473684210527
df["age_of_death"].median()
20.5
Question: The dataset records ages in whole years (i.e., no fractions of years), so why is the median $20.5$?
df.shape
(38, 1)
If a dataset contains an even number of values, there is no single middle value. To compute a valid median, the average of the two "middle values" is taken.
# This works here because the column is already sorted by age
df.iloc[len(df)//2-1:len(df)//2+1]
 | age_of_death |
---|---|
18 | 20 |
19 | 21 |
df.iloc[len(df)//2-1:len(df)//2+1]["age_of_death"].mean()
20.5
The most common value in the dataset.
from collections import Counter
Counter(df["age_of_death"]).most_common(1)
[(0, 7)]
df["age_of_death"].mode()
0    0
Name: age_of_death, dtype: int64
Takes all data points into account.
Can produce potentially misleading results because it is affected by outliers.
The median is not affected by outliers and should be favored when the data is widespread or contains outliers.
The mode is - by far - the least often used measure of central tendency.
It only reflects the single most frequent value, so in general it has to be accompanied by other measures to paint the whole picture.
Since in most cases none of the above values will suffice on its own to describe a dataset, you should interpret them jointly.
One easy way to do this (non-visually) in pandas is by calling the .describe-method.
df["age_of_death"].describe()
count    38.000000
mean     25.789474
std      25.987687
min       0.000000
25%       1.000000
50%      20.500000
75%      40.500000
max      82.000000
Name: age_of_death, dtype: float64
It gives you a good overview of the central tendency of your data, albeit leaving out the mode.
Another way to describe the central tendency of your data visually is to use box-plots.
They summarize key measures into one standardized plot type.
To create box-plots, we'll use seaborn, a plotting library specifically tailored to statistical data visualization with Python and pandas.
seaborn
In essence, seaborn is a convenience wrapper around matplotlib to create plots of data gathered in pandas DataFrames.
pip|conda install seaborn
import seaborn as sns
# It is also helpful to keep matplotlib around when working with seaborn
import matplotlib.pyplot as plt
Instead of offering tools to create plots yourself like matplotlib does, seaborn offers you a (wide) variety of prebuilt plot types out of the box.
To get an overview of seaborn's capabilities, look into its documentation.
In general, any plot type can be created by calling the respective function and providing three types of arguments:
sns.lineplot(x="x_column_name_in_dataframe", y="y_column_in_dataframe", data=df)
As we did with pandas, we'll introduce seaborn's features along the way.
But if we have a little bit of spare time, we'll also take a short peek behind the curtain of seaborn and look into some of its characteristics in detail :)
And now back to statistics: The box-plot
sns.boxplot(y="age_of_death", data=df)
sns.swarmplot(y="age_of_death", data=df)
<AxesSubplot:ylabel='age_of_death'>
The boxplot summarizes several views on our data at once.
As you can see, the horizontal lines indicate certain values within your dataset.
The 25%-quartile is the value within your data that is greater than 25% of all your data points (the median and the 75%-quartile are defined analogously).
The box covers 50% of all your data points; its value range (the values where the box "starts" and "ends") shows the interquartile range (IQR).
The interquartile range is another useful measure to quantify the spread of data. It is defined as the difference between the 75%- and the 25%-quartile and - in layman's terms - shows the range in which the middle 50% of your data lies.
If your data is very skewed and contains large outliers, the whiskers would reach too far to indicate meaningful minimum and maximum values. In those cases, dots are used to show the extreme outliers.
There are numerous methods to classify which values are "extreme outliers".
Most commonly, values that lie more than 1.5x IQR above the third quartile or below the first quartile are defined as outliers, as demonstrated in the sketch below.
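Here is a minimal sketch of this rule applied to the age_of_death data from above; it computes the IQR and the resulting outlier bounds by hand:
# Quartiles and interquartile range of the age_of_death column
q1, q3 = df["age_of_death"].quantile([0.25, 0.75])
iqr = q3 - q1

# 1.5x-IQR rule: everything outside these bounds counts as an extreme outlier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Such values would be drawn as individual dots in the boxplot
outliers = df[(df["age_of_death"] < lower_bound) | (df["age_of_death"] > upper_bound)]
lower_bound, upper_bound, len(outliers)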
tips = sns.load_dataset("tips")
tips
sns.boxplot(y="tip", data=tips)
<AxesSubplot:ylabel='tip'>
fig, axs = plt.subplots(3, 1)
sns.boxplot(x="tip", ax=axs[0], data=tips)
sns.histplot(x="tip", ax=axs[1], data=tips)
sns.kdeplot(x="tip", ax=axs[2], data=tips)
plt.tight_layout()
Measures of central tendency show us where the majority of our data points are located.
Measures of variability tell us how far apart they are.
You can already get a glimpse of variability by looking at a boxplot!
But there are also standardized ways of measuring it numerically.
In general, variability tells us how much our measurements fluctuate, and it might help us understand how reliably our dataset can be used to make estimates about the population.
By introducing box-plots, we already learned about a measure of variability:
The interquartile range measures how the middle 50% of the data is spread.
So while not painting the full picture, it gives you a rough estimate of how your data is spread.
The standard deviation measures the average amount of variability in your data (thereby taking all data points into account).
It is the square root of the normalized sum of squared differences between each data point and the data's mean.
$$ s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \mu)^2} $$
$\mu :=$ arithmetic mean of the data
$N :=$ number of data points
Note:
While this formula bears some resemblance to the arithmetic average, there is one conceptual deviation:
instead of normalizing the sum of squared deviations from the mean by the number of samples $N$, we normalize by $N-1$.
The reason for that is that we typically estimate the standard deviation based on a sample.
Empirically, those estimates tend to be too low, so by decreasing the denominator we enlarge the result slightly and thereby "de-bias" it.
If you compute the standard deviation on a population, you can drop this correction.
df["age_of_death"].std(), df["age_of_death"].std(ddof=0) # ddf=0 computes the population std
(25.98768720378925, 25.64346422809774)
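For reference, here is a minimal sketch that computes both variants "by hand" directly from the formula; it should reproduce the values above:
# Sample standard deviation (with Bessel's correction) vs. population standard deviation
ages = df["age_of_death"]
n = len(ages)

squared_deviations = (ages - ages.mean()) ** 2
sample_std = np.sqrt(squared_deviations.sum() / (n - 1))  # normalize by N - 1
population_std = np.sqrt(squared_deviations.sum() / n)    # normalize by N

sample_std, population_std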
The variance is conceptually strongly related to the standard deviation:
$$ s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \mu)^2 $$
In other words, the variance is simply the square of the standard deviation. The only difference is that the square root is dropped, thereby making it harder to interpret, since there is no clear correspondence to the actual values (and their units) anymore.
But it is used in some statistical procedures and tests.
df["age_of_death"].var(), df["age_of_death"].var(ddof=0),
(675.3598862019915, 657.5872576177285)
# By taking the square root of the variance, we obtain the standard deviation
np.sqrt(df["age_of_death"].var())
25.98768720378925
Before we look into skewness and kurtosis, we have to broaden our scope and look at distributions.
We already learned a way of displaying (certain stats) of a distribution graphically: The box-plot
But there are additional ways to display a distribution of values:
sns.histplot(x="tip", data=tips)
<AxesSubplot:xlabel='tip', ylabel='Count'>
Histograms sort the values of the dataset into bins that represent certain value ranges, count the number of measurements in each bin, and plot those counts as a bar plot.
They are great for visualizing distributions, but their quality highly depends on the number of bins, which is a hyperparameter.
Depending on the number of bins, the plot might be misleading:
sns.histplot(x="tip", data=tips, bins=3)
<AxesSubplot:xlabel='tip', ylabel='Count'>
In practice, though, you shouldn't worry about that too much, since most plotting libraries have robust routines to infer a suitable number of bins.
Nonetheless, histograms enable you to create a view of your data that might reveal interesting insights that boxplots can't deliver.
To highlight this, let's create some artificial data using a Gaussian (or normal) distribution.
The Gaussian distribution is the most widely used type of distribution since many phenomena are normally distributed.
For example, the height of humans follows a normal distribution.
In general, the normal distribution can be used to describe phenomena where most of the population is close to the mean, and the probability of encountering samples that deviate from the mean decreases exponentially with the amount of deviation.
from scipy.stats import norm
# Create a normal distribution
dist = norm(scale=1.0, loc=0.0)
# Randomly sample values from the distribution
random_values = dist.rvs(size=10_000)
# Create data to visualize the probability density function (pdf)
idc = np.linspace(random_values.min(), random_values.max(), 10_000)
pdf = norm.pdf(idc)
generated_dataset = pd.DataFrame({"rand": random_values, "idc": idc, "pdf": pdf})
# Plot random values and the pdf
fig, axs = plt.subplots(2, 1)
sns.histplot(x="rand", ax=axs[0], data=generated_dataset)
sns.lineplot(x="idc", y="pdf", color="red", data=generated_dataset)
<AxesSubplot:xlabel='idc', ylabel='pdf'>
Here we sampled some random values from an artificial Gaussian distribution, and also computed the probability density function of the distribution.
This function gives us the relative likelihood (probability density) of drawing a value when randomly sampling from the distribution (=> the area under the curve adds up to 1).
If we compare this curve with the histogram, we see that bins in regions of high probability density contain more values than those in lower-density regions.
So, histograms can reveal a discrete approximation of the shape of the distribution of the data (not only for Gaussian distributions!).
Knowing about the shape of the distribution is important to pick the right tools, so it is an essential step in exploratory data analysis.
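As a small aside, one way to compare the histogram and the pdf directly is to normalize the histogram to a density, so that both areas integrate to 1. Here is a sketch reusing the generated_dataset from above:
# Overlay the density-normalized histogram and the pdf on a single axes
fig, ax = plt.subplots()
sns.histplot(x="rand", stat="density", ax=ax, data=generated_dataset)
sns.lineplot(x="idc", y="pdf", color="red", ax=ax, data=generated_dataset)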
seaborn even comes with a plotting function that tries to infer the continuous shape of the underlying distribution:
fig, axs = plt.subplots(3, 1, figsize=(10, 7.5))
titanic = sns.load_dataset("titanic")
fig.suptitle("Age of the passangers of the titanic")
sns.boxplot(x="age", ax=axs[0], data=titanic)
axs[0].set_title("Boxplot")
sns.histplot(x="age", ax=axs[1], data=titanic)
axs[1].set_title("Histogram")
sns.kdeplot(x="age", ax=axs[2], data=titanic)
axs[2].set_title("Density-estimation plot")
plt.tight_layout()
As you can see here, all three of these plots offer a slightly different perspective on the data.
We can also see that the amount of tips in the tips dataset seems to follow a sort of Gaussian distribution, but deviates from the "ideal" form in some ways:
To quantify those properties, we can compute skewness and kurtosis.
Skewness measures the asymmetry of a probability distribution.
Skewness can be divided into three types: positive skew (longer right tail), negative skew (longer left tail), and zero skew (symmetric).
There is more than one way to compute skewness; normally, the Fisher-Pearson coefficient of skewness is used:
$$ G_1 = \frac{\sqrt{N(N-1)}}{N-2} \cdot \frac{\frac{1}{N}\sum_{i=1}^{N}(X_i - \tilde{X})^3}{s^3} $$
$\tilde{X} :=$ mean of the dataset
$s^3 :=$ standard deviation to the power of $3$
Let's compute the skewness of the age of the titanic's passengers:
titanic["age"].skew()
0.38910778230082704
The value is positive, correctly indicating that the right tail is longer than the left one.
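For reference, here is a minimal sketch that computes the bias-adjusted Fisher-Pearson coefficient by hand; assuming pandas' .skew() uses the same bias-adjusted estimator, the result should closely match the value above:
# Skewness of the passengers' age, computed from the central moments
ages = titanic["age"].dropna()
n = len(ages)

m2 = ((ages - ages.mean()) ** 2).mean()   # second central moment
m3 = ((ages - ages.mean()) ** 3).mean()   # third central moment

g1 = m3 / m2 ** 1.5                       # (biased) skewness
G1 = np.sqrt(n * (n - 1)) / (n - 2) * g1  # bias-adjusted version
G1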
Kurtosis describes the "tailedness" or "peakedness" of a distribution.
A distribution with a low kurtosis looks more flattened: its peak is more rounded and relatively little of its probability mass sits far out in the tails. A high kurtosis, in contrast, indicates a sharper peak and heavier tails.
https://www.analyticsvidhya.com/blog/2021/05/shape-of-data-skewness-and-kurtosis/
The formula to compute kurtosis is conceptually strongly related to the formula for skewness:
$$ \textrm{Kurtosis} = \frac{\frac{1}{N}\sum_{i=1}^{N}(X_i - \tilde{X})^4}{s^4} $$
Again, let's compute the kurtosis for the age of the passengers:
titanic["age"].kurt()
0.17827415364210353
# Side note: the fourth power used in the kurtosis formula suppresses small
# deviations and strongly amplifies large ones, which is why kurtosis is
# dominated by the tails of a distribution.
x = np.linspace(-1, 1, 1000)
y2 = x**2
y3 = x**3
y4 = x**4
plot_df = pd.DataFrame({"x": x, "y2": y2, "y3": y3, "y4": y4})  # collect the curves (only y4 is plotted below)
sns.lineplot(x=x, y=y4)
<AxesSubplot:>
In contrast to skewness, which is informative by itself, interpreting kurtosis can be more challenging.
To ground kurtosis values, they can be compared against the standard normal distribution as a point of reference.
To do this, you simply subtract three from the actual kurtosis value:
$$ \textrm{Excess Kurtosis} = \textrm{Kurtosis} - 3 $$
If the excess kurtosis is positive, the distribution has heavier tails than a normal distribution; if it is negative, the tails are lighter; if it is (close to) zero, the tailedness resembles that of a normal distribution.
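Note that pandas' .kurt() already reports the excess (Fisher) kurtosis, which is why the value computed above is already close to 0. As a small sketch, scipy lets you compute both flavors explicitly (scipy's default estimator differs slightly from pandas' bias-corrected one, so the numbers will be close but not identical):
from scipy.stats import kurtosis

ages = titanic["age"].dropna()
# fisher=False: "raw" kurtosis (approximately 3 for a normal distribution)
# fisher=True:  excess kurtosis (approximately 0 for a normal distribution)
kurtosis(ages, fisher=False), kurtosis(ages, fisher=True)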
While univariate measurements give you information about a single variable, more often you'll want to explore potential relationships between multiple variables.
This type of analysis is called bivariate analysis if you investigate two variables at once, or multivariate analysis if you search for relationships among more than two variables.
If you follow the standard procedure of exploratory data analysis, then you would first analyze each variable on its own, and then proceed to investigate possible relationships between them.
The most basic way to check if two variables are potentially related is to ask whether a change in the value of one variable implies that the other variable changes too.
This relationship in change can come in two flavors: if one value increases, the other one might either also increase, or it might decrease.
If one variable changes while the other one doesn't, then there is possibly no relation at all.
Covariance is a simple way to measure those monotone relations:
$$ \textrm{Cov}(X, Y) = \frac{1}{N}\sum_{i=1}^{N} (x_i - \tilde{X})(y_i - \tilde{Y}) $$
$\tilde{X}, \tilde{Y} :=$ means of $X$ and $Y$
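The formula boils down to a few lines of NumPy. Here is a minimal, self-contained sketch using the 1/N normalization from the formula above (note that pandas' .cov() divides by N-1 instead):
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])  # perfectly linearly related to a

# Covariance: the average product of the deviations from the means
cov_ab = ((a - a.mean()) * (b - b.mean())).mean()
cov_ab, np.cov(a, b, ddof=0)[0, 1]  # np.cov with ddof=0 uses the same normalization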
To see covariance in action, let's generate some artificial data:
x = np.linspace(0, 1, 500)
y = -x                              # perfect negative linear relation to x
w = (np.random.randn(500)*.15) + x  # noisy positive linear relation to x
u = np.tanh(np.sin(x**20))          # nonlinear (but monotone) function of x
z = np.random.rand(500)             # pure noise, unrelated to x
o = np.linspace(3, 4, 500)          # linear in x, but on a different scale
cov_df = pd.DataFrame(dict(x=x, y=y, w=w, u=u, z=z, o=o))
Covariance can often also be observed visually.
A great way to compare each pair of variables in a dataset against each other is seaborn's pairplot:
sns.pairplot(data=cov_df)
<seaborn.axisgrid.PairGrid at 0x16ae54ca0>
You can see that there is a clear relationship between most of those variables (except $z$).
Now let's compute the covariance matrix between all variables:
cov_mat = cov_df.cov()
sns.heatmap(data=cov_mat, annot=True)
<AxesSubplot:>
By looking at the matrix, we can see that "our assumption" holds true for the variables $x, w, y$ and $z$.
Interestingly, although there is a clear relationship between $u$ and $x, w, y$ the covariance is close to $0$ for those pairs.
This is because covariance is only suitable for measuring linear relationships (i.e., for any amount that $x$ varies, $y$ varies by the same amount times a fixed factor).
If you look at the covariance map from above, you can see that the amount of covariance between the features varies quite a lot.
Especially if you look along the diagonal, which describes the covariance of each feature with itself: although every feature trivially "co-varies" perfectly with itself, the actual value on the diagonal is determined by the magnitude of the feature, which makes comparisons hard at times.
To alleviate this issue, you can use an alternative (some might say extended) measure to describe the relation between two variables: Correlation
There are many ways to measure correlation; the most commonly used one is the Pearson correlation coefficient:
$$ \textrm{Corr}_{\textrm{Pearson}}(X, Y) = \frac{\textrm{Cov}(X, Y)}{s_X s_Y} $$
with
$s_X, s_Y :=$ standard deviations of $X$ and $Y$
Strictly speaking, the Pearson correlation coefficient is the standardized form of the covariance, therefore inheriting most of the properties of the covariance, like only being able to account for linear types of correlation.
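As a quick sanity check of the formula, here is a sketch that computes the coefficient manually for two columns of the artificial dataset from above and compares it to pandas' built-in Series.corr():
# Pearson correlation = covariance divided by the product of the standard deviations
manual_corr = cov_df["x"].cov(cov_df["w"]) / (cov_df["x"].std() * cov_df["w"].std())
manual_corr, cov_df["x"].corr(cov_df["w"])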
Let's compute the correlation with Pandas:
corr_mat = cov_df.corr(method="pearson")
sns.heatmap(data=corr_mat, annot=True)
<AxesSubplot:>
A first glance at the diagonal of the correlation matrix reveals that the correlation is bounded: it reaches its maximum value of $1$ for each variable paired with itself.
By looking at the correlation coefficients for $x$ and $y$, we also observe that correlation can take on negative values if the linear relationship between two variables is negative.
Note:
As you can see in the code above, we explicitly told pandas to compute the Pearson correlation. This is because the .corr-method also supports two types of rank correlation coefficients (Spearman and Kendall).
These measures do not compare the actual value changes within your data, but rather sort the data and then analyze how the ranks of each data point in the sorted lists are correlated (see the sketch below).
Nonetheless, by default, pandas computes the Pearson correlation, so stating it explicitly isn't strictly necessary.
As we've all learned through popular media during the pandemic, just because two variables correlate, it doesn't mean that you can safely assume there is a direct causal reason for that.
Strictly speaking, correlation is a necessary but not sufficient condition for causation!
Consider this example, taken from Richard McElreath's book "Statistical Rethinking".
It shows the divorce rate in all US states and the number of Waffle Houses in those states:
waffles = pd.read_csv("WaffleDivorce.csv", sep=";")
waffles.head()
 | Location | Loc | Population | MedianAgeMarriage | Marriage | Marriage SE | Divorce | Divorce SE | WaffleHouses | South | Slaves1860 | Population1860 | PropSlaves1860 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | AL | 4.78 | 25.3 | 20.2 | 1.27 | 12.7 | 0.79 | 128 | 1 | 435080 | 964201 | 0.45 |
1 | Alaska | AK | 0.71 | 25.2 | 26.0 | 2.93 | 12.5 | 2.05 | 0 | 0 | 0 | 0 | 0.00 |
2 | Arizona | AZ | 6.33 | 25.8 | 20.3 | 0.98 | 10.8 | 0.74 | 18 | 0 | 0 | 0 | 0.00 |
3 | Arkansas | AR | 2.92 | 24.3 | 26.4 | 1.70 | 13.5 | 1.22 | 41 | 1 | 111115 | 435450 | 0.26 |
4 | California | CA | 37.25 | 26.8 | 19.1 | 0.39 | 8.0 | 0.24 | 0 | 0 | 0 | 379994 | 0.00 |
sns.regplot(x="WaffleHouses", y="Divorce", data=waffles)
<AxesSubplot:xlabel='WaffleHouses', ylabel='Divorce'>
sns.heatmap(data=waffles[["Divorce", "WaffleHouses"]].corr(), annot=True)
<AxesSubplot:>
While a Pearson correlation of $0.25$ is no sign of a strong correlation, this value suggests at least a weak one.
But, obviously, there is no good direct explanation that can account for this weird relationship.
A more intriguing cause can be found by looking at the median age of marriage:
sns.regplot(x="MedianAgeMarriage", y="Divorce", data=waffles)
<AxesSubplot:xlabel='MedianAgeMarriage', ylabel='Divorce'>
sns.heatmap(data=waffles[["MedianAgeMarriage", "Divorce"]].corr(), annot=True)
<AxesSubplot:>
Not only do these values correlate much more strongly, they also offer a better explanation for the divorce rate and can shed light on the weird Waffle House riddle.
If we look at the states with the lowest median age of marriage:
waffles.sort_values(by="MedianAgeMarriage")[["Location", "MedianAgeMarriage"]].head(10)
 | Location | MedianAgeMarriage |
---|---|---|
12 | Idaho | 23.2 |
43 | Utah | 23.3 |
49 | Wyoming | 24.2 |
3 | Arkansas | 24.3 |
35 | Oklahoma | 24.4 |
17 | Kentucky | 24.8 |
47 | West Virginia | 25.0 |
16 | Kansas | 25.0 |
1 | Alaska | 25.2 |
42 | Texas | 25.2 |
We see that people in the more rural and southern states tend to get married at younger ages.
Without further data, it would be speculation what drives them to get married so early. (Possibly more conservative religious beliefs and values...)
But it's safe to assume that marriages of older people who might have spent some time in a relationship before are generally more stable.
Furthermore, to explain the correlation between the divorce rate and Waffle Houses, we have to know that the Waffle House brand originated in Georgia and is predominantly found in the southern US.
So the correlation between the divorce rate and the number of Waffle Houses per state is just a coincidence driven by external factors (which our data does not completely reflect).
So stay aware of spurious correlations and always be critical of your findings!