This week:
seaborn
Next week:
Descriptive statistics is a collection of statistical tools that summarize the quantitative characteristics of your data.
Inferential statistics uses your data to draw conclusions, make forecasts, or predict unseen data points.
The term population refers to the full set of members of whatever you want to measure or collect data on.
For example, if we want to analyze the amount of eroticism in German dime novels, the population would be all German dime novels that were ever published.
Since it is seldom possible to collect all members of a population, we have to select some members of the population.
This selection is called a sample.
To be able to generalize from our sample to the population, the sample must be representative of the population, so we have to select its members in a manner that ensures this.
Often, you assume that by randomly sampling from the population, you create a representative sample.
But be aware that random sampling is not always the best, or even a feasible, way to create a representative sample.
For example, consider the dime novel example from above:
In Germany, dime novels were published for roughly 150 years.
But only contemporary novels have been digitized, so if you just drew random samples from the pool of digitized novels, you would skew your sample towards modern dime novels and would not be able to generalize your findings to earlier works.
Unfortunately, in those cases it can get very tedious to assemble a representative dataset, because it often requires manual effort (e.g., digitizing earlier novels).
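The difference between sampling only from the digitized pool and sampling across all periods can be sketched with pandas. This is a minimal illustration, not a recipe: the `catalogue` DataFrame, its column names (`period`, `digitized`), and all counts are invented, and the stratified draw assumes that the earlier novels are accessible at all.

```python
import pandas as pd

# Hypothetical catalogue: one row per novel. Periods and counts are invented.
catalogue = pd.DataFrame({
    "novel_id": range(1000),
    "period": ["early"] * 300 + ["middle"] * 300 + ["contemporary"] * 400,
    # Assumption for this sketch: only contemporary novels are digitized.
    "digitized": [False] * 600 + [True] * 400,
})

# Naive approach: random sample from the digitized pool only.
digitized_pool = catalogue[catalogue["digitized"]]
naive_sample = digitized_pool.sample(n=100, random_state=0)

# Stratified approach: draw 10% from every period, so each period keeps
# the same share in the sample that it has in the population.
stratified_sample = catalogue.groupby("period").sample(frac=0.1, random_state=0)

print(naive_sample["period"].value_counts())       # contemporary novels only
print(stratified_sample["period"].value_counts())  # all three periods
```

The naive sample can never contain an early novel, no matter how randomly it is drawn, which is exactly the skew described above.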
Parameters are values that describe an entire population.
=> Note: The term parameter has multiple meanings in statistics and data science, so treat this definition as specific to descriptive statistics.
Example:
The mean number of terms that indicate physical affection (kiss, caress, ...) in all German dime novels.
Statistics are values that are computed from a sample.
Example:
The mean number of terms that indicate physical affection (kiss, caress, ...) in a sample of German dime novels.
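To make the distinction between a parameter and a statistic concrete, here is a small sketch with a synthetic population. The term counts are randomly generated (a Poisson distribution with mean 12 is an arbitrary assumption), so the numbers say nothing about real dime novels.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic "population": invented term counts for 10,000 imaginary novels.
population = rng.poisson(lam=12, size=10_000)

# Parameter: computed from every member of the population.
population_mean = population.mean()

# Statistic: computed from a random sample of 200 members only.
sample = rng.choice(population, size=200, replace=False)
sample_mean = sample.mean()

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two values are usually close but not identical. In practice the parameter is unknown, and the statistic serves as its estimate.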
A distribution describes how often specific values (or value ranges) occur within one variable of your data.
Distributions can be divided into certain types (normal, exponential, ...).
The most common type of distribution is the Gaussian or normal distribution.
Data often follows this distribution, at least approximately.
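Counting how many values fall into each value range is what a histogram does. The sketch below draws synthetic values from a normal and an exponential distribution and counts them per bin with `np.histogram`. The sample size and the number of bins are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic draws from two common distribution types.
normal_values = rng.normal(loc=0.0, scale=1.0, size=10_000)
exponential_values = rng.exponential(scale=1.0, size=10_000)

# Count how many values fall into each of 10 equally wide value ranges (bins).
normal_counts, normal_edges = np.histogram(normal_values, bins=10)
exp_counts, exp_edges = np.histogram(exponential_values, bins=10)

print("normal:     ", normal_counts)  # counts are highest in the middle bins
print("exponential:", exp_counts)     # counts are highest in the first bin
```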
Univariate analysis is the statistical analysis of only one set of measurements.
Example: How tall are people?
Multivariate analysis is the joint statistical analysis of two or more sets of measurements.
Example: Is there a relationship between people's height and weight (and gender)?
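A rough sketch of both kinds of analysis on the height and weight example follows. The data is synthetic: the relationship between `height_cm` and `weight_kg` is built in by hand, so the resulting correlation only illustrates the mechanics and is not an empirical finding.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Synthetic measurements: the numbers are invented for illustration only.
height_cm = rng.normal(loc=170, scale=10, size=500)
weight_kg = 0.9 * height_cm - 85 + rng.normal(loc=0, scale=8, size=500)
people = pd.DataFrame({"height_cm": height_cm, "weight_kg": weight_kg})

# One set of measurements: summarize a single variable on its own.
print(people["height_cm"].describe())

# Two sets of measurements: quantify how they vary together
# (Pearson correlation, the default of Series.corr).
height_weight_corr = people["height_cm"].corr(people["weight_kg"])
print(height_weight_corr)
```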
Given a simple dataset consisting of a single set of measurements, how would you start to get a glimpse of the data?
Example dataset: Age at death of deceased inhabitants of Accrington (a town in northwestern England) in 1830.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("aod.csv")
df
| | age_of_death |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
| 7 | 1 |
| 8 | 1 |
| 9 | 1 |
| 10 | 1 |
| 11 | 2 |
| 12 | 3 |
| 13 | 8 |
| 14 | 12 |
| 15 | 13 |
| 16 | 15 |
| 17 | 17 |
| 18 | 20 |
| 19 | 21 |
| 20 | 22 |
| 21 | 22 |
| 22 | 25 |
| 23 | 29 |
| 24 | 32 |
| 25 | 35 |
| 26 | 38 |
| 27 | 39 |
| 28 | 41 |
| 29 | 47 |
| 30 | 48 |
| 31 | 54 |
| 32 | 57 |
| 33 | 63 |
| 34 | 73 |
| 35 | 78 |
| 36 | 80 |
| 37 | 82 |
The mean is the sum of all values divided by the total number of values.
df["age_of_death"].mean()
25.789473684210527
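The definition can be checked by hand. The sketch below retypes the 38 ages from the table above so that it runs without `aod.csv`; the variable names are chosen for this example only.

```python
import pandas as pd

# The 38 ages from the table above, typed in so the block runs on its own.
ages = [0] * 7 + [1] * 4 + [
    2, 3, 8, 12, 13, 15, 17, 20, 21, 22, 22, 25, 29, 32,
    35, 38, 39, 41, 47, 48, 54, 57, 63, 73, 78, 80, 82,
]
age_of_death = pd.Series(ages, name="age_of_death")

total = age_of_death.sum()   # sum of all values
count = len(age_of_death)    # total number of values
manual_mean = total / count  # 980 / 38, about 25.79

print(total, count, manual_mean)
```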
df["age_of_death"].median()
20.5
Question: The dataset records ages in whole years (i.e., no fractions of years), so why is the median $20.5$?
df.shape
(38, 1)
If a dataset contains an even number of values, there is no single middle value. To compute a valid median, the average of the two "middle values" of the sorted data is taken.
df.iloc[len(df)//2-1:len(df)//2+1]
| | age_of_death |
|---|---|
| 18 | 20 |
| 19 | 21 |
df.iloc[len(df)//2-1:len(df)//2+1]["age_of_death"].mean()
20.5
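Note that the slicing above only finds the middle values because the table is already sorted by age. A minimal sketch of the general rule, which sorts first and handles both odd and even counts, could look like this (`median_by_hand` is a name made up for this example; in practice you would use `Series.median()`):

```python
def median_by_hand(values):
    """Return the median of a non-empty collection of numbers."""
    ordered = sorted(values)  # the median is defined on sorted data
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty collection is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd count: there is exactly one middle value.
        return ordered[middle]
    # Even count: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_by_hand([3, 1, 2]))     # 2
print(median_by_hand([4, 1, 3, 2]))  # 2.5
```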
The mode is the most common value in the dataset.
from collections import Counter
Counter(df["age_of_death"]).most_common(1)
[(0, 7)]
df["age_of_death"].mode()
0    0
Name: age_of_death, dtype: int64
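`Series.mode()` returns a Series rather than a single number because several values can share the highest count. A small sketch with invented toy values:

```python
import pandas as pd

# Invented toy values: 1 and 3 both occur twice, so there are two modes.
values = pd.Series([1, 1, 2, 3, 3])
modes = values.mode()

print(modes.tolist())  # [1, 3]
```

In the age-of-death data there is only one such value (0), so the returned Series has a single entry.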
The mean takes all data points into account.
It can produce misleading results, because it is affected by outliers.
The median is barely affected by outliers and should be favored when the data is widely spread or contains outliers.
The mode is by far the least often used measure of central tendency.
It only reflects the single most frequent value and ignores all others, so in general it has to be accompanied by other measures to paint the whole picture.
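The different sensitivity of mean and median to outliers can be seen in a small experiment. The values below are invented toy data, and the last entry of the second series is an artificial outlier.

```python
import pandas as pd

# Invented toy values; 240 in the second series is an artificial outlier.
without_outlier = pd.Series([20, 21, 22, 23, 24])
with_outlier = pd.Series([20, 21, 22, 23, 240])

# A single extreme value pulls the mean far away from the bulk of the data,
# while the median stays where it was.
print(without_outlier.mean(), without_outlier.median())  # 22.0 22.0
print(with_outlier.mean(), with_outlier.median())        # 65.2 22.0
```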