Data Science for Humanities 2

Session: Python's Data Science Stack

Part 1: Seaborn

Summer term 25

Prof. Goran Glavaš, Lennart Keller

Seaborn

Objective: Statistical visualizations.

High-level API built on top of Matplotlib, with tight pandas integration.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
In [2]:
df = sns.load_dataset("penguins")
In [3]:
df.head(3)
Out[3]:
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female

High-level API:

Seaborn functions work on entire datasets and take care of many steps, such as aggregating data automatically.

Example: relplot

The relplot function is designed to visualize statistical relationships between variables:

In [4]:
sns.relplot(
    x="bill_length_mm", y="bill_depth_mm",
    data=df,
)
Out[4]:
<seaborn.axisgrid.FacetGrid at 0x157e693d0>

With the help of a few arguments of the plotting function, you can add more variables to the plot.

Here, for example, the coloring of the scatter dots indicates the species of the penguins:

In [5]:
sns.relplot(
    x="bill_length_mm", y="bill_depth_mm",
    hue="species",
    data=df,
)
Out[5]:
<seaborn.axisgrid.FacetGrid at 0x16c802690>

We can also scale the size of the dots by the penguins' body mass:

In [6]:
sns.relplot(
    x="bill_length_mm", y="bill_depth_mm",
    hue="species",
    size="body_mass_g",
    data=df,
)
Out[6]:
<seaborn.axisgrid.FacetGrid at 0x16c91a150>

Using the parameters col and row, multiple plots can be created based on a categorical variable:

In [7]:
sns.relplot(
    x="bill_length_mm", y="bill_depth_mm",
    hue="sex",
    size="body_mass_g",
    col="species",
    row="island",
    data=df,
)
Out[7]:
<seaborn.axisgrid.FacetGrid at 0x16c94e810>

Continuous relationships can also be visualized using line plots (more on that later)...

Plot types:

Seaborn provides functions for different types of visualizations:

Distributions

Generate histograms or similar plots.

In [8]:
sns.displot(
    x="body_mass_g", col="species",
    hue="sex",
    kde=True,
    data=df
)
Out[8]:
<seaborn.axisgrid.FacetGrid at 0x17f447450>
In [9]:
sns.displot(
    x="body_mass_g", col="species",
    hue="sex",
    kind="kde",
    data=df
)
Out[9]:
<seaborn.axisgrid.FacetGrid at 0x16c88edd0>

Categorical data

Generate plots showing distributions split by certain values for categorical variables.

In [10]:
sns.catplot(
    x="species", y="body_mass_g",
    kind="boxen",
    data=df
)
Out[10]:
<seaborn.axisgrid.FacetGrid at 0x17f52dc50>

However, it also works without a grouping variable:

In [11]:
sns.catplot(
    y="body_mass_g",
    kind="box",
    data=df
)
Out[11]:
<seaborn.axisgrid.FacetGrid at 0x31b48a210>
In [12]:
sns.catplot(
    x="species", y="body_mass_g",
    kind="violin",
    data=df
)
Out[12]:
<seaborn.axisgrid.FacetGrid at 0x31b50f450>
In [13]:
sns.catplot(
    x="species", y="body_mass_g",
    hue="sex",
    data=df
)
Out[13]:
<seaborn.axisgrid.FacetGrid at 0x16c939650>
In [14]:
sns.catplot(
    x="species", y="body_mass_g",
    hue="sex",
    kind="swarm",
    data=df
)
Out[14]:
<seaborn.axisgrid.FacetGrid at 0x31b5a4950>
In [15]:
sns.catplot(
    x="species", y="body_mass_g",
    hue="sex",
    kind="bar",
    data=df
)
Out[15]:
<seaborn.axisgrid.FacetGrid at 0x31b491390>

Regression plot

Fits a regression model to the data and plots the fitted line together with a confidence band.

Can be a neat way to visualize (linear) relations within your data.

In [16]:
sns.lmplot(
    x="body_mass_g", y="bill_length_mm",
    hue="sex",
    col="species",
    data=df,
)
Out[16]:
<seaborn.axisgrid.FacetGrid at 0x31b6cd4d0>
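To make the idea of the fitted line more concrete, here is a minimal sketch of an ordinary least-squares line fit with np.polyfit. It illustrates the underlying principle only and is not seaborn's actual implementation; the toy data are generated for this example.

```python
import numpy as np

# Hypothetical toy data scattered around the line y = 2x + 1
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Least-squares fit of a degree-1 polynomial (a straight line)
slope, intercept = np.polyfit(x, y, deg=1)

# Predicted values along the fitted line
y_hat = slope * x + intercept
```

The estimated slope and intercept should land close to the values used to generate the data (2 and 1).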

Multivariate relationships

Especially in exploratory data analysis, it can be informative to plot different measurements or display formats in combination to gain more "global" insights.

The pairplot, for example, plots all numeric variables of a dataset against each other:

In [17]:
sns.pairplot(hue="species", data=df)
Out[17]:
<seaborn.axisgrid.PairGrid at 0x31b56d790>

The jointplot combines a bivariate plot of two variables (a scatterplot by default) with the marginal distribution of each variable along the axes:

In [18]:
sns.jointplot(
    x="flipper_length_mm", y="bill_length_mm",
    hue="species",
    data=df
)
Out[18]:
<seaborn.axisgrid.JointGrid at 0x31c4e5210>

Seaborn and Pandas: Data Formats

Seaborn is designed to work with pandas DataFrames.

The whole DataFrame can be passed via the data parameter, and columns can then be selected by name.

In [19]:
data = pd.DataFrame({
    "x": np.linspace(0, 20, 10000),
    "y": np.sin(np.linspace(0, 20, 10000))
})
In [20]:
sns.lineplot(x="x", y="y", data=data)
Out[20]:
<Axes: xlabel='x', ylabel='y'>

However, Seaborn also accepts other data types:

In [21]:
x = np.linspace(0, 20, 10000)
y = np.sin(x)

sns.lineplot(x=x, y=y)
Out[21]:
<Axes: >
In [22]:
sns.histplot(y)
Out[22]:
<Axes: ylabel='Count'>

etc.

But of course you lose many of the helpful features of the DataFrame integration (most notably, automatic axis labeling).

DataFrames: Long- vs. Wide-form

DataFrames can hold data in different layouts. In long-form layout, each variable has its own column and each observation its own row.

In wide-form layout, which resembles a traditional Excel spreadsheet, the row and column labels encode the levels of two variables and the cells hold the values of a third.

seaborn works best with long-form data:

In [23]:
flights = sns.load_dataset("flights")
flights.head()
Out[23]:
   year month  passengers
0  1949   Jan         112
1  1949   Feb         118
2  1949   Mar         132
3  1949   Apr         129
4  1949   May         121
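To make the difference between the two layouts concrete, the following sketch converts a small long-form table into wide-form and back using pandas. The four rows are a hand-written toy table in the style of the flights data, not the loaded dataset itself.

```python
import pandas as pd

# Toy long-form data: one row per (year, month) observation
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Long -> wide: one row per year, one column per month
wide_df = long_df.pivot(index="year", columns="month", values="passengers")

# Wide -> long: back to one row per (year, month) observation
back = wide_df.reset_index().melt(
    id_vars="year", var_name="month", value_name="passengers"
)
```

In wide_df, the variable names month and passengers no longer appear as column headers, which is why seaborn can infer less from such a table.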

With long-form data, most plotting functions aggregate and prepare the data automatically.

For example, each year occurs once per month in the data, so lineplot aggregates the monthly passenger counts of each year into their mean and draws a 95% confidence interval around it as a band:

In [24]:
sns.lineplot(x="year", y="passengers", data=flights)
Out[24]:
<Axes: xlabel='year', ylabel='passengers'>
In [25]:
sns.lineplot(x="year", y="passengers", hue="month", data=flights)
Out[25]:
<Axes: xlabel='year', ylabel='passengers'>
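By default, lineplot uses the mean as its estimator whenever several observations share the same x value. The following sketch reproduces that aggregation step by hand with a pandas groupby; the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: three observations per year
toy = pd.DataFrame({
    "year": [1949, 1949, 1949, 1950, 1950, 1950],
    "passengers": [100, 110, 120, 130, 150, 170],
})

# One mean value per year; these are the points the line connects
yearly_mean = toy.groupby("year")["passengers"].mean()
```

With hue="month", every combination of year and month contains exactly one observation, so nothing is left to aggregate and no band is drawn.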

The same mechanism also works the other way round:

In [26]:
sns.lineplot(x="month", y="passengers", data=flights)
Out[26]:
<Axes: xlabel='month', ylabel='passengers'>
In [27]:
sns.lineplot(x="month", y="passengers", hue="year", data=flights.query("month != 'Jan'"))
Out[27]:
<Axes: xlabel='month', ylabel='passengers'>

Messy Data:

Some datasets also come in more complex formats. For example, different hierarchical levels could be mixed.

In [28]:
freqs = pd.read_csv("freqs-engl.txt", sep="\t")
freqs.head()
FileNotFoundError: [Errno 2] No such file or directory: 'freqs-engl.txt'

Example: Comparing the frequencies of you and thou for tragedies and comedies.

To generate a histogram of the frequencies of the two words for both genres, we need to convert the data into long-form using the .melt method of DataFrames.

In [ ]:
plot_df = freqs.query("genre == 'tragedy' or genre == 'comedy'").melt(
    id_vars=["genre", "title", "year"],
    value_vars=["you", "thou"],
    var_name="token",
    value_name="freq"
)
plot_df

Since we lose data by applying this transformation (columns that are listed in neither id_vars nor value_vars are dropped), it is recommended to save the result in a new DataFrame.

In [ ]:
sns.displot(
    x="freq",
    hue="token",
    col="genre",
    kde=True,
    data=plot_df
)
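Because freqs-engl.txt was not available when this notebook was executed, the cells above have no output. The following sketch shows the effect of .melt on a small stand-in table. All titles and numbers are invented, and the column layout is only an assumption based on the column names used in the code above.

```python
import pandas as pd

# Hypothetical stand-in for the frequency table (made-up titles and values)
toy_freqs = pd.DataFrame({
    "title": ["Play A", "Play B", "Play C"],
    "genre": ["tragedy", "comedy", "history"],
    "year": [1600, 1601, 1602],
    "you": [0.012, 0.015, 0.011],
    "thou": [0.004, 0.003, 0.005],
    "the": [0.030, 0.028, 0.031],
})

# Keep tragedies and comedies, then stack the two word columns
# into a single "freq" column, labeled by a new "token" column
toy_long = toy_freqs.query("genre == 'tragedy' or genre == 'comedy'").melt(
    id_vars=["genre", "title", "year"],
    value_vars=["you", "thou"],
    var_name="token",
    value_name="freq",
)
```

Each play now appears once per token, and the column "the" is gone because it was listed in neither id_vars nor value_vars.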

Matplotlib as seaborn's backend and further customization options

seaborn uses matplotlib as a backend framework to create the plots.

This means it is possible to extend seaborn plots with matplotlib.

However, this is not necessary in all cases where you want to customize seaborn plots, because seaborn itself also provides some functions for this.

For this you have to distinguish between two types of plots:

  • axes_level plots
  • figure_level plots

axes_level plots return a matplotlib axes object containing the plot while figure_level plots return a FacetGrid object containing the plot.
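The distinction can be checked directly by inspecting the return values. A minimal sketch with a made-up toy DataFrame, using scatterplot as an example of an axes_level function and relplot as an example of a figure_level function:

```python
import matplotlib.axes
import pandas as pd
import seaborn as sns

# Hypothetical toy data
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 1, 3],
    "group": ["x", "x", "y", "y"],
})

# axes_level: draws onto a single matplotlib Axes and returns it
ax = sns.scatterplot(x="a", y="b", data=toy)

# figure_level: creates its own figure and returns a FacetGrid
g = sns.relplot(x="a", y="b", col="group", data=toy)

is_axes = isinstance(ax, matplotlib.axes.Axes)
is_grid = isinstance(g, sns.FacetGrid)
n_panels = g.axes.size  # one subplot per value of "group"
```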

FacetGrid

FacetGrid objects are special containers that seaborn uses to encapsulate one (or more) subplot(s) together with the data they are drawn from.

In [29]:
df.head(3)
Out[29]:
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female
In [30]:
g = sns.FacetGrid(df)

You can assign individual columns and rows of a 'FacetGrid' to specific variables from the data set.

In [31]:
g = sns.FacetGrid(df, col="species", row="sex", hue="island")

Using the .map method of FacetGrid, it is possible to apply various plotting functions to each subplot (and its associated data) of a FacetGrid.

In [32]:
g.map(sns.scatterplot, "body_mass_g", "bill_length_mm")
g.add_legend()
g.figure
Out[32]:

Certain plotting functions of Seaborn require the data as DataFrame via the data parameter. To apply those functions to the FacetGrid too, you can use the .map_dataframe method.

In [33]:
g = sns.FacetGrid(df, col="species", row="sex", hue="island")

g.map_dataframe(sns.swarmplot, y="body_mass_g")
g.add_legend()
g.figure
Out[33]:

FacetGrid objects encapsulate the subplots they contain in the axes attribute.

In [34]:
g.axes
Out[34]:
array([[<Axes: title={'center': 'sex = Male | species = Adelie'}, ylabel='body_mass_g'>,
        <Axes: title={'center': 'sex = Male | species = Chinstrap'}>,
        <Axes: title={'center': 'sex = Male | species = Gentoo'}>],
       [<Axes: title={'center': 'sex = Female | species = Adelie'}, ylabel='body_mass_g'>,
        <Axes: title={'center': 'sex = Female | species = Chinstrap'}>,
        <Axes: title={'center': 'sex = Female | species = Gentoo'}>]],
      dtype=object)
In [35]:
g.axes[0][0].set_title("1.")
g.axes[0][1].set_title("2.")
g.figure
Out[35]:

The entire graphic is stored in the figure attribute.

Both the axes and the figure are regular matplotlib objects and can be adapted or processed accordingly.

In [36]:
g.figure.suptitle("My first custom FacetGrid :-)", y=1.1)
g.figure
Out[36]:

The advantage of 'FacetGrids' is that you can create and customize your own plots quite flexibly without having to drop any of seaborn's convenient features.
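Besides direct access to the matplotlib objects, FacetGrid also offers its own convenience methods for common adjustments, for example set_axis_labels and set_titles. A minimal sketch with a made-up toy DataFrame; the column names and label texts are assumptions for this example:

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data in the style of the penguins dataset
toy = pd.DataFrame({
    "mass": [3700, 3800, 5000, 5200],
    "length": [39.0, 40.5, 46.0, 47.5],
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
})

g = sns.FacetGrid(toy, col="species")
g.map_dataframe(sns.scatterplot, x="mass", y="length")

# seaborn's own helpers: applied to all subplots at once
g.set_axis_labels("Body mass (g)", "Bill length (mm)")
g.set_titles(col_template="{col_name}")

# matplotlib: title for the whole figure
g.figure.suptitle("Toy penguins", y=1.05)
```

The helper methods save the loop over g.axes that would otherwise be needed to label every subplot.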

figure_level-Plots

Figure-level plot functions such as relplot, catplot, or displot return a FacetGrid object (pairplot and jointplot return the related PairGrid and JointGrid objects).

In [37]:
g = sns.catplot(x="species", y="body_mass_g", hue="sex", data=df)
In [38]:
type(g)
Out[38]:
seaborn.axisgrid.FacetGrid

Since FacetGrid objects manage their own figure and axes, they are hard to embed into other figures and should be used to create a single, self-contained graphic.

axes_level-Plots

As the name suggests, axes_level plots return a matplotlib axes object. axes_level plots are intended to be a drop-in replacement for matplotlib functions and can be well integrated into other plots or matplotlib workflows.

In [39]:
data.head(3)
Out[39]:
       x      y
0  0.000  0.000
1  0.002  0.002
2  0.004  0.004
In [40]:
fig, axes = plt.subplots(2, 1)
axes[0].plot(data["x"], data["y"])
axes[0].set_title("Sine Curve")
sns.histplot(x=data["y"], ax=axes[1])
axes[1].set_title("Histogram of sine values")
fig.tight_layout()
fig.suptitle("Example for a combined matplotlib and seaborn plot", y=1.1)
plt.show()