In [1]:
#%matplotlib notebook
#%matplotlib widget
#%matplotlib ipympl
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(42)

Data Science for Humanities 1¶

Session: (Re-)introduction to Python¶

Part 3: Numpy & Scientific Programming: Moving from lists to arrays¶

Winter term 22/23¶

Prof. Goran Glavaš, Lennart Keller¶

Goals¶

After this part you'll know about:

  • Basics and benefits of Numpy
  • Properties of arrays in Numpy
  • How to create arrays
  • How to index and slice arrays
  • How to reshape arrays depending on your need
  • How to efficiently perform vectorized computations on arrays
  • Basics of broadcasting

Introduction:¶

Even though lists, dicts, custom classes, and all the other datatypes are helpful for processing your data in its raw form (e.g. texts, images, etc.), once you start analyzing it, you'll encode all of this data into numerical representations, like scalars, vectors, or matrices.

The process of encoding all sorts of data numerically is called feature engineering and highly depends on the type of data you want to encode. For now, we skip this process and take a look at the technical side of things.

numpy (short for Numerical Python) is the canonical library to work with large arrays of numerical data. Its conceptual core is the powerful and flexible container numpy.ndarray that can be used to model all sorts of different numerical structures (e.g. vectors, matrices, tensors).

Benefits of learning how numpy works¶

Beyond the ndarray itself, numpy provides a large set of functions that efficiently process ndarrays and implement the basic operations of linear algebra.

ndarrays also form the basis for a lot of other libraries, like pandas or xarray, rendering numpy the root of Python's data science stack.

Third-party libraries like scipy hook into the numpy world and provide further routines for scientific programming, and there is a plethora of libraries that do the same for more specific domains (e.g. rasterio for geospatial data).

There is a strong conceptual (and syntactical) relationship between numpy and deep-learning libraries like tensorflow and pytorch, so learning numpy helps you to quickly become proficient with them if you decide to do deep learning.

But, why don't we use built-in types like lists to store numbers?¶

Answer: Python isn't particularly fast at processing lists!

And often, numpy also allows you to write more concise code.

Example: Dot-Product¶

$$ \vec{a} \cdot \vec{b} = \sum_{i=1}^{N} a_i b_i $$
In [2]:
# Doing it the `pythonic` way
from typing import Union, List

NumericalType = Union[float, int]

def dot_product(x: List[NumericalType], y: List[NumericalType]) -> float:
    result = 0.0
    for xi, yi in zip(x, y):
        result += xi * yi
    return result
In [3]:
x = list(range(1000))
y = list(range(1000))

print(x[:10], y[:10])
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [4]:
dp = dot_product(x, y)
print(dp)
332833500.0
In [5]:
%%timeit
dp = dot_product(x, y)
44.1 µs ± 240 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [6]:
# Doing it the `numpy` way
import numpy as np

x_np = np.array(x)
y_np = np.array(y)

print(x_np[:10], y_np[:10])
[0 1 2 3 4 5 6 7 8 9] [0 1 2 3 4 5 6 7 8 9]
In [7]:
dp_np = (x_np * y_np).sum()
dp_np
Out[7]:
332833500
In [8]:
%%timeit
dp_np = (x_np * y_np).sum()
1.7 µs ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [9]:
%%timeit
# Sometimes you can go even faster if there is a built-in function available
np.dot(x_np, y_np)
964 ns ± 3.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [10]:
print(np.dot(x_np, y_np))
332833500

Why is numpy faster?¶

Reason 1: Fixed datatypes¶

Remember that Python's lists can store arbitrary types. If you iterate over a list, Python has to infer the datatype of each element in order to check which operations can be applied to it.

ndarrays instead have a fixed datatype: they can only store elements of the same type, which eliminates the need to perform type checking for each element.
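A quick sketch of this constraint (the silent truncation on assignment is standard numpy behavior):

int_array = np.array([1, 2, 3])
print(int_array.dtype)  # e.g. int64 (platform dependent)

# Assigning a float into an int array silently truncates it:
int_array[0] = 3.9
print(int_array)  # [3 2 3]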

In [11]:
# The np.array function takes in a (potentially nested) list of numbers and converts it into an ndarray

array = np.array([[1, 2, 3.5], [1, 3, 4]])
print("Type of the array:", type(array))
print("Datatype of the content of the array:", array.dtype)
print(array)
Type of the array: <class 'numpy.ndarray'>
Datatype of the content of the array: float64
[[1.  2.  3.5]
 [1.  3.  4. ]]

Reason 2: Vectorized operations¶

In contrast to lists, arrays have a fixed size: once created, they can't be expanded or reduced.

This allows numpy to efficiently apply element-wise operations, as long as both arrays have the same shape.

In [12]:
a, b = np.array([1, 2, 3]), np.array([4, 5, 6])
a + a, a * b, a - b
Out[12]:
(array([2, 4, 6]), array([ 4, 10, 18]), array([-3, -3, -3]))

It's also possible to efficiently compute with ndarrays and scalar values:

In [13]:
a = np.array([1, 2, 3])
b = a * 2
b
Out[13]:
array([2, 4, 6])
In [14]:
c = a**2
c
Out[14]:
array([1, 4, 9])

Numpy: Usage basics¶

Installation

Numpy is a third-party library and is not included in a standard Python installation.

To install it, open a terminal and type pip install numpy or conda install numpy, depending on your Python setup.

Import

By convention, the whole numpy package is imported under the abbreviation np.

import numpy as np

Creating ndarrays¶

There are many ways to create arrays.

Here is a quick overview of the most common ones:

  • Creating from existing (nested) lists
In [15]:
my_list = [[1, 2], [3, 4]]
my_array = np.array(my_list)
print(my_array)
[[1 2]
 [3 4]]
  • Creating arrays with fixed content
In [16]:
zero_array = np.zeros(shape=(2, 2))
print(zero_array)

ones_array = np.ones((2, 2))
print(ones_array)

constant_array = np.full(shape=(2, 2), fill_value=-100)
print(constant_array)
[[0. 0.]
 [0. 0.]]
[[1. 1.]
 [1. 1.]]
[[-100 -100]
 [-100 -100]]
  • Creating arrays with consecutive content
In [17]:
running_array = np.arange(0, 9)
print(running_array)

evenly_spaced_array = np.linspace(0, 1, num=5)
print(evenly_spaced_array)
[0 1 2 3 4 5 6 7 8]
[0.   0.25 0.5  0.75 1.  ]
  • Creating arrays with random numbers
In [18]:
rand_normal_array = np.random.randn(10000)
print(rand_normal_array.shape)
print("Mean:", rand_normal_array.mean(), "Std:", rand_normal_array.std())

rand_integer_array = np.random.randint(0, 2, (10))
print(rand_integer_array)
(10000,)
Mean: -0.002135983368426207 Std: 1.0034122061299875
[1 1 0 0 0 1 1 0 1 0]

ndarray: Dimensionality¶

Creating an ndarray from a nested list gives you a multidimensional array. In theory, ndarrays can have as many dimensions as you like.

Since arrays have a fixed size, each dimension also has a fixed length.

The number of dimensions and the size of each dimension determine the shape of an array.

In [19]:
table = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print("Number of dimensions:", table.ndim)
print("Shape of the array:", table.shape)
Number of dimensions: 2
Shape of the array: (2, 3)

There are special naming conventions depending on the shape of your arrays:

  • 1-dim arrays are sometimes referred to as vectors
  • 2-dim arrays might be called matrices

Note: Obviously, in a mathematical sense these terms carry more implications, but we ignore those for now.

ndarray: Indexing¶

Like lists, ndarrays support indexing and slicing to address single values or certain patches of your data.

The syntax mostly follows Python's conventions but is extended to directly address specific dimensions.

In [48]:
array = np.arange(27).reshape(3, 3, 3)
print(array.ndim)
print(array.shape)
print(array)
3
(3, 3, 3)
[[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]]

 [[ 9 10 11]
  [12 13 14]
  [15 16 17]]

 [[18 19 20]
  [21 22 23]
  [24 25 26]]]

We created a 3-dimensional array, each dimension having a length of 3.

In [49]:
# Get the first element across all dimensions
print(array[0, 0, 0])

# Get the first "matrix"
print(array[0, :, :])

# Get the last row vector of the second matrix
print(array[1, -1, :])
0
[[0 1 2]
 [3 4 5]
 [6 7 8]]
[15 16 17]

To index or slice an ndarray, you can specify values for each dimension independently.

It's also possible to combine slicing and indexing at specific dimensions.

The values for all dimensions are written into one single indexing expression, separated by commas:

array[<dim0>, <dim1>, <dim2>, ...]
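For instance, on the 3×3×3 array from above, we can mix indexing and slicing freely (a small illustrative sketch):

# Fix the first matrix and the second column, slice over the rows:
print(array[0, :, 1])  # [1 4 7]

# Fix row 0 and column 0, slice over the matrices:
print(array[:, 0, 0])  # [ 0  9 18]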

Quiztime¶

In [50]:
print(array)
[[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]]

 [[ 9 10 11]
  [12 13 14]
  [15 16 17]]

 [[18 19 20]
  [21 22 23]
  [24 25 26]]]

What is the output?

array[0, 1, 0]
In [51]:
array[0, 1, 0]
Out[51]:
3
array[-1, :2, 0]
In [52]:
array[-1, :2, 0]
Out[52]:
array([18, 21])
array[:, :, :]
In [54]:
array[:, :, :]
Out[54]:
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])

ndarray: Masking¶

Masking allows you to operate on only those values in an array which fulfill a certain condition.

In [56]:
array = np.random.randint(0, 100, (100, 100))
smaller_fifty_mask = array < 50
print(smaller_fifty_mask)
print(smaller_fifty_mask.shape)
[[False False False ...  True False False]
 [ True False  True ...  True  True  True]
 [ True False  True ...  True  True  True]
 ...
 [ True False False ...  True  True False]
 [False  True False ...  True False False]
 [False False  True ... False  True False]]
(100, 100)

A mask is just an ndarray of boolean values.

If you apply a mask to another array, only those values for which the mask entry is True are selected.

Masks can be used to perform two kinds of operations:

  1. Retrieving
  2. In-place operations

Retrieving¶

In [57]:
values_smaller_than_fifty = array[smaller_fifty_mask]
print(values_smaller_than_fifty)
print(values_smaller_than_fifty.shape)
[32  4 10 ... 25 36  1]
(4991,)

Note: The result is flattened, so you'll lose all positional information.
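If you do need the positions, you can apply np.nonzero to the mask itself; a short sketch (this goes beyond the example above):

# np.nonzero returns one index array per dimension
rows, cols = np.nonzero(smaller_fifty_mask)
print(rows[:3], cols[:3])         # positions of the first three selected values
print(array[rows[:3], cols[:3]])  # the corresponding values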

In-place operations¶

In [58]:
array[smaller_fifty_mask] = 0
print(array)
[[81 72 71 ...  0 82 75]
 [ 0 75  0 ...  0  0  0]
 [ 0 60  0 ...  0  0  0]
 ...
 [ 0 85 88 ...  0  0 53]
 [50  0 93 ...  0 78 91]
 [84 79  0 ... 54  0 75]]

ndarray: Reshaping¶

It's possible to change the shape of an existing array.

This operation is called reshaping. Reshaping works by specifying the new number of dimensions and their sizes.

To make reshaping work, the number of entries in the newly created array has to match the number of entries in the existing array.

In [27]:
vector = np.array([1, 2, 3, 4, 5, 6, 7, 8])
print(vector)
print("Orig shape:", vector.shape)
print("No. of entries:", vector.size)

matrix = vector.reshape(2, 4)
print(matrix)
print("New shape:", matrix.shape)
print("No. of entries:", matrix.size)

tuples = vector.reshape(-1, 2)
print(tuples)
print("New shape:", tuples.shape)
print("No. of entries:", tuples.size)
[1 2 3 4 5 6 7 8]
Orig shape: (8,)
No. of entries: 8
[[1 2 3 4]
 [5 6 7 8]]
New shape: (2, 4)
No. of entries: 8
[[1 2]
 [3 4]
 [5 6]
 [7 8]]
New shape: (4, 2)
No. of entries: 8

The .reshape method takes as many arguments as there are desired dimensions, and each argument specifies the number of entries in that dimension.

It's also possible to leave out the size of one dimension and just write -1; numpy will infer the size automatically.
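A minimal sketch of both points, the -1 shorthand and the size constraint (the exact error message may differ between numpy versions):

vector = np.arange(8)
print(vector.reshape(2, -1).shape)  # (2, 4) -- numpy infers the 4

try:
    vector.reshape(3, 3)  # would need 9 entries, but vector only has 8
except ValueError as e:
    print("Reshape failed:", e)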

Quiz¶

cube = np.arange(27).reshape(3, 3, 3)
arr = cube.reshape(-1)
arr.shape

What shape does arr have?

In [60]:
cube = np.arange(27).reshape(3, 3, 3)
arr = cube.reshape(-1)
arr.shape
arr
Out[60]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26])

ndarray: Computations¶

ufuncs¶

As we saw in the introductory example, computations with ndarrays can be amazingly fast (compared to standard Python).

This is achieved by using vectorized functions called ufuncs, which leverage the constraints of ndarrays to speed things up.

There are two types of ufuncs:

  • Unary functions, which operate on a single input and produce a single output
  • Binary functions, which take two inputs and produce one output

Examples: Unary ufuncs¶

In [28]:
x = np.random.randint(-10, 0, (10,))
In [29]:
print(np.abs(x))
print(np.square(x))
print(np.exp(x))
print("And many more!")
[ 6  9  9  7 10  4 10  8  1  4]
[ 36  81  81  49 100  16 100  64   1  16]
[2.47875218e-03 1.23409804e-04 1.23409804e-04 9.11881966e-04
 4.53999298e-05 1.83156389e-02 4.53999298e-05 3.35462628e-04
 3.67879441e-01 1.83156389e-02]
And many more!

Binary ufuncs¶

numpy overloads Python's standard arithmetic operators to perform the most essential operations as ufuncs.

In [30]:
print("Addition")
print(x + x)
print(np.add(x, x))

print("Multiplication")
print(x * x)
print(np.multiply(x, x))

print("Divide")
print(x / x)
print(np.divide(x, x))
Addition
[-12 -18 -18 -14 -20  -8 -20 -16  -2  -8]
[-12 -18 -18 -14 -20  -8 -20 -16  -2  -8]
Multiplication
[ 36  81  81  49 100  16 100  64   1  16]
[ 36  81  81  49 100  16 100  64   1  16]
Divide
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

All other functions in numpy or scipy that take in ndarrays are vectorized too.

So as long as you stick to functions from these libraries, there is no need to worry about speed.

Aggregation functions¶

Another class of functions offered by numpy is the aggregation functions.

These functions take in an ndarray, apply an operation that aggregates its values, and return the result.

They are often used to compute summary statistics on data.

We already saw two examples of aggregation functions:

In [31]:
data = np.array([1, 2, 3])
data.mean(), data.std()
Out[31]:
(2.0, 0.816496580927726)

They compute the average or standard deviation of all values in an ndarray.

On arrays with more than one dimension, they simply gather all values into one flat array and then perform the aggregation on it:

In [32]:
data = np.arange(27).reshape(3, 3, 3)
data.mean(), data.reshape(-1).mean()
Out[32]:
(13.0, 13.0)

Their real potential lies in their ability to perform aggregations along specific dimensions/axes.
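On a small 2×3 matrix, the two axes behave like this (a minimal sketch):

m = np.arange(6).reshape(2, 3)  # [[0 1 2], [3 4 5]]
print(m.sum(axis=0))  # aggregates over the rows    -> [3 5 7]
print(m.sum(axis=1))  # aggregates over the columns -> [ 3 12]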

Example:

Suppose you are an erratic billionaire who just bought a social media platform and now has to quickly decrease operational costs to satisfy your co-financiers.

To decide who is due for layoff, your HR department provides you with a dataset of all your programmers and the number of lines of code each of them wrote on each of the past 30 days.

In [33]:
# Shape [n_programmers, n_days]
programmers_loc = np.random.randint(250, 500, size=(600, 30))
decay = np.arange(30) * np.random.randint(0, 20, size=(30,))
programmers_loc -= decay

print("Average number of lines of code written:", programmers_loc.mean())
Average number of lines of code written: 244.63327777777778

But just computing the overall mean doesn't help you decide whom to fire. You need the average number of lines written per employee.

You can compute these values by explicitly stating an axis along which the aggregation function has to operate. In numpy, axes (or dimensions, as we called them earlier) are indexed numerically.

Our dataset has the shape [n_programmers, n_days], so to compute each programmer's average number of lines of code written over the past 30 days, we have to aggregate all values along the second dimension (index=1):

In [34]:
average_loc_per_employee = programmers_loc.mean(axis=1)
print(average_loc_per_employee.shape)
(600,)

Doing so returns another ndarray with 600 entries, each representing one programmer's mean over the 30 days.

Using another aggregation function, argmin, which returns the index of the smallest element in an array (or along a given dimension), you can now easily find the most unproductive programmer.

In [35]:
np.min(average_loc_per_employee), np.argmin(average_loc_per_employee)
Out[35]:
(202.46666666666667, 495)

Obviously, it's possible to perform those kinds of operations along any dimension. Computing the mean along the first axis (index=0) gives you the average number of LoC across all programmers for each of the past 30 days:

In [36]:
plt.plot(programmers_loc.mean(axis=0))
Out[36]:
[<matplotlib.lines.Line2D at 0x10cae88e0>]

Side-Note: Methods vs. functions¶

Most aggregation operations are available in two "flavors": either as a function accessible via the np module, or as a method on the ndarray itself.

In [62]:
np.mean(programmers_loc), programmers_loc.mean()
Out[62]:
(244.63327777777778, 244.63327777777778)

Since there is no functional difference between those "flavors", it's up to you which one to choose :)

ndarray: Broadcasting¶

As we've seen, numpy is fast at processing ndarrays of the same shape.

This applies not only to vectors, but also to arrays of arbitrary dimensionality:

In [38]:
data1 = np.random.randint(0, 10, size=(2, 2, 2))
data2 = np.ones((2, 2, 2), dtype=np.int64)

print(data1)

print(data1 + data2)
[[[5 5]
  [3 1]]

 [[3 1]
  [6 8]]]
[[[6 6]
  [4 2]]

 [[4 2]
  [7 9]]]

But we've also seen that it is possible to perform arithmetic operations with ndarrays and scalar values.

In [39]:
print(data1 + 1)
[[[6 6]
  [4 2]]

 [[4 2]
  [7 9]]]

It is also possible to compute with arrays of different shapes under some circumstances:

In [63]:
matrix = np.arange(9).reshape(3, 3)
print(matrix)

row_vec = np.array([1, -1, 5])
print(row_vec)
print("matrix + row_vec=\n", matrix + row_vec)

col_vec = np.array([[1], [-1], [5]])
print(col_vec.shape)
print(col_vec)
print("matrix + col_vec=\n", matrix + col_vec)
[[0 1 2]
 [3 4 5]
 [6 7 8]]
[ 1 -1  5]
matrix + row_vec=
 [[ 1  0  7]
 [ 4  3 10]
 [ 7  6 13]]
(3, 1)
[[ 1]
 [-1]
 [ 5]]
matrix + col_vec=
 [[ 1  2  3]
 [ 2  3  4]
 [11 12 13]]

To perform those operations on ndarrays of mixed dimensionality, numpy uses broadcasting: it automatically adjusts the dimensions of the array with fewer dimensions by (virtually) padding it to the right size.

Source: https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html

To successfully perform broadcasting, some constraints on the shapes and sizes of the arrays have to be fulfilled:

for dim_a, dim_b in zip(a.shape[::-1], b.shape[::-1]):
    if not (1 in (dim_a, dim_b) or dim_a == dim_b):
        raise ValueError("operands could not be broadcast together")

If the arrays do not have the same number of dimensions, the array with fewer dimensions is expanded by adding dimensions of size 1 on the left.

(e.g. np.array([1]) with shape (1,) is treated as if it had shape (1, 1))
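A short sketch of these rules in action (the error message is numpy's, abbreviated here):

a = np.ones((3, 3))

print((a + np.ones(3)).shape)       # (3,) is padded to (1, 3) and stretched -> (3, 3)
print((a + np.ones((3, 1))).shape)  # (3, 1) is stretched along the last axis -> (3, 3)

try:
    a + np.ones(2)  # trailing dimensions 3 and 2 are incompatible
except ValueError as e:
    print("Broadcast failed:", e)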

Broadcasting Example: Convert RGB- to grayscale-images¶

Before we start let's load an image.

In [64]:
from skimage import data

picture = data.astronaut()
print(type(picture))
print("Shape of the picture", picture.shape)

plt.imshow(picture)
plt.show()
<class 'numpy.ndarray'>
Shape of the picture (512, 512, 3)

We can see that the picture has a spatial resolution of 512 × 512 pixels.

The picture follows the RGB color model, so each pixel is described by the amounts of red, green, and blue.

The channels make up the last dimension of the picture.

Using indexing, it is easy to look at each channel independently:

In [66]:
channel_map = {0: "red", 1: "green", 2: "blue"}
for channel_index in range(picture.shape[2]):
    channel_name = channel_map[channel_index]
    plt.imshow(picture[:, :, channel_index], cmap="Greys")
    plt.title(f"Channel: {channel_name}")
    plt.show()

The RGB color model is an additive one: by adding the maximum amount of all three basic colors, we obtain white.

In [43]:
white = np.full((512, 512, 3), 255)
white[128:384, 128:384, :] = [127, 127, 255]
plt.imshow(white)
plt.show()

Vice versa, by adding no amount of any color, a pixel will be black. Grayscale images do not encode any color and just represent the amount of light visible in the picture.

To create a grayscale version of an RGB image, we have to compute the amount of light using the following formula: $$ Y = 0.2126 R + 0.7152 G + 0.0722 B $$ This is effectively a weighted sum over the three channels of the image. (The code below uses slightly different, but equally common, weighting coefficients.)

Since we want to perform this operation for all 262144 pixels in the image, broadcasting comes in naturally:

In [44]:
# Create an array with the RGB-to-grayscale weights
rgb_grayscale_weights = np.array([0.2125, 0.7154, 0.0721])

print("First pixel in original image", picture[0, 0, :])

# First, we use broadcasting to scale all values in the RGB channels
scaled_rgb_picture = rgb_grayscale_weights * picture
print("First pixel with scaled rgb values", scaled_rgb_picture[0, 0, :])

# To receive the gray values, just sum over the channels for each pixel
grayscaled_picture = scaled_rgb_picture.sum(axis=2)

print("Grayscaled image shape", grayscaled_picture.shape)
plt.imshow(grayscaled_picture, cmap="gray")

plt.show()
First pixel in original image [154 147 151]
First pixel with scaled rgb values [ 32.725  105.1638  10.8871]
Grayscaled image shape (512, 512)

Further reading¶

  • Python Data Science Handbook: Introduction to NumPy