#%matplotlib notebook
#%matplotlib widget
#%matplotlib ipympl
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(42)
After this part you'll know about:
- what `numpy` is and why it is faster than plain Python lists
- how to create `ndarray`s
- indexing, slicing, and masking
- reshaping and computing with `ufunc`s and aggregation functions
- broadcasting
Lists, dicts, your custom classes, and all the other datatypes are helpful for processing your data in its raw form (e.g. texts, images, etc.), but once you start to analyze it, you'll encode all of this data into numerical representations like scalars, vectors, or matrices.
The process of encoding all sorts of data numerically is called feature engineering and depends heavily on the type of data you want to encode. For now, we skip this process and take a look at the technical side of things.
`numpy` (short for Numerical Python) is the canonical library for working with large arrays of numerical data. Its conceptual core is the powerful and flexible container `numpy.ndarray`, which can be used to model all sorts of numerical structures (e.g. vectors, matrices, tensors).
## How `numpy` works

In addition to that, `numpy` provides a large set of functions that efficiently process `ndarray`s and provide the basic operations of linear algebra.
`ndarray`s also form the basis for a lot of other libraries, like `pandas` or `xarray`, rendering `numpy` the root of Python's data science stack.
Third-party libraries like `scipy` hook into the `numpy` world and provide further routines for scientific programming, and there is a plethora of libraries that do the same for more specific domains (e.g. `rasterio` for geospatial data).
There is a strong conceptual (and syntactical) relationship between `numpy` and deep-learning libraries like `tensorflow` and `pytorch`, so learning `numpy` helps you to quickly become proficient with them if you decide to do deep learning.
So why not simply stick with lists? Answer: Python isn't particularly fast at processing lists!
And often, `numpy` also allows you to write more concise code.
# Doing it the `pythonic` way
from typing import Union, List

NumericalType = Union[float, int]

def dot_product(x: List[NumericalType], y: List[NumericalType]) -> float:
    result = 0.0
    for xi, yi in zip(x, y):
        result += xi * yi
    return result
x = list(range(1000))
y = list(range(1000))
print(x[:10], y[:10])
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
dp = dot_product(x, y)
print(dp)
332833500.0
%%timeit
dp = dot_product(x, y)
44.1 µs ± 240 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Doing it the `numpy` way
import numpy as np
x_np = np.array(x)
y_np = np.array(y)
print(x[:10], y[:10])
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
dp_np = (x_np * y_np).sum()
dp_np
332833500
%%timeit
dp_np = (x_np * y_np).sum()
1.7 µs ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
# Sometimes, you can go even faster, if there is a built-in function available
np.dot(x_np, y_np)
964 ns ± 3.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
print(np.dot(x_np, y_np))
332833500
## Why is `numpy` faster?

Remember that Python's `list`s can store arbitrary types. If you iterate over a list, Python has to infer the datatype of each element in order to check which operations can be applied to it. `ndarray`s instead have a fixed datatype, allowing them to store only elements of the same type and thereby eliminating the need to perform type checking for each element.
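A minimal sketch of what the fixed datatype means in practice (our own example):

fixed = np.array([1, 2, 3])   # the dtype is inferred once for the whole array
print(fixed.dtype)            # e.g. int64 (platform dependent)
fixed[0] = 7.9                # values are coerced to the fixed dtype: becomes 7
print(fixed)
# fixed[0] = "seven"          # would raise a ValueError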
# The np.array function takes in a (potentially nested) list of numbers and converts it into an ndarray
array = np.array([[1, 2, 3.5], [1, 3, 4]])
print("Type of the array:", type(array))
print("Datatype of the content of the array:", array.dtype)
print(array)
Type of the array: <class 'numpy.ndarray'> Datatype of the content of the array: float64 [[1. 2. 3.5] [1. 3. 4. ]]
In contrast to `list`s, arrays have a fixed size. Once created, they can't be expanded or reduced (see the sketch below).
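For instance, `np.append` does not grow an array in place; it allocates and returns a new one:

a = np.array([1, 2, 3])
b = np.append(a, 4)   # returns a new array; `a` itself is unchanged
print(a)              # [1 2 3]
print(b)              # [1 2 3 4]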
This allows `numpy` to efficiently apply element-wise operations, as long as both arrays have the same shape.
a, b = np.array([1, 2, 3]), np.array([4, 5, 6])
a + a, a * b, a - b
(array([2, 4, 6]), array([ 4, 10, 18]), array([-3, -3, -3]))
It's also possible to efficiently compute with `ndarray`s and scalar values:
a = np.array([1, 2, 3])
b = a * 2
b
array([2, 4, 6])
c = a**2
c
array([1, 4, 9])
## Installation

`numpy` is a third-party library and is not included in a standard Python installation. To install it, open a terminal and type `pip install numpy` or `conda install numpy`, depending on your Python setup.
## Import

By convention, the whole `numpy` package is imported under the abbreviation `np`.
import numpy as np
## Creating `ndarray`s

There are many ways to create `ndarray`s. Here is a quick overview of the most common ones:
my_list = [[1, 2], [3, 4]]
my_array = np.array(my_list)
print(my_array)
[[1 2] [3 4]]
zero_array = np.zeros(shape=(2, 2))
print(zero_array)
ones_array = np.ones((2, 2))
print(ones_array)
constant_array = np.full(shape=(2, 2), fill_value=-100)
print(constant_array)
[[0. 0.] [0. 0.]] [[1. 1.] [1. 1.]] [[-100 -100] [-100 -100]]
running_array = np.arange(0, 9)
print(running_array)
evenly_spaced_array = np.linspace(0, 1, num=5)
print(evenly_spaced_array)
[0 1 2 3 4 5 6 7 8] [0. 0.25 0.5 0.75 1. ]
rand_normal_array = np.random.randn(10000)
print(rand_normal_array.shape)
print("Mean:", rand_normal_array.mean(), "Std:", rand_normal_array.std())
rand_integer_array = np.random.randint(0, 2, (10))
print(rand_integer_array)
(10000,) Mean: -0.002135983368426207 Std: 1.0034122061299875 [1 1 0 0 0 1 1 0 1 0]
## `ndarray`: Dimensionality

Creating an `ndarray` from a nested list gives you a multidimensional array. In theory, `ndarray`s can have as many dimensions as you like. Since arrays have a fixed size, each dimension also has a fixed length. The number of dimensions and the size of each dimension determine the shape of an array.
table = np.array([
[1, 2, 3],
[4, 5, 6]
])
print("Number of dimensions:", table.ndim)
print("Shape of the array:", table.shape)
Number of dimensions: 2 Shape of the array: (2, 3)
There are special naming conventions depending on the shape of your arrays (illustrated below):
- 0 dimensions: scalar
- 1 dimension: vector
- 2 dimensions: matrix
- 3 or more dimensions: tensor

Note: Obviously, in a mathematical sense these terms have more implications, but we ignore those for now.
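A small sketch of these conventions (our own example):

scalar = np.array(3.14)               # 0 dimensions
vector = np.array([1, 2, 3])          # 1 dimension
matrix = np.array([[1, 2], [3, 4]])   # 2 dimensions
tensor = np.zeros((2, 3, 4))          # 3 (or more) dimensions
print(scalar.ndim, vector.ndim, matrix.ndim, tensor.ndim)  # 0 1 2 3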
## `ndarray`: Indexing

Like with lists, you can use indexing or slicing to address single values or certain patches of your data. The syntax mostly follows Python's conventions but is extended to directly address specific dimensions.
array = np.arange(27).reshape(3, 3, 3)
print(array.ndim)
print(array.shape)
print(array)
3 (3, 3, 3) [[[ 0 1 2] [ 3 4 5] [ 6 7 8]] [[ 9 10 11] [12 13 14] [15 16 17]] [[18 19 20] [21 22 23] [24 25 26]]]
We created a 3-dimensional array, each dimension having a length of 3 entries.
# Get the first element across all dimensions
print(array[0, 0, 0])
# Get the first "matrix"
print(array[0, :, :])
# Get the last row vector of the second matrix
print(array[1, -1, :])
0 [[0 1 2] [3 4 5] [6 7 8]] [15 16 17]
To index or slice `ndarray`s, you can specify values for each dimension independently. It's also possible to combine slicing and indexing at specific dimensions. The values for all dimensions are written into one single indexing expression, separated by commas:
array[<dim0>, <dim1>, <dim2>, ...]
print(array)
[[[ 0 1 2] [ 3 4 5] [ 6 7 8]] [[ 9 10 11] [12 13 14] [15 16 17]] [[18 19 20] [21 22 23] [24 25 26]]]
What is the output?
array[0, 1, 0]
3
array[-1, :2, 0]
array([18, 21])
array[:, :, :]
array([[[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8]], [[ 9, 10, 11], [12, 13, 14], [15, 16, 17]], [[18, 19, 20], [21, 22, 23], [24, 25, 26]]])
## `ndarray`: Masking

Masking allows you to operate on only those values in an array which fulfill a certain condition.
array = np.random.randint(0, 100, (100, 100))
smaller_fifty_mask = array < 50
print(smaller_fifty_mask)
print(smaller_fifty_mask.shape)
[[False False False ... True False False] [ True False True ... True True True] [ True False True ... True True True] ... [ True False False ... True True False] [False True False ... True False False] [False False True ... False True False]] (100, 100)
A mask is just an `ndarray` with boolean values. If you apply a mask to another array, only those values for which the mask entry is `True` are selected. Masks can be used to perform two kinds of operations:
1. selecting the masked values,
2. assigning new values to the masked positions.
values_smaller_than_fifty = array[smaller_fifty_mask]
print(values_smaller_than_fifty)
print(values_smaller_than_fifty.shape)
[32 4 10 ... 25 36 1] (4991,)
Note: The result is flattened, so you'll lose all positional information.
array[smaller_fifty_mask] = 0
print(array)
[[81 72 71 ... 0 82 75] [ 0 75 0 ... 0 0 0] [ 0 60 0 ... 0 0 0] ... [ 0 85 88 ... 0 0 53] [50 0 93 ... 0 78 91] [84 79 0 ... 54 0 75]]
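As a side note (our own sketch): boolean masks can be combined element-wise with `&`, `|`, and `~` to express compound conditions.

data = np.random.randint(0, 100, (100, 100))
between_mask = (data >= 20) & (data < 50)   # element-wise AND of two masks
print(data[between_mask].min(), data[between_mask].max())  # both within [20, 49]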
## `ndarray`: Reshaping

It's possible to change the shape of an existing array. This operation is called reshaping. Reshaping works by specifying the new number and sizes of dimensions. For reshaping to work, the number of entries in the newly created array has to match the number of entries in the existing array.
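For instance (our own sketch), a reshape whose target shape implies a different number of entries fails:

v = np.arange(8)   # 8 entries
v.reshape(2, 4)    # fine: 2 * 4 == 8
# v.reshape(3, 3)  # ValueError: cannot reshape array of size 8 into shape (3,3)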
vector = np.array([1, 2, 3, 4, 5, 6, 7, 8])
print(vector)
print("Orig shape:", vector.shape)
print("No. of entries:", vector.size)
matrix = vector.reshape(2, 4)
print(matrix)
print("New shape:", matrix.shape)
print("No. of entries:", matrix.size)
tuples = vector.reshape(-1, 2)
print(tuples)
print("New shape:", tuples.shape)
print("No. of entries:", tuples.size)
[1 2 3 4 5 6 7 8] Orig shape: (8,) No. of entries: 8 [[1 2 3 4] [5 6 7 8]] New shape: (2, 4) No. of entries: 8 [[1 2] [3 4] [5 6] [7 8]] New shape: (4, 2) No. of entries: 8
The `.reshape` method takes in as many arguments as desired dimensions, and each argument specifies the number of entries in that dimension. It's possible to leave out the size of one dimension and just write `-1`; `numpy` will infer the size automatically.
cube = np.arange(27).reshape(3, 3, 3)
arr = cube.reshape(-1)
arr.shape
What shape does arr
have?
(27,)
arr
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26])
## `ndarray`: Computations

### `ufunc`s

As we saw in the introductory example, computations with `ndarray`s can be amazingly fast (compared to standard Python). This is achieved by using vectorized functions called `ufunc`s that leverage the constraints of `ndarray`s to speed things up. There are two types of `ufunc`s: unary ones, which operate on a single array, and binary ones, which operate on two arrays.
### Unary `ufunc`s

x = np.random.randint(-10, 0, (10,))
print(np.abs(x))
print(np.square(x))
print(np.exp(x))
print("And many more!")
[ 6 9 9 7 10 4 10 8 1 4] [ 36 81 81 49 100 16 100 64 1 16] [2.47875218e-03 1.23409804e-04 1.23409804e-04 9.11881966e-04 4.53999298e-05 1.83156389e-02 4.53999298e-05 3.35462628e-04 3.67879441e-01 1.83156389e-02] And many more!
### Binary `ufunc`s

`numpy` uses Python's standard arithmetic operators to perform the most essential operations as `ufunc`s.
print("Addition")
print(x + x)
print(np.add(x, x))
print("Multiplication")
print(x * x)
print(np.multiply(x, x))
print("Divide")
print(x / x)
print(np.divide(x, x))
Addition [-12 -18 -18 -14 -20 -8 -20 -16 -2 -8] [-12 -18 -18 -14 -20 -8 -20 -16 -2 -8] Multiplication [ 36 81 81 49 100 16 100 64 1 16] [ 36 81 81 49 100 16 100 64 1 16] Divide [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
All other functions in `numpy` or `scipy` that take in `ndarray`s are vectorized in the same way. So as long as you take them from these libraries, there is no need to worry about speed.
Another class of functions offered by `numpy` are aggregation functions. These functions take in an `ndarray`, apply an operation that combines its values, and return the aggregated result. They are often used to compute summarizing statistics on data. We already saw two examples of aggregation functions:
data = np.array([1, 2, 3])
data.mean(), data.std()
(2.0, 0.816496580927726)
They compute the average or standard deviation of all values in an `ndarray`. On arrays with more than one dimension, they simply gather all values into one big flat array and then perform the aggregation on it:
data = np.arange(27).reshape(3, 3, 3)
data.mean(), data.reshape(-1).mean()
(13.0, 13.0)
Their real potential lies in their ability to perform aggregation along certain dimensions/axes.
Example:
Suppose you are an erratic billionaire who just bought a social media platform and now has to quickly decrease operational costs to satisfy your co-financiers. To decide who is due for a layoff, your HR department provides you with a dataset of all your programmers and the number of lines of code each of them wrote per day over the past 30 days.
# Shape [n_programmers, n_days]
programmers_loc = np.random.randint(250, 500, size=(600, 30))
decay = np.arange(30) * np.random.randint(0, 20, size=(30,))
programmers_loc -= decay
print("Average number of lines of code written:", programmers_loc.mean())
Average number of lines of code written: 244.63327777777778
But just computing the overall mean doesn't help you decide whom to fire: you need the average number of lines written per employee. You can compute these values by explicitly stating an axis along which the aggregation function has to operate. In `numpy`, axes (or dimensions, as we called them earlier) are indexed numerically. Our dataset has the shape [n_programmers, n_days], so to compute each programmer's average over the past 30 days, we have to aggregate all values along the second dimension (index=1):
average_loc_per_employee = programmers_loc.mean(axis=1)
print(average_loc_per_employee.shape)
(600,)
Doing so returns another `ndarray` with 600 entries, each representing one programmer's mean over 30 days. Using another aggregation function, `argmin`, which returns the index of the smallest element in an array (optionally per dimension), you can now easily find the least productive programmer.
np.min(average_loc_per_employee), np.argmin(average_loc_per_employee)
(202.46666666666667, 495)
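For completeness (our own sketch): `argmin` also accepts an `axis` argument, e.g. to find the least productive programmer for each day.

per_day_worst = programmers_loc.argmin(axis=0)  # index of the programmer with the fewest LoC per day
print(per_day_worst.shape)                      # (30,)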
Obviously, it's possible to perform those kinds of operations along all dimensions. Computing the mean along the first axis (index=0) returns the average number of LoC across all programmers for each of the past 30 days:
plt.plot(programmers_loc.mean(axis=0))
[<matplotlib.lines.Line2D at 0x10cae88e0>]
Most aggregation operations are available in two "flavors": either as a function accessible via the `np` object, or as a method of the concrete `ndarray` itself.
np.mean(programmers_loc), programmers_loc.mean()
(244.63327777777778, 244.63327777777778)
Since there is no functional difference between those "flavors", it's up to you to choose :)
## `ndarray`: Broadcasting

As we've seen, `numpy` is fast at processing `ndarray`s of the same shape. This does not only apply to vectors, but also to arrays of arbitrary dimensionality:
data1 = np.random.randint(0, 10, size=(2, 2, 2))
data2 = np.ones((2, 2, 2), dtype=np.int64)
print(data1)
print(data1 + data2)
[[[5 5] [3 1]] [[3 1] [6 8]]] [[[6 6] [4 2]] [[4 2] [7 9]]]
But we've also seen that it is possible to perform arithmetic operations with `ndarray`s and scalar values.
print(data1 + 1)
[[[6 6] [4 2]] [[4 2] [7 9]]]
It is also possible to compute with arrays of different shapes under some circumstances:
matrix = np.arange(9).reshape(3, 3)
print(matrix)
row_vec = np.array([1, -1, 5])
print(row_vec)
print("matrix + row_vec=\n", matrix + row_vec)
col_vec = np.array([[1], [-1], [5]])
print(col_vec.shape)
print(col_vec)
print("matrix + col_vec=\n", matrix + col_vec)
[[0 1 2] [3 4 5] [6 7 8]] [ 1 -1 5] matrix + row_vec= [[ 1 0 7] [ 4 3 10] [ 7 6 13]] (3, 1) [[ 1] [-1] [ 5]] matrix + col_vec= [[ 1 2 3] [ 2 3 4] [11 12 13]]
To perform those operations on `ndarray`s of mixed dimensionality, `numpy` uses broadcasting to automatically adjust the dimensions of the array with fewer dimensions, by (virtually) expanding it to the right size. For broadcasting to succeed, some constraints on the shapes of the arrays have to be fulfilled. Conceptually, the trailing dimensions are compared pairwise:

for dim_a, dim_b in zip(a.shape[::-1], b.shape[::-1]):
    if not (1 in (dim_a, dim_b) or dim_a == dim_b):
        raise ValueError("operands could not be broadcast together")

If the arrays do not have the same number of dimensions, the array with fewer dimensions is expanded by adding dimensions of size 1 to the left (e.g. shape (3,) is treated as (1, 3)).
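A short sketch of these rules in action (our own example):

a = np.ones((2, 3))
b = np.arange(3)          # shape (3,) -> treated as (1, 3) -> stretched to (2, 3)
print((a + b).shape)      # (2, 3)
col = np.ones((3, 1))
row = np.ones((1, 4))
print((col + row).shape)  # (3, 4): both size-1 dimensions are stretched
# np.ones((2, 3)) + np.ones((2, 4))  # ValueError: operands could not be broadcast together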
Let's see broadcasting in action on an image. Before we start, let's load one.
from skimage import data
picture = data.astronaut()
print(type(picture))
print("Shape of the picture", picture.shape)
plt.imshow(picture)
plt.show()
<class 'numpy.ndarray'> Shape of the picture (512, 512, 3)
We can see that the picture has a spatial resolution of 512 × 512 pixels. The picture follows the RGB model, so each pixel is described by the amounts of red, green, and blue. The channels make up the last dimension of the picture.
Using indexing it is easily possible to look at each channel independently:
channel_map = {0: "red", 1: "green", 2: "blue"}
for channel_index in range(picture.shape[2]):
channel_name = channel_map[channel_index]
plt.imshow(picture[:, :, channel_index], cmap="Greys")
plt.title(f"Channel: {channel_name}")
plt.show()
The RGB color model is an additive one: by adding the maximum amount of all three base colors, we receive white.
white = np.full((512, 512, 3), 255)
white[128:384, 128:384, :] = [127, 127, 255]
plt.imshow(white)
plt.show()
Vice versa, by adding no amount of any color, the pixel will be black. Grayscale images do not encode any color and just represent the amount of light visible in the picture. To create a grayscale version of an RGB image, we have to compute the amount of light using the following formula:
$$ Y = 0.2126\,R + 0.7152\,G + 0.0722\,B $$
This is effectively a weighted sum over the three channels of the image. Since we want to perform this operation for all 262144 pixels in the image, broadcasting comes in naturally:
# Create an array with the (approximate) RGB-to-grayscale weights
rgb_grayscale_weights = np.array([0.2125, 0.7154, 0.0721])
print("First pixel in original image", picture[0, 0, :])
# First, we use broadcasting to scale all values in the RGB channels
scaled_rgb_picture = rgb_grayscale_weights * picture
print("First pixel with scaled rgb values", scaled_rgb_picture[0, 0, :])
# To receive the gray values, just sum over the channels for each pixel
grayscaled_picture = scaled_rgb_picture.sum(axis=2)
print("Grayscaled image shape", grayscaled_picture.shape)
plt.imshow(grayscaled_picture, cmap="gray")
plt.show()
First pixel in original image [154 147 151] First pixel with scaled rgb values [ 32.725 105.1638 10.8871] Grayscaled image shape (512, 512)