Data Science for Humanities 1¶

Session: (Re-)introduction to Python¶

Part 2: Functions, Functional Programming & OOP¶

Winter term 22/23¶

Prof. Goran Glavaš, Lennart Keller¶

Goals¶

After this part you'll know about:

  • How to define functions
  • The basics of functional programming, especially:
    • List comprehensions
    • First-class functions (functions as objects)
  • The basics of object-oriented programming, especially:
    • How to interact with objects
    • How to define custom classes
    • How to use inheritance to customize existing classes

Functions¶

  • Functions are the easiest way of making parts of your code reusable
  • As we already saw for built-in functions, they receive one or multiple inputs and give you back one output or multiple outputs
Example: Mean¶
In [1]:
def mean(values):
    return sum(values) / len(values)
In [2]:
mean([1,2,3])
Out[2]:
2.0

How to define custom functions¶

  • Keyword def signifies the start of a function definition
  • def is followed by the function name (it should be unique within the current scope, since reusing a name overwrites the earlier definition)
  • The function name is followed by a list of arguments in round brackets
    • If the function does not take any arguments, the round brackets are still necessary
  • The function body contains the code of the function
  • Function bodies can be closed in two different ways:
    • Explicitly: By using the return keyword and defining what the function returns
    • Implicitly: Without the return keyword and just changing the indentation. In this case, the return value is None
In [3]:
# A function with two arguments
def concat_strings(string_1, string_2):
    return string_1 + string_2

# A function with no arguments
def smile():
    return "😊"

print(concat_strings("Hello ", "World"))
print(smile())
Hello World
😊
In [4]:
def print_greeting(name):
    print(f"Hello {name}")
In [5]:
type(print_greeting("Lennart"))
Hello Lennart
Out[5]:
NoneType

Functions: Scopes¶

Inside the function's body, you can create new variables which only live in this local scope:

In [6]:
%%script python --no-raise-error

def concat_strings(string_1, string_2):
    concatenated_string = string_1 + string_2
    return concatenated_string
print(concat_strings("Hello ", "World"))
print(concatenated_string)
Hello World
Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
NameError: name 'concatenated_string' is not defined

Note: It is possible to go the other way round and access outer-scope, global variables from inside a function, but you should never do this unless you have a good reason to do so. In all other cases, always pass all required data explicitly as arguments!
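
To illustrate the note above, here is a minimal sketch contrasting the two styles. The names greeting, greet_global and greet_explicit are made up for this illustration and do not appear elsewhere in the session.

```python
greeting = "Hello"  # global variable (hypothetical example data)

def greet_global(name):
    # Works, but silently depends on a variable defined outside the function
    return f"{greeting} {name}"

def greet_explicit(name, greeting):
    # Everything the function needs is passed in as an argument
    return f"{greeting} {name}"

print(greet_global("Lennart"))
print(greet_explicit("Lennart", "Hi"))
```

The second version is easier to reuse and to test, because its result depends only on its arguments.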

Functions as objects¶

Functions are objects that can be passed around like any other variable/ datatype.

For example, this allows you to pass a function as an argument to another function:

In [52]:
authors = ["Goethe", "Kracht", "Fontane", "Schiller", "Hahn"]

print(sorted(authors))

print(len("Goethe"))
print(sorted(authors, key=len))
['Fontane', 'Goethe', 'Hahn', 'Kracht', 'Schiller']
6
['Hahn', 'Goethe', 'Kracht', 'Fontane', 'Schiller']
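
The same mechanism works for functions you define yourself. The following sketch assumes a small helper called last_letter (a made-up name for this illustration), stores it in a variable, and then passes it to sorted as the key.

```python
authors = ["Goethe", "Kracht", "Fontane", "Schiller", "Hahn"]

def last_letter(name):
    # Returns the final character of a string
    return name[-1]

# Functions can be stored in variables ...
key_function = last_letter
print(key_function("Goethe"))

# ... and handed to other functions, here as the sort key
print(sorted(authors, key=key_function))
```

Note that the function is passed without round brackets: key=last_letter hands over the function itself, whereas last_letter("Goethe") would call it.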

Functions: Default arguments¶

Sometimes you'll end up with a function whose argument(s) usually have the same value but occasionally need to change.

To alleviate this, you can define default values for the arguments.

In [53]:
def product(values, start=1):
    # Use a separate name for the running result so it does not shadow the function name
    result = start
    for value in values:
        result = result * value
    return result
In [54]:
print(product([1, 2, 3]))
print(product([1, 2, 3], start=2))
6
12

You can usually use the argument_name=value syntax to make explicit which value is assigned to which argument.

In [55]:
print(product(values=[1,2,3], start=1))
6

If you do this for all arguments, you can also rearrange the order of the arguments.

Otherwise, their mapping to the arguments is inferred by order of values.
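
As a small sketch of both cases, the snippet below repeats the product function so that it runs on its own, and then calls it once with named arguments in swapped order and once with purely positional arguments.

```python
def product(values, start=1):
    result = start
    for value in values:
        result = result * value
    return result

# All arguments are named, so their order no longer matters
print(product(start=2, values=[1, 2, 3]))

# Without names, values are matched to arguments by position
print(product([1, 2, 3], 2))
```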

List comprehensions¶

When processing data, you often want to apply an operation to all elements in a list, tuple, etc.

In [56]:
len_names = []
for name in authors:
    len_names.append(len(name))
print(len_names)
[6, 6, 7, 8, 4]

Using a list comprehension (sometimes called a list expression), you can shorten these recurring code blocks into a single elegant expression:

In [57]:
len_names = [len(name) for name in authors]
print(len_names)
[6, 6, 7, 8, 4]
In [59]:
" ".join([token.capitalize() for token in "my thesis title".split()])
Out[59]:
'My Thesis Title'

Doing so has several benefits:

  • You shorten your code and improve its readability
  • You avoid side effects: the result is built in a single expression instead of by repeatedly modifying a list
  • List comprehensions are usually somewhat faster than the equivalent for loop with repeated append calls

Filtering in list comprehensions¶

You can also use a trailing if clause to filter out any values that do not satisfy a condition.

In [14]:
even_numbers = [number for number in range(1, 26) if number % 2 == 0]
print(even_numbers)
[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24]

If-else expressions in list comprehensions¶

Using the short inline form of if-else in front of the for keyword, you can choose between two values for each element and thus build more complex expressions.

In [15]:
even_numbers_str = [
    f"{number} is even" if number % 2 == 0 else f"{number} is odd"
    for number in range(1, 26)
]
print(even_numbers_str)
['1 is odd', '2 is even', '3 is odd', '4 is even', '5 is odd', '6 is even', '7 is odd', '8 is even', '9 is odd', '10 is even', '11 is odd', '12 is even', '13 is odd', '14 is even', '15 is odd', '16 is even', '17 is odd', '18 is even', '19 is odd', '20 is even', '21 is odd', '22 is even', '23 is odd', '24 is even', '25 is odd']
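
The two forms can also be combined. In the following sketch (the labels "small" and "large" and the threshold 10 are arbitrary choices for illustration), the trailing if decides which elements are kept, while the inline if-else decides which value each kept element produces.

```python
labels = [
    "small" if number < 10 else "large"  # inline if-else picks the value
    for number in range(1, 26)
    if number % 5 == 0                   # trailing if drops elements
]
print(labels)
```

Only the numbers 5, 10, 15, 20 and 25 pass the filter, so the result has five entries.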

Object oriented programming (OOP)¶

Core intuition¶

Objects store data, and functions that operate on the data.

Example: Strings¶
In [16]:
text = "OOP can be helpful" # <- Data: Sequence of characters
text.upper() # <- `Function` (actually it is called method)
             # that operates on the data of the string in variable `text`
Out[16]:
'OOP CAN BE HELPFUL'

We already learned about Python's most straightforward way to bundle data, the dictionary:

In [17]:
my_dataset = {
    "excavation_ids": ["south_1", "south_2", "south_3", "north_1", "north_2", "east_1", "east_2"],
    "num_artifacts": [1, 3, 4, 10, 9, 2, 5]
}

Now suppose we want to check in which direction we find the most artifacts on average:

In [18]:
def mean_by_direction(excavation_ids, num_artifacts):
    directions = [excavation_id.split("_")[0] for excavation_id in excavation_ids]
    results = {}
    for uniq_direction in set(directions):
        aggregated_num_findings = [
            n_artifacts
            for excavation_id, n_artifacts in zip(excavation_ids, num_artifacts)
            if excavation_id.startswith(uniq_direction)
        ]
        results[uniq_direction] = mean(aggregated_num_findings)
    return results
In [60]:
mean_by_direction(my_dataset["excavation_ids"], my_dataset["num_artifacts"])
Out[60]:
{'south': 2.6666666666666665, 'north': 9.5, 'east': 3.5}

This function is specifically tailored to the dataset, so it wouldn't make much sense to reuse it in other scenarios, but it would be cool to have it every time you interact with the dataset.

But, since we can also pass a function around as we do with other variables, we can also store the function in the dictionary:

In [20]:
excavation_dataset = {
    "excavation_ids": ["south_1", "south_2", "south_3", "north_1", "north_2", "east_1", "east_2"],
    "num_artifacts": [1, 3, 4, 10, 9, 2, 5],
    "func_mean_by_direction": mean_by_direction
}

So now, we can access the function via the dataset-dict.

In [21]:
excavation_dataset["func_mean_by_direction"](
    excavation_dataset["excavation_ids"],
    excavation_dataset["num_artifacts"]
)
Out[21]:
{'south': 2.6666666666666665, 'north': 9.5, 'east': 3.5}

Even though it is - in theory - possible to bundle data and functions this way, it is messy, clunky and overly complicated!

-> Never do it this way

This is one of many problem settings where object-oriented programming shines.

Let's create a custom class representing our dataset:

In [22]:
class ExcavationDataset:
    
    # Constructor of the class, defines how an object is created
    # Its main purpose is to define the way data is stored within the object.
    def __init__(self, excavation_ids, num_artifacts):
        # The self-argument in each method represents the object itself.
        # It is used to access the attributes and methods of the object.
        # If you call a method, it's not necessary to care about the self argument,
        # Python will set it automatically
        # Each following argument is exposed to the outside and must be set when calling the method.
        
        # By assigning fields to self you can create attributes
        # (I.e., internal variables that store data within an object)
        
        self.excavation_ids = excavation_ids
        self.num_artifacts = num_artifacts
    
    # Besides special methods such as the __init__-constructor-method, you can also define custom methods.
    # The mean_by_direction method only operates on the internal data of the objects,
    # so it does not receive any external arguments. 
    def mean_by_direction(self):
        directions = [
            excavation_id.split("_")[0]
            # Using the self-argument you can access attributes and other methods of the object
            for excavation_id in self.excavation_ids 
        ]
        results = {}
        for uniq_direction in set(directions):
            aggregated_num_findings = [
                n_artifacts
                for excavation_id, n_artifacts in zip(self.excavation_ids, self.num_artifacts)
                if excavation_id.startswith(uniq_direction)
            ]
            results[uniq_direction] = mean(aggregated_num_findings)
        return results

Now let's create an object of this class, which stores our data:

In [62]:
dataset = ExcavationDataset(
    excavation_ids=["south_1", "south_2", "south_3", "north_1", "north_2", "east_1", "east_2"],
    num_artifacts=[1, 3, 4, 10, 9, 2, 5]
)
type(dataset)
Out[62]:
__main__.ExcavationDataset

As with dicts, we can access the contents of an object, here using dot notation instead of square brackets:

In [24]:
dataset.excavation_ids
Out[24]:
['south_1', 'south_2', 'south_3', 'north_1', 'north_2', 'east_1', 'east_2']
In [25]:
dataset.num_artifacts
Out[25]:
[1, 3, 4, 10, 9, 2, 5]

And you can call your custom methods the same way you'd call methods of other objects (e.g., strings, dicts, ...):

In [26]:
dataset.mean_by_direction()
Out[26]:
{'south': 2.6666666666666665, 'north': 9.5, 'east': 3.5}

Terminology¶

Class¶

Definition of how to structure and create objects of the same kind (like a blueprint).

Instance/ object¶

Concrete manifestation of a class. Instances contain concrete data. Multiple objects of the same class can be instantiated.

Method¶

Functions that are bound to an object, and can access its data (via the self-argument).

Attribute/ Field¶

Data of an object.

Inheritance¶

Classes can inherit properties (e.g. methods and attributes) from other classes.

Superclass and Subclass¶

If class B inherits properties from class A, class A is called the superclass of B. Conversely, class B is called a subclass (or child class) of A.
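
The terms above can be mapped onto a few lines of code. The classes Text and Poem in this sketch are invented purely for illustration.

```python
class Text:                       # class: the blueprint
    def __init__(self, content):
        self.content = content    # attribute: data stored in the object

    def num_tokens(self):         # method: function bound to the object
        return len(self.content.split())

class Poem(Text):                 # Poem is a subclass of Text, its superclass
    pass

poem = Poem("Kennst du das Land")  # instance / object
print(poem.num_tokens())           # method inherited from Text
print(isinstance(poem, Text))      # a Poem object also counts as a Text
print(issubclass(Poem, Text))
```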

I enrolled for a data science course, why should I care about OOP?¶

  • Most Python libraries are structured using the OOP paradigm, so you will constantly interact with objects. Knowing their structure and rules, you can work more efficiently and concentrate on the things that matter most - the data and your research question(s).

  • If you want to customize things or even newly implement your own ideas, chances are high that you are required to do this in an OOP manner. So you'll need a thorough understanding of classes and the basics of inheritance.

  • Additionally, structuring your code in the OOP paradigm often enables you to store your data in semantically structured objects. Doing so not only spares you hatred from your collaborators but also from your future self since it makes your code much more accessible and understandable.

Example of a statistical model built using OOP:¶

In [27]:
class RegressionModel:
    def __init__(self, learning_rate=1.0, n_iterations=100):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
    
    def predict(self, x):
        y_pred = [self.weight_ * xi + self.bias_ for xi in x]
        return y_pred
        
    def fit(self, x, y):
        self.weight_, self.bias_ = sum(x) / len(x), 0.0
        for iteration in range(self.n_iterations):
            weight_gradient, bias_gradient = 0.0, 0.0
            for xi, yi in zip(x, y):
                weight_gradient += xi * (self.predict([xi])[0] - yi)
                bias_gradient += self.predict([xi])[0] - yi
            self.weight_ -= self.learning_rate * (weight_gradient / len(x)) 
            self.bias_ -= self.learning_rate * (bias_gradient / len(x))
        return self

Now you can easily:

  • Train the model
  • Access its parameters
  • Use it to make predictions on (multiple) other datasets
  • Save it and load it later
  • ...
In [28]:
import matplotlib.pyplot as plt

x, y = [i/100 for i in range(1, 101, 5)], [3-((i/100)) for i in range(1, 101, 5)]

regressor = RegressionModel().fit(x, y)
print(f"Learned weight: {regressor.weight_} and bias: {regressor.bias_}")

x_pred = [i/100 for i in range(-25, 126, 5)]
y_pred = regressor.predict(x_pred)
plt.scatter(x, y)
plt.plot(x_pred, y_pred, color="red")
plt.show()
Learned weight: -0.99751188719238 and bias: 2.998707444486552
In [63]:
import pickle

# Save model to disk
with open("my_model.pickle", "wb") as out_f:
    pickle.dump(regressor, out_f)

# Delete the model from memory
del regressor

# Load it from disk
with open("my_model.pickle", "rb") as in_f:
    regressor = pickle.load(in_f)

print(f"Learned weight: {regressor.weight_} and bias: {regressor.bias_}")
Learned weight: -0.99751188719238 and bias: 2.998707444486552

As you can see in the example above, OOP enables you to bundle some parameters (weights and bias) and the routines that describe how to apply these parameters to the data in one single location. This allows for easy access and interoperability.

OOP: Inheritance¶

An essential pillar of Object-Oriented Programming is the concept of inheritance. Inheritance enables classes (and, therefore, objects) to inherit attributes and methods from other types (=classes). Using this mechanism, you can derive subclasses from another class and only update parts incompatible with its new purpose.

Example: Extended ExcavationDataset class¶

Now we also want to store the depth at which the artifacts were found.

In [30]:
# Recall the ExcavationDataset

class ExcavationDataset:
    def __init__(self, excavation_ids, num_artifacts):
        self.excavation_ids = excavation_ids
        self.num_artifacts = num_artifacts
    
    def mean_by_direction(self):
        directions = [
            excavation_id.split("_")[0]
            for excavation_id in self.excavation_ids 
        ]
        results = {}
        for uniq_direction in set(directions):
            aggregated_num_findings = [
                n_artifacts
                for excavation_id, n_artifacts in zip(self.excavation_ids, self.num_artifacts)
                if excavation_id.startswith(uniq_direction)
            ]
            results[uniq_direction] = mean(aggregated_num_findings)
        return results
In [64]:
# We state the class we want to inherit in round brackets after the class name.
class DepthExcavationDataset(ExcavationDataset):
    def __init__(self, excavation_ids, num_artifacts, depth_of_artifacts):
        # We can leverage the constructor of our superclass
        # since it already defines how to handle the values of excavation_ids and num_artifacts.
        # Calling super() within a method returns a proxy object that forwards method calls to the superclass.
        super().__init__(excavation_ids=excavation_ids, num_artifacts=num_artifacts)

        # We only define how to handle the new arguments. 
        self.depth_of_artifacts = depth_of_artifacts
    
    # Now, we can add new methods to the subclass
    def count_artifacts_depth_range(self, min_depth, max_depth):
        return sum([
            n_artifacts 
            for n_artifacts, depth in zip(self.num_artifacts, self.depth_of_artifacts)
            if min_depth <= depth <= max_depth
        ])
In [65]:
dataset = DepthExcavationDataset(
    excavation_ids=["south_1", "south_2", "south_3", "north_1", "north_2", "east_1", "east_2"],
    num_artifacts=[1, 3, 4, 10, 9, 2, 5],
    depth_of_artifacts=[0.3, 0.2, 1.0, 0.1, 0.2, 0.5, 0.5]
)

Now we can use the new method(s):

In [33]:
dataset.count_artifacts_depth_range(0.1, 0.4)
Out[33]:
23

And all those from the superclass:

In [34]:
dataset.mean_by_direction()
Out[34]:
{'south': 2.6666666666666665, 'north': 9.5, 'east': 3.5}
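
Besides adding new methods, a subclass can also override a method it inherited. The sketch below uses trimmed-down stand-ins for the two classes so that it runs on its own. The summary method is hypothetical and not part of the classes defined above.

```python
class ExcavationDataset:
    # Trimmed-down stand-in for the class defined above
    def __init__(self, excavation_ids, num_artifacts):
        self.excavation_ids = excavation_ids
        self.num_artifacts = num_artifacts

    def summary(self):  # hypothetical method, added only for this example
        return f"{len(self.excavation_ids)} sites, {sum(self.num_artifacts)} artifacts"

class DepthExcavationDataset(ExcavationDataset):
    def __init__(self, excavation_ids, num_artifacts, depth_of_artifacts):
        super().__init__(excavation_ids, num_artifacts)
        self.depth_of_artifacts = depth_of_artifacts

    def summary(self):
        # Override: reuse the superclass result and extend it
        base = super().summary()
        return f"{base}, max depth {max(self.depth_of_artifacts)}"

dataset = DepthExcavationDataset(["south_1", "north_1"], [1, 10], [0.3, 0.1])
print(dataset.summary())
```

Calling summary on a DepthExcavationDataset object now runs the subclass version, which in turn reuses the superclass version via super().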

Addendum: Documentation and type annotations¶

Functions and OOP aim at making your code interoperable and reusable! Still, one mandatory ingredient is missing to ensure that your future self and your colleagues understand what your classes and functions do: Documentation.

Documenting a function¶

In [66]:
from typing import List
from string import punctuation

def tokenize(text: str) -> List:
    """Tokenizes a text and strips all punctuation marks from single tokens.
    
    Args:
        text: The text to tokenize

    Returns:
        The list of tokens 
    """
    # Strip punctuation first, then drop tokens that consisted of punctuation only
    tokens = [token.strip(punctuation) for token in text.split() if token.strip(punctuation)]
    return tokens

print(help(tokenize))

print(tokenize("""

My dear friend,
Documenting my code is a hardly bearable chore....
But I will pull myself together for the noble cause!!!
Yours truly,
A responsible programmer

"""))
Help on function tokenize in module __main__:

tokenize(text: str) -> List
    Tokenizes a text and strips all punctuation marks from single tokens.
    
    Args:
        text: The text to tokenize
    
    Returns:
        The list of tokens

None
['My', 'dear', 'friend', 'Documenting', 'my', 'code', 'is', 'a', 'hardly', 'bearable', 'chore', 'But', 'I', 'will', 'pull', 'myself', 'together', 'for', 'the', 'noble', 'cause', 'Yours', 'truly', 'A', 'responsible', 'programmer']
  • Docstrings concisely describe what your function does, how the input should look, and how the output is structured.
  • Type annotations describe what types your arguments can have, and the data type of the returned values of the function (after the -> arrow)
  • All of that is parsed by the help function and assembled into a human-readable manual of your function.
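
As a rough sketch of where this information lives: the docstring and the annotations are stored on the function object itself, and Python does not check the annotations when the function is called. The function count_tokens is a made-up example.

```python
from typing import List

def count_tokens(texts: List[str]) -> int:
    """Counts the whitespace-separated tokens in a list of texts."""
    return sum(len(text.split()) for text in texts)

# The docstring and the annotations are stored on the function object
print(count_tokens.__doc__)
print(count_tokens.__annotations__)

# Annotations are not enforced at runtime: a tuple is accepted as well
print(count_tokens(("two tokens", "three more tokens")))
```

Type annotations are therefore documentation for humans and tools rather than a safety net at runtime.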

Documenting Classes¶

Luckily, methods within a class can be documented like functions. Additionally, you can add further documentation to describe the class as a whole.

In [36]:
from typing import List, Dict
class ExcavationDataset:
    """This class represents a dataset describing a set of excavation sites.
    Attributes:
        excavation_ids: ID of each individual excavation site of format "<direction>_<num>"
        num_artifacts: The number of found artifacts at each site.
    """
    def __init__(self, excavation_ids: List[str], num_artifacts: List[int]):
        """Constructor of the class.
        Attributes:
            excavation_ids: See class description
            num_artifacts: See class description
        """
        self.excavation_ids = excavation_ids
        self.num_artifacts = num_artifacts
    
    def mean_by_direction(self) -> Dict[str, float]:
        """Returns the mean number of found artifacts per direction.
        """
        directions = [
            excavation_id.split("_")[0]
            for excavation_id in self.excavation_ids 
        ]
        results = {}
        for uniq_direction in set(directions):
            aggregated_num_findings = [
                n_artifacts
                for excavation_id, n_artifacts in zip(self.excavation_ids, self.num_artifacts)
                if excavation_id.startswith(uniq_direction)
            ]
            results[uniq_direction] = mean(aggregated_num_findings)
        return results
In [37]:
help(ExcavationDataset)
Help on class ExcavationDataset in module __main__:

class ExcavationDataset(builtins.object)
 |  ExcavationDataset(excavation_ids: List[str], num_artifacts: List[int])
 |  
 |  This class represents a dataset describing a set of excavation sites.
 |  Attributes:
 |      excavation_ids: ID of each individual excavation site of format "<direction>_<num>"
 |      num_artifacts: The number of found artifacts at each site.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, excavation_ids: List[str], num_artifacts: List[int])
 |      Constructor of the class.
 |      Attributes:
 |          excavation_ids: See class description
 |          num_artifacts: See class description
 |  
 |  mean_by_direction(self) -> Dict[str, float]
 |      Returns the mean number of found artifacts per direction.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

Type annotations¶

Python's typing module provides a large set of prebuilt type-annotations. We won't go through that in detail. If you want to read up on this, you can start with this tutorial.
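
To give a first impression, here is a small sketch that combines a few of the prebuilt annotations (List, Dict, Tuple and Optional). The function most_frequent is invented for this illustration.

```python
from typing import Dict, List, Optional, Tuple

def most_frequent(tokens: List[str]) -> Optional[Tuple[str, int]]:
    """Returns the most frequent token and its count, or None for empty input."""
    counts: Dict[str, int] = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    if not counts:
        return None
    best = max(counts, key=counts.get)
    return best, counts[best]

print(most_frequent(["a", "b", "a"]))
print(most_frequent([]))
```

Optional[...] signals that the function may return None instead of the annotated type, which callers should check for.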