Data Science for Humanities 1¶

Session: (Re-)introduction to Python¶

Part 1: Fundamentals¶

Winter term 22/23¶

Prof. Goran Glavaš, Lennart Keller¶

Goals¶

After this part you'll know about:

  • The basic properties of Python
  • How to define variables
  • Python's basic data types
  • Conditions
  • Loops

Prologue: Before we start, some philosophical poetry 😊¶

In [1]:
import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Python¶

Why is Python so popular?¶

  • The Swiss Army knife of programming languages – It's not optimized for a specific use case.
    • Instead, thanks to its vast ecosystem of third-party libraries, people use Python for a wide variety of tasks.
    • Examples: Data Science (Statistics, Machine Learning, Visualization, etc.), Web Development, Data Conversion, Scripting, and everything else that you can think of (except those tasks where speed is crucial)
  • Very high-level programming language – Agnostic towards the OS and hardware it runs on; ships with most of the standard data and control structures.
  • REPL (read-eval-print loop) – Allows exploratory interaction with the language, making it easy for beginners to take their first steps gradually.

Some properties that help make programming in Python easy¶

  • Simple and concise syntax
    • Indentation is semantically meaningful and required by the language itself
  • Interpreted language: Code is executed by an interpreter at runtime (CPython compiles it to bytecode on the fly, not to machine code ahead of time)
    • No separate compilation step needed; allows immediate feedback
  • Dynamic and strong typing: A variable can be rebound to values of different datatypes, but Python does not silently convert a value between unrelated types (e.g., str and int) unless explicitly told to.
  • Multi-paradigm language: Although the core language design relies heavily on the object-oriented paradigm, Python also supports a subset of functional programming techniques.

Python's syntax, or: Programming with pseudocode¶

In [2]:
def count_word_lengths(text):
    words = text.split(" ")
    all_word_lengths = []
    for word in words:
        word_length = len(word)
        all_word_lengths.append(word_length)
    return all_word_lengths
In [3]:
count_word_lengths(
    "Programming in Python alleviates executing your research ideas very quickly"
)
Out[3]:
[11, 2, 6, 10, 9, 4, 8, 5, 4, 7]

First things first: Hello World¶

In [4]:
print("Hello World")
Hello World

Done 🤗

A little bit of contextual knowledge:¶

print is called a function.

  • Functions (usually) take in arguments (= data) and do something with them.
  • Arguments are passed to the function within the round brackets after the function's name.
  • A function name with trailing round brackets is called a function call
    • Some functions do not require arguments (i.e., data as input), but they still have to be called to execute them!

The print-function takes in data and writes it to the current output.

In [5]:
print(0, 1, 3)
print("A", "B", "C")
print(0, "A", 1, "B", 2, "C")
0 1 3
A B C
0 A 1 B 2 C
In [6]:
from time import time

print("Current unix timestamp is", time()) # <= The time-function does not need any input,
                                           # but still needs round brackets to be invoked!

print("Current unix timestamp is", time)   # Otherwise, the result will be meaningless...
Current unix timestamp is 1667892896.643146
Current unix timestamp is <built-in function time>

Basics: Variables¶

  • Variables are used to store and retrieve data.
  • They consist of a name and a value that is bound to the name
In [7]:
my_data = 0
print(my_data)
0
  • The value of a variable might change but its name is fixed
In [8]:
my_data = 1
print(my_data)
1
  • The value determines the datatype of a variable
    • The type-function takes in any variable and returns the datatype of its current value
In [9]:
print(type(my_data))
<class 'int'>
  • Python allows changing the datatype of a variable (-> dynamic typing)
In [10]:
my_data = "Hello World"
print(type(my_data))
print(my_data)
<class 'str'>
Hello World
  • But it does not change the type unless explicitly forced to do so (-> strong typing)
In [11]:
my_data = "1"
my_number = 1

print(my_data + my_number)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/tx/s3wb7bvx4blgb8xnlhfxvw7h0000gn/T/ipykernel_3250/1373215352.py in <module>
      2 my_number = 1
      3 
----> 4 print(my_data + my_number)

TypeError: can only concatenate str (not "int") to str
In [12]:
print(int(my_data) + my_number)
2
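The conversion also works in the other direction. As a small sketch (reusing the variable names from the cells above; as_sum and as_text are illustrative names), str() turns the number into text so that + concatenates instead of adding:

```python
my_data = "1"
my_number = 1

# int() converts the string to a number, so + performs addition
as_sum = int(my_data) + my_number

# str() converts the number to a string, so + performs concatenation
as_text = my_data + str(my_number)

print(as_sum, type(as_sum))
print(as_text, type(as_text))
```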

Naming conventions for variables¶

As we will see later, there are more things than variables that we can name ourselves.

There are conventions for naming a variable to distinguish variables from those other things:

  • Lower casing
    • By convention, variable names contain only lowercase letters (plus digits and underscores)
# Wrong:
Number = 1

# Right:
number = 1
  • Leading digits are forbidden
    • Variable names must not start with a digit (0-9); unlike the other conventions listed here, this is a syntax rule that Python enforces
# Wrong:
0var = 0

# Right:
zero_var = 0
  • Multi-word separation with underscores
    • Variable names composed of multiple words separate the individual words with underscores
# Totally wrong:
my first var = 0

# Still wrong:
MyFirstVar = 0

# Right:
my_first_var = 0

And most importantly¶

  • ALWAYS (¡ with NO exceptions !) name your variables in a semantically meaningful way. This spares your future self, your collaborators (and your supervisors) the pain of guessing the purpose of your variables.
# Wrong:
var1 = 100
var2 = 20
var3 = var1 * var2

# Right:
number_of_books_sold = 100
price = 20
revenue = number_of_books_sold * price

Datatypes¶

Datatypes determine how you can process your data.

You already saw two datatypes in this notebook: Integers (-> whole numbers) and Strings (-> sequences of characters, i.e., text).

But there are (a few) more datatypes:

Numerical datatypes¶

Integers¶

  • They store whole - positive or negative - numbers.

Floats¶

  • They store - positive or negative - decimal numbers.

Both integers and floats support arithmetic operations:

In [13]:
my_first_integer = 100
my_first_float = 0.1456
print(type(my_first_integer), type(my_first_float))
<class 'int'> <class 'float'>
In [2]:
(1 + 1) * 2 / (2//2)
Out[2]:
4.0
In [14]:
print(1 + 1)
print(2 - 1)
print(5 / 2)
print(5 // 2)
print(2 * 2)
print(5 % 2)
2
1
2.5
2
4
1

As you can see from the example 5 / 2, a computation with two integers can result in a float. In Python 3, the / operator always returns a float, even if both operands are integers and the result is a whole number; use // (floor division) if you need an integer result.

It's also perfectly fine to mix floats and integers in a calculation; the result will always be a float!
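A quick sketch of how the two division operators behave, using made-up variable names:

```python
# "/" is true division and always yields a float
exact_division = 4 / 2

# "//" is floor division and keeps integers as integers
floor_division = 4 // 2

# As soon as one operand is a float, the result is a float as well
mixed = 4 // 2.0

print(exact_division, type(exact_division))
print(floor_division, type(floor_division))
print(mixed, type(mixed))
```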

In [15]:
number_of_books_sold = 100
price = 19.99

revenue = number_of_books_sold * price
print(revenue)
1998.9999999999998
In [16]:
# Even if the result could theoretically be stored as an integer without any loss of information.

number_of_books_sold = 100
price = 20.0

revenue = number_of_books_sold * price
print(revenue)
2000.0
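The result 1998.9999999999998 above is not a bug: floats are stored in binary and cannot represent most decimal fractions exactly. A minimal sketch of two common ways to deal with this, rounding for display and comparing with a tolerance via math.isclose from the standard library:

```python
import math

number_of_books_sold = 100
price = 19.99
revenue = number_of_books_sold * price

# Round to two decimal places, e.g. for display purposes
revenue_rounded = round(revenue, 2)
print(revenue_rounded)

# Compare floats with a tolerance instead of using ==
print(math.isclose(revenue, 1999.0))

# The classic example: 0.1 + 0.2 is not exactly 0.3
sum_is_exact = (0.1 + 0.2 == 0.3)
sum_is_close = math.isclose(0.1 + 0.2, 0.3)
print(sum_is_exact, sum_is_close)
```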

As you already saw, you can also manually typecast your variables.

In [17]:
price = int(19.99)
number_of_books_sold = float(100)
print(price, number_of_books_sold)
19 100.0

Floats can also be properly rounded using the round-function.

In [18]:
approx_price = round(19.99)
approx_number_of_books_sold = round(100.49)
print(approx_price, approx_number_of_books_sold)
20 100
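Two details are worth knowing here: int() simply cuts off the decimal part, and round() resolves exact ties (x.5) by rounding to the nearest even number rather than always rounding up. A short sketch with arbitrary example values:

```python
# int() truncates towards zero
truncated_positive = int(19.99)
truncated_negative = int(-19.99)
print(truncated_positive, truncated_negative)

# round() resolves exact ties (x.5) towards the nearest even integer
ties = [round(0.5), round(1.5), round(2.5), round(3.5)]
print(ties)

# An optional second argument sets the number of decimal places
rounded_price = round(19.987, 2)
print(rounded_price)
```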

Sequence datatypes¶

  • Sequence datatypes store not just a single value but multiple values.
  • Their elements are kept in a fixed order.
  • Sequence datatypes allow accessing single values or multiple values.

Strings¶

  • Strings store a sequence of (Unicode-encoded) characters.
  • They are wrapped in single or double quotes.
In [4]:
message_english = "Hello World. 😊"
message_japanese = 'こんにちは'
message_arabic = "مرحبا بالعالم"
message_ancient_greek = 'Μέγα χαίρετε'
In [6]:
print(
    message_english,
    message_japanese,
    message_arabic,
    message_ancient_greek,
    sep="\n" # What does this mean?
)
Hello World. 😊
こんにちは
مرحبا بالعالم
Μέγα χαίρετε
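The sep argument in the cell above sets the string that print places between its arguments (a single space by default); the related end argument sets what is written after the last argument (a newline by default). A small sketch with made-up example values:

```python
words = ["alpha", "beta", "gamma"]

# Default: arguments are separated by a single space
print(*words)

# sep="\n" puts every argument on its own line
print(*words, sep="\n")

# sep and end can be combined
print(*words, sep=", ", end="!\n")

# The join-method builds the same text as a string instead of printing it
joined = "\n".join(words)
print(joined)
```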

Strings: special characters¶

  • Text control sequences are encoded with special characters (escape sequences starting with a backslash)
  • They mainly encode whitespace other than plain spaces (e.g., line breaks and tabs)
In [21]:
print("Hello.\nDo you know what the \\n character is good for?\n\n\tBest,\n\tA friend of yours")
Hello.
Do you know what the \n character is good for?

	Best,
	A friend of yours
In [22]:
newline = "\n"
tabulator = "\t"
# And some others...
  • Strings with triple quotes allow you to use line breaks or other whitespace directly.
In [23]:
longer_text = """
My dear friend,
It has been a hell of a week!
Let me tell you the story of how I   REALLY   fucked up:
    [...]
Best,
A friend
"""
print(repr(longer_text))
'\nMy dear friend,\nIt has been a hell of a week!\nLet me tell you the story of how I   REALLY   fucked up:\n    [...]\nBest,\nA friend\n'
In [24]:
print(longer_text)
My dear friend,
It has been a hell of a week!
Let me tell you the story of how I   REALLY   fucked up:
    [...]
Best,
A friend

  • It's possible to concatenate two strings into one.
In [8]:
message = "Hello "
name = "Lennart"
print(message + name)
Hello Lennart
  • For longer strings this quickly becomes tedious, so better use f-strings!
In [10]:
print(f"Hello {name} 1 + 1 = {1 + 1}")
Hello Lennart 1 + 1 = 2
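Inside the curly braces of an f-string you can place any expression, and an optional format specification after a colon controls how the value is displayed. A brief sketch with example values chosen for illustration:

```python
name = "Lennart"
price = 19.99
number_of_books_sold = 100

# Any expression can go inside the curly braces, including method calls
greeting = f"Hello {name.upper()}!"

# ".2f" formats a number in fixed-point notation with two decimal places
pretty_revenue = f"Revenue: {number_of_books_sold * price:.2f}"

print(greeting)
print(pretty_revenue)
```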

Working with strings¶

Especially if you do NLP, you often have to process texts stored as strings (tokenizing, cleaning, etc.).

Luckily, Python provides many methods (i.e., functions directly bound to a string) to help you with that!

In [27]:
print(*[m for m in dir("") if not m.startswith("_")], sep="\t")
capitalize	casefold	center	count	encode	endswith	expandtabs	find	format	format_map	index	isalnum	isalpha	isascii	isdecimal	isdigit	isidentifier	islower	isnumeric	isprintable	isspace	istitle	isupper	join	ljust	lower	lstrip	maketrans	partition	removeprefix	removesuffix	replace	rfind	rindex	rjust	rpartition	rsplit	rstrip	split	splitlines	startswith	strip	swapcase	title	translate	upper	zfill
In [28]:
print("var1".removeprefix("var"))
print("My|spacebar|is|broken|😞".split("|"))
print("123".isnumeric())
print("001110000111111".count("1"))
# And so much more!
1
['My', 'spacebar', 'is', 'broken', '😞']
True
9

-> We don't have the time to go through those methods in detail, but you are strongly advised to make yourself familiar with them. They can save you a lot of time!

The best way to go through them is to visit the documentation.
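To give an impression of how these methods combine, here is a minimal sketch of a naive cleaning and tokenizing step. The example text and the choice of punctuation to strip are assumptions for illustration, not a recommended NLP pipeline:

```python
text = "  Hello, World! Hello again.  "

# Method calls can be chained: each one returns a new string
cleaned = (
    text.strip()            # remove leading and trailing whitespace
        .lower()            # normalize casing
        .replace(",", "")   # drop some punctuation marks
        .replace("!", "")
        .replace(".", "")
)

# Without an argument, split() splits on any run of whitespace
tokens = cleaned.split()
hello_count = tokens.count("hello")

print(cleaned)
print(tokens)
print(hello_count)
```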

Lists¶

  • A list is a flexible, dynamic, ordered container for values of any datatype
In [12]:
my_list = [0, "1", 2.0, [3]]
print(my_list)
[0, '1', 2.0, [3]]

Lists can dynamically be changed, by:

  • Adding an element
In [13]:
my_list = []
my_list.append(1)
print(my_list)
[1]
  • Adding multiple elements at once
In [14]:
my_list.extend([2, 3])
print(my_list)
[1, 2, 3]
  • Removing an element (by its position)
In [15]:
my_list.pop(0)
print(my_list)
[2, 3]
In [16]:
# Actually, my_list.pop not only removes the element at position 0 (i.e., the first element in the list)
# But it also returns it so that you can save it in another variable
new_list = [1, 2, 3]
removed_element = new_list.pop(0)
print(removed_element, new_list)
1 [2, 3]
  • Removing an element (by its value)
In [17]:
my_list.remove(2)
print(my_list)
[3]
  • Concatenating two lists into one
In [20]:
my_second_list = [4, 5]
my_super_list = my_list + my_second_list
print(my_super_list)
print(my_list)
print(my_second_list)
[3, 4, 5]
[3]
[4, 5]
  • Sorting the list
In [21]:
my_list = [3, 1, 2]
my_sorted_list = sorted(my_list)
print(my_sorted_list)
print(my_list)
[1, 2, 3]
[3, 1, 2]
In [36]:
my_list = ["Banana", "Ape", "Cat"]
my_list = sorted(my_list)
my_list
Out[36]:
['Ape', 'Banana', 'Cat']
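Besides the sorted-function used above, lists also have a sort-method that works in place. The following sketch contrasts the two and shows the reverse and key arguments; the example values are made up:

```python
numbers = [3, 1, 2]

# sorted() returns a new list and leaves the original untouched
descending = sorted(numbers, reverse=True)

# The sort-method sorts the list in place and returns None
result = numbers.sort()
print(descending, numbers, result)

# Strings are compared character by character, and uppercase letters
# are ordered before lowercase ones
words = ["banana", "Cat", "ape"]
default_order = sorted(words)
print(default_order)

# The key argument changes what is compared, here: the lowercased word
case_insensitive = sorted(words, key=str.lower)
print(case_insensitive)
```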

Tuples¶

  • Tuples behave like lists, but they are immutable
  • Immutable objects can't be changed after their creation
In [37]:
my_tuple = (0, "1", 2.0, [3])
print(my_tuple)
(0, '1', 2.0, [3])

But:

In [38]:
my_tuple.append(4)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/tx/s3wb7bvx4blgb8xnlhfxvw7h0000gn/T/ipykernel_3250/3420063398.py in <module>
----> 1 my_tuple.append(4)

AttributeError: 'tuple' object has no attribute 'append'
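Immutability only concerns the tuple itself, i.e., which objects it contains. A mutable object stored inside a tuple, like the list in my_tuple, can still be modified. A small sketch (it uses the index notation introduced in the next section):

```python
my_tuple = (0, "1", 2.0, [3])

# Replacing an entry of the tuple fails...
try:
    my_tuple[0] = 5
except TypeError as error:
    print("TypeError:", error)

# ...but the list stored inside the tuple can still be changed
my_tuple[3].append(4)
print(my_tuple)

# Tuples can be unpacked into individual variables
first, second, third, fourth = my_tuple
print(first, fourth)
```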

Indexing & Slicing¶

Strings, lists, and tuples share two common ways of accessing their content: Slicing and Indexing

Indexing¶

  • Access single entries of sequence datatypes via their index
In [39]:
my_list = [1, 2, 3, 4, 5, 6]
print(my_list[0], my_list[-1])

my_string = "Hello World. 🙄"
print(my_string[0], my_string[-1])

my_tuple = (1, 2, 3)
print(my_tuple[0], my_tuple[-1])
1 6
H 🙄
1 3

Special case for lists:

  • It is also possible to change an entry of a list via its index
In [40]:
my_list[-1] = 1
print(my_list)
[1, 2, 3, 4, 5, 1]

Slicing¶

  • Slicing enables you to slice out subsequences by stating a start- and end-index
    • The end-index is exclusive, meaning that the slice stops one element before it.
In [41]:
my_list = [1, 2, 3, 4, 5, 6]
print(my_list[0:2], my_list[:2], my_list[-2:])

my_string = "Hello World. 🙄"
print(my_string[0:2], my_string[:2], my_string[-2:])

my_tuple = (1, 2, 3)
print(my_tuple[0:2], my_tuple[:2], my_tuple[-2:])
[1, 2] [1, 2] [5, 6]
He He  🙄
(1, 2) (1, 2) (2, 3)

Once more, a special case for lists:

  • It is also possible to replace a subsequence within a list via slicing
In [42]:
my_list[:3] = [0, 0, 0]
print(my_list)
[0, 0, 0, 4, 5, 6]
  • It is also possible to use a custom step size for your slice
In [43]:
my_list = [1, 2, 3, 4, 5, 6]
print(my_list[:4:2], my_list[::2], my_list[::-1])
[1, 3] [1, 3, 5] [6, 5, 4, 3, 2, 1]
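Indexing and slicing differ in how they treat positions outside the sequence. The sketch below illustrates this, along with the common idiom of copying a list with a full slice:

```python
my_list = [1, 2, 3, 4, 5, 6]

# Indexing past the end raises an IndexError...
try:
    my_list[10]
except IndexError as error:
    print("IndexError:", error)

# ...whereas slices are silently clipped to the available range
clipped = my_list[4:10]
empty = my_list[10:]
print(clipped, empty)

# A full slice creates a (shallow) copy of the list
my_copy = my_list[:]
my_copy[0] = 100
print(my_list[0], my_copy[0])
```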

Other shared properties of sequence datatypes¶

in-keyword to search the content¶

In [22]:
print(1 in [1, 2, 3])
print(2 in (1, 2, 3))
print("Hello" in "Hello World 😊")
True
True
True
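Note that the in-keyword behaves slightly differently for strings than for lists and tuples: for strings it looks for substrings, for the other two it compares whole elements. A short sketch, also showing the negated form not in:

```python
# "not in" is the negated form
missing = 4 not in [1, 2, 3]

# For strings, "in" looks for substrings
substring_found = "World" in "Hello World"

# For lists and tuples, "in" compares whole elements
partial_element_found = "World" in ["Hello World"]
whole_element_found = "Hello World" in ["Hello World"]

print(missing, substring_found, partial_element_found, whole_element_found)
```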

Dictionaries¶

  • Dictionaries provide an easy-to-use data structure for efficiently storing (and, even more importantly, retrieving) key-value pairs.
  • They are Python's implementation of a hash table.

Source: https://khalilstemmler.com/img/blog/data-structures/hash-tables/hash-table.png

In [45]:
my_dictionary = {
   # Key  :  Value 
    "Beer": "Bier",
    "Apple": "Apfel",
    "House": "Haus"
}
  • Using the key you can access the value (in constant time)
In [46]:
print(my_dictionary["House"])
Haus
In [23]:
d  = {0: 1}
d[0]
Out[23]:
1
  • To access the keys, use the keys-method
In [47]:
print(my_dictionary.keys())
dict_keys(['Beer', 'Apple', 'House'])
  • To access the values, use the values-method
In [48]:
print(my_dictionary.values())
dict_values(['Bier', 'Apfel', 'Haus'])
  • To access the pairs, use the items-method
In [49]:
print(my_dictionary.items())
dict_items([('Beer', 'Bier'), ('Apple', 'Apfel'), ('House', 'Haus')])

Unlike lists, tuples, or strings, dictionaries are not sequences. Since Python 3.7 they preserve the order in which pairs were inserted, but it is not possible to access their contents via positional indexing or slicing.
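Like lists, dictionaries can be changed after their creation. A minimal sketch of the most common operations, reusing my_dictionary from above; the added entries are made up for illustration:

```python
my_dictionary = {"Beer": "Bier", "Apple": "Apfel", "House": "Haus"}

# Add a new pair or overwrite an existing one via its key
my_dictionary["Tree"] = "Baum"
my_dictionary["Beer"] = "Pils"

# "in" checks the keys, not the values
key_found = "Tree" in my_dictionary
value_found = "Baum" in my_dictionary
print(key_found, value_found)

# With square brackets, a missing key raises a KeyError...
try:
    my_dictionary["Car"]
except KeyError as error:
    print("KeyError:", error)

# ...while the get-method returns a default value instead
translation = my_dictionary.get("Car", "unknown")

# pop removes a pair and returns its value
removed_value = my_dictionary.pop("House")
print(translation, removed_value, my_dictionary)
```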

Why should I use dictionaries?¶

Dictionaries store a mapping from unique keys to values. They are great for storing a collection of related data and using the keys as labels.

In [50]:
label_encoding = {
    "Polite": 0,
    "Mildly aggressive": 1,
    "Aggressive": 2,
    "Hateful": 3
}
print(label_encoding["Hateful"])
3
In [51]:
dataset = {
    "x": [0, 1, 0, 0, 1, 0, 1],
    "y": [0.5, 1.5, 0.5, 0.5, 1.5, 0.5, 1.5]
}
print(dataset["x"])
[0, 1, 0, 0, 1, 0, 1]

Booleans¶

Booleans are the simplest datatype in Python. Their value range is restricted to two states: True and False.

In [24]:
am_i_right = True
is_this_hard = False
  • Numerical datatypes can be used with the most common arithmetic operators. Booleans can be used with the common logic operators

Quiz time

True and False
True or False
not True
not False
  • Booleans are also the results of other kinds of comparisons
In [25]:
print(0 < 1)
print(1 == 1)
print(0 > 1)
print(0 >= 1)
True
True
False
False
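Logic operators and comparisons are typically combined. The following sketch first evaluates the four quiz expressions and then joins two comparisons with and / or; the revenue value and the variable names are chosen for illustration:

```python
# The four expressions from the quiz
quiz_results = [True and False, True or False, not True, not False]
print(quiz_results)

revenue = 867

# "and" is True only if both sides are True
is_okay_day = revenue > 750 and revenue < 1000

# "or" is True if at least one side is True
is_extreme_day = revenue < 500 or revenue >= 1000

# "not" flips a boolean
is_not_okay_day = not is_okay_day

print(is_okay_day, is_extreme_day, is_not_okay_day)
```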

Conditions¶

Conditions make up one of the pillars of programming. They allow you to use Python not only as a calculator but also to design programs that detect and act on specific configurations of your data.

In Python there are four general types of conditions:

  • Single if-statements: Act only if a certain condition applies; otherwise do nothing.
In [30]:
# [Standard program flow]
revenue = 1000
if revenue > 750:
    print("What a good day!")
    print("Go on!")
print("Closing the shop!")
# [Standard program proceeds]
What a good day!
Go on!
Closing the shop!
  • if-else statements: Execute the if-block if the condition is met; otherwise execute the else-block
In [55]:
# [Standard program flow]
revenue = 1000
if revenue > 1500:
    print("What a good day!")
else:
    print("Alert, we should change something!")
# [Standard program proceeds]
Alert, we should change something!
  • if-elif-else statements: Check multiple conditions in order and execute the block of the first one that is met; otherwise execute the else-block
In [56]:
revenue = 867
if revenue < 500:
    print("Horrendous day")
elif revenue < 750:  # only reached if revenue >= 500, so exactly 500 is covered
    print("Bad day")
elif revenue < 1000:  # only reached if revenue >= 750, so exactly 750 is covered
    print("We did okay")
else:
    print("We did great")
We did okay
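Range checks like the ones above can also be written as chained comparisons, which make both boundaries explicit. The sketch below wraps the rating in a function so it can be tried with several values; rate_day is a hypothetical helper, and assigning the boundary values 500 and 750 to the upper band is an assumption about the intended behavior:

```python
def rate_day(revenue):
    """Hypothetical helper: map a revenue to a verbal rating."""
    if revenue < 500:
        return "Horrendous day"
    elif 500 <= revenue < 750:     # chained comparison: both bounds at once
        return "Bad day"
    elif 750 <= revenue < 1000:
        return "We did okay"
    else:
        return "We did great"


print(rate_day(867))
print(rate_day(500))
print(rate_day(1000))
```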
  • Note that if-elif-else statements finish after the first true condition is met. If you want multiple conditions to be checked sequentially, you can chain various single if statements.
In [57]:
text = "Hello\nmy mobile number is 01234\ngreetings Lennart"

words = text.lower().split()

politeness_score = 0
if words[0] in ["hello", "ciao", ...]:
    # If there is a greeting, increase politeness score
    politeness_score += 1
if words[-2] == "greetings":
    # If there is a salute, also increase politeness score
    politeness_score += 1
if "moron" in words:
    # If there is an insult, we set politeness score to zero..
    politeness_score = 0

print(f"Message:\n'{text}'\nachieved a politeness score of {politeness_score}")
Message:
'Hello
my mobile number is 01234
greetings Lennart'
achieved a politeness score of 2

Loops¶

Often your programs have to iteratively apply operations to your data to achieve the desired results. Python provides two types of loops to repeat operations.

For-loop¶

For-loops serve two purposes:

  • Applying an operation to each element in a sequence.
In [58]:
my_list = [1, 2, 3]
my_new_list = []

for elem in my_list:
    new_element = elem * 2
    my_new_list.append(new_element)

    print(my_new_list)
[2]
[2, 4]
[2, 4, 6]
In [59]:
for character in "Mobile number: 123":
    print(character.isnumeric())
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
True
True
True
  • Repeating something exactly $N$ times.
In [60]:
x = 2
for run in range(8):
    x = x * 2

print(x)
512
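The range-function used above accepts up to three arguments (start, stop, step), where stop is exclusive just like the end-index of a slice. If you need both the position and the element while looping over a sequence, the built-in enumerate provides them as pairs. A brief sketch with example values:

```python
# range(stop) counts from 0 up to, but not including, stop
first_five = list(range(5))

# range(start, stop, step)
even_numbers = list(range(2, 11, 2))
print(first_five, even_numbers)

# enumerate yields (index, element) pairs
words = ["Programming", "in", "Python"]
indexed_words = []
for position, word in enumerate(words):
    indexed_words.append((position, word))
    print(position, word)
```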

While-loops¶

  • While loops are repeated until a certain condition is not satisfied anymore.
In [61]:
x = 10
while x > 0:
    print(x)
    x = x - 1
10
9
8
7
6
5
4
3
2
1
  • Be aware of the possibility of creating endless loops by not updating the variables for the condition appropriately.
x = 10
while x > 0:
    print(x)
    x = x + 1
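If it is not obvious that a loop condition will eventually become false, one defensive pattern is to cap the number of iterations. A minimal sketch: max_iterations is an illustrative name, the limit of 5 is arbitrary, and the break-keyword it relies on is explained in the next section:

```python
x = 10
max_iterations = 5
iterations = 0

while x > 0:
    x = x + 1              # the same faulty update as above: x never reaches 0
    iterations += 1
    if iterations >= max_iterations:
        print("Giving up after", iterations, "iterations")
        break

print(x)
```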

Break the loops¶

  • Sometimes, it might be necessary to leave a loop earlier (e.g., if a special condition applies)
  • Both for- and while-loops can be exited with the break-keyword
In [62]:
from time import sleep
for i in range(10):
    sleep(1)
    do_continue = input("Do you want to keep on waiting for Godot? (y/n) ")
    if do_continue.lower() in ("n", "no"):
        print("Maybe he'll come tomorrow!")
        break
Do you want to keep on waiting for Godot? (y/n) yes
Do you want to keep on waiting for Godot? (y/n) no
Maybe he'll come tomorrow!
  • Using the break-keyword you can also build do-while-like loops
In [63]:
my_number = 1
while True:
    my_number += 1
    if my_number >= 1:
        break
print(my_number)
2
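The defining property of a do-while loop is that its body runs at least once, because the condition is only checked at the end. The sketch below contrasts this with a regular while-loop, which may not run at all; the values are arbitrary:

```python
# do-while-like loop: the body runs before the condition is checked
value = 0.5
halvings = 0
while True:
    value = value / 2
    halvings += 1
    if value < 1:
        break
print(value, halvings)

# Regular while-loop: the condition is checked first, so the body is skipped
other_value = 0.5
other_halvings = 0
while other_value >= 1:
    other_value = other_value / 2
    other_halvings += 1
print(other_value, other_halvings)
```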