Course Topics

  • General Information

    • What is the Course About?

      The aim of this course is to introduce students of the humanities and social sciences to fundamental concepts of data science and to equip them with practical data science skills. Students will be exposed to a broad range of data science concepts at an introductory level and at a gentle pace. The course will cover the theoretical aspects of fundamental data science methods as well as practical usage scenarios (i.e., research problems) for these methods in the humanities and social sciences. Students will analyze the strengths and limitations of the covered data science methods and critically reflect on their applicability and potential impact in concrete research problems in the humanities and social sciences.

      The practical part of the course will be based on the Data Science Stack of the Python programming language, the most widely used programming language among data scientists. Throughout the course, students will be familiarized with different data analysis building blocks (and the corresponding functionality in Python) and encouraged to creatively combine these individual building blocks to address (potentially complex) research questions from the humanities and social sciences.



    • Learning outcomes

      Upon successful completion of the course, students will be able to design and implement a quantitative, data-driven methodological approach for a given research problem in the humanities. More concretely, students will be able to:

      • Recognize the aspects of a research problem that can be addressed or answered with quantitative data science methods and distinguish them from the aspects that can only be subjected to qualitative (i.e., substantive) analysis;

      • Obtain existing data (e.g., by scraping public content from the web) or collect data from scratch (e.g., via surveys or crowdsourcing) for the specific research problem at hand, with privacy and intellectual property considerations in mind;

      • Identify the most suitable data analysis approach (or combination of approaches) for the research problem at hand (e.g., distinguish problems that require descriptive data analysis from those calling for predictive algorithms);

      • Within each family of data science methods (e.g., predictive approaches), select those that are most suitable given the characteristics of the available data;

      • Select the data cleaning and preparation methods that best correspond to the given type of data, and clean and preprocess the data so that it conforms to the type of input that the selected data analysis algorithms require;

      • Produce and visualize the results of the data analysis;

      • Scrutinize the data science methods, and the results obtained with them, in terms of significance, fairness, and interpretability.


  • Organizational matters


      Language of the Course

      The lectures will be held partly in English and partly in German.

      The material will be in English.

      Questions (in person and in the forum) can be asked in either language.

      Proof of performance

      During the course: active participation and weekly assignments.

      Final examination: As proof of performance, you will conduct a data science research project and document it in a report.

      Your project must cover the following mandatory stages:

      • Defining one or multiple research question(s)
      • Dataset assembly/annotation
      • Data cleaning/preprocessing
      • Dataset description/exploratory analysis
      • Experimental setup
      • Experiments
      • Analysis of results
      • Conclusion

      Your report should be 10-12 pages long (excluding images and plots) and outline all the steps you have taken and, of course, the results.

      If you have any questions, especially about the requirements, or if you have difficulties finding a dataset or topic, contact Lennart (lennart.keller@uni-wuerzburg.de) as soon as possible.

      Schedule and location

      To be announced.

      Additional information

      Since each session contains hands-on parts, students are required to bring their laptops. Unfortunately, the CIP pool machines do not include the necessary software, so they cannot be used as a substitute.

      Furthermore, a working Python stack (most easily set up using Anaconda) has to be installed on your machine.

      If Anaconda is not yet installed on your machine, please install it before our first meeting so that we can quickly resolve any problems together during the first lecture.

      Please also make sure to connect your machine to the university's official Wi-Fi (eduroam), since Bayern-WLAN is remarkably unreliable at times.

  • Part 1: Introduction to Data Science for Humanities

    • Lecture content: What is data science, and why should the humanities care about it? Qualitative vs. quantitative research; What is Computational Humanities research? How can “cultures and societies” be represented as “data”? Scope and content of the course; Organizational aspects of the course (schedule, grading, infrastructure).


      Tutorial content: Introduction to Python and its Data Science Stack; Installing and configuring interactive Python notebooks (Jupyter Notebooks).


      Homework: Programming exercises – the very basics of Python programming: basic data structures (int, float, str, list, dict, tuple) and control structures (if-else, while, for).
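
      For orientation, a minimal sketch of the constructs this homework covers (all values and names are illustrative):

      # Basic data structures
      year = 1922                      # int
      rating = 4.5                     # float
      title = "Ulysses"                # str
      authors = ["Joyce", "Woolf"]     # list
      pages = {"Ulysses": 730}         # dict
      point = (3, 4)                   # tuple

      # Control structures: for, if-else, while
      for author in authors:
          if author == "Joyce":
              print(author, "wrote", title)
          else:
              print(author, "is also in the corpus")

      count = 0
      while count < len(authors):
          count += 1
      print("Number of authors:", count)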

  • Part 2: (Re-)introduction to Python

    • Lecture content: Basic programming paradigms (OOP, functional)

      Tutorial content: Functions, classes, and inheritance; list comprehensions

      Homework: Programming exercises – Data processing with list comprehensions and data modeling with classes
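
      A small illustrative sketch of these concepts (the Document and Poem classes are made up for this example):

      # List comprehension: filter and transform in a single expression
      words = ["Data", "science", "for", "Humanities"]
      lengths = [len(w) for w in words if w[0].isupper()]   # [4, 10]

      # A minimal class hierarchy for modeling data
      class Document:
          def __init__(self, text):
              self.text = text

          def n_tokens(self):
              return len(self.text.split())

      class Poem(Document):   # inheritance: a Poem is a Document
          def n_lines(self):
              return len(self.text.splitlines())

      poem = Poem("Two roads diverged\nin a yellow wood")
      print(poem.n_tokens(), poem.n_lines())   # 7 2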




  • Part 3: Data modeling for data science

    • Lecture content: Vectorization of data – texts and images as vectors of numbers (bag-of-words model, RGB images as 3-dimensional arrays); vectorized operations.

      Tutorial content: A light introduction to NumPy arrays (motto: “From lists to arrays.”)
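
      A short illustration of that motto (the values are made up):

      import numpy as np

      heights = [1.71, 1.64, 1.88]   # a plain Python list
      arr = np.array(heights)        # ...turned into a NumPy array
      print(arr * 100)               # vectorized operation: metres to centimetres
      print(arr.mean())              # aggregation without an explicit loop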

      Homework: Implement a text vectorizer (similar to the CountVectorizer class from scikit-learn) that converts texts into a document-term matrix.
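
      One possible minimal sketch of such a vectorizer (whitespace tokenization and raw counts are simplifying assumptions; scikit-learn's CountVectorizer does considerably more):

      import numpy as np

      def build_document_term_matrix(texts):
          """Toy vectorizer: lowercasing, whitespace tokenization, raw counts."""
          vocabulary = sorted({token for text in texts
                               for token in text.lower().split()})
          index = {token: j for j, token in enumerate(vocabulary)}
          matrix = np.zeros((len(texts), len(vocabulary)), dtype=int)
          for i, text in enumerate(texts):
              for token in text.lower().split():
                  matrix[i, index[token]] += 1
          return matrix, vocabulary

      docs = ["the cat sat", "the cat and the dog"]
      X, vocab = build_document_term_matrix(docs)
      print(vocab)   # ['and', 'cat', 'dog', 'sat', 'the']
      print(X)       # [[0 1 0 1 1] [1 1 1 0 2]]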



  • Part 4: Data Acquisition and Preparation

    • Lecture content: Data collection and data acquisition: designing data collection surveys, collecting publicly available data from the web (scraping, public APIs), crowdsourcing; Types of data: structured, semi-structured, unstructured; Data preparation, preprocessing, and cleaning: error correction, deduplication, normalization, handling missing values; Data privacy and intellectual property rights.

      Tutorial content: Scraping and extracting public content from the web (Python libraries: scrapy and tweepy); Data loading, organization, preparation, formatting, and manipulation (Python library: pandas).
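
      A minimal pandas sketch of typical loading and cleaning steps (the file name and column names are hypothetical):

      import pandas as pd

      # Hypothetical CSV of scraped posts with common data-quality problems
      df = pd.read_csv("posts.csv")
      df = df.drop_duplicates()               # deduplication
      df["text"] = df["text"].str.strip()     # normalization
      df = df.dropna(subset=["author"])       # handling missing values
      print(df.describe(include="all"))       # quick overview of the result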

      Homework: Usage scenario – Correction of optical character recognition (OCR) errors


  • Part 5: Exploratory Analysis 1 – Descriptive Analysis and Visualization

    • Lecture content: Descriptive statistics: univariate data analysis (measures of central tendency, variability, skewness, and kurtosis), bivariate and multivariate data analysis (covariance, correlation); data visualization (scatter and box plots, histograms).

      Tutorial content: Computing means, variances, and correlations (Python libraries: numpy and scipy); Plotting data points and visualizing data analysis results (Python libraries: matplotlib and seaborn).
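
      For illustration, a small sketch combining these libraries on synthetic data:

      import numpy as np
      from scipy import stats
      import matplotlib.pyplot as plt

      rng = np.random.default_rng(42)
      x = rng.normal(loc=50, scale=10, size=200)   # synthetic variable
      y = x + rng.normal(scale=5, size=200)        # a correlated variable

      print("mean:", x.mean(), "variance:", x.var())
      print("Pearson r:", stats.pearsonr(x, y)[0])

      fig, (ax1, ax2) = plt.subplots(1, 2)
      ax1.scatter(x, y)      # bivariate relationship
      ax2.hist(x, bins=20)   # univariate distribution
      plt.show()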

      Homework: Basic (visual) exploratory description of a given dataset.

  • Part 6: Exploratory Analysis 2 – Clustering and Distance Functions

    • Lecture content: Basics of vector spaces (in a practical rather than strictly mathematical sense; e.g., vectors as points in an n-dimensional space) and distance functions; clustering algorithms (k-means, hierarchical clustering, potentially also DBSCAN).

      Tutorial content: Fundamentals and philosophy of the scikit-learn API. Transforming data and using the clustering algorithms.
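
      A minimal sketch of that fit/transform/predict pattern applied to clustering (the two-blob data is synthetic):

      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.preprocessing import StandardScaler

      rng = np.random.default_rng(0)
      # Two synthetic "blobs" of 2-dimensional points
      X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

      X_scaled = StandardScaler().fit_transform(X)   # transform step
      labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_scaled)
      print(labels[:5], labels[-5:])                 # cluster assignments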

      Homework: Usage scenario – Clustering medieval scripts using computer image analysis (and different clustering algorithms).


  • Part 7: Predictive Analysis (A Gentle Introduction to Machine Learning)

    • Lecture content: Basic concepts of (predictive) learning theory: inductive bias, generalization, overfitting; Traditional machine learning models: Logistic Regression, Naïve Bayes, k-Nearest Neighbours, Decision Trees, Support Vector Machines; Data splits and model selection; Evaluation metrics (accuracy, precision, recall) – strengths and weaknesses.

      Tutorial content: Training and evaluating concrete machine learning models on concrete datasets; Performing hyperparameter tuning and feature selection via cross-validation (Python library: scikit-learn).
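
      A compact sketch of this workflow on a built-in toy dataset (the hyperparameter grid is an arbitrary example):

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import accuracy_score
      from sklearn.model_selection import GridSearchCV, train_test_split

      X, y = load_iris(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.3, random_state=0)

      # Hyperparameter tuning via 5-fold cross-validation on the training split
      search = GridSearchCV(LogisticRegression(max_iter=1000),
                            {"C": [0.01, 0.1, 1, 10]}, cv=5)
      search.fit(X_train, y_train)

      pred = search.predict(X_test)   # evaluate on the held-out test split
      print("best params:", search.best_params_,
            "accuracy:", accuracy_score(y_test, pred))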

      Homework: Usage scenario – Music genre classification. Additionally, read a short paper on the implications of employing machine learning models in the humanities: Underwood, T. (2018). Algorithmic Modeling. Abingdon, Oxon; New York, NY: Routledge.

  • Part 8: Text and Language I (Computational Linguistics)

    • Lecture content: Corpus linguistics: frequency distributions and Zipf’s law, lexical association measures, collocation and terminology extraction; Lexico-semantic resources (WordNet); Basics of computational linguistics: morphological normalization (inflectional and derivational morphology), part-of-speech tagging, syntactic parsing; Topic modeling.

      Tutorial content: Extracting collocations (Python libraries: nltk and spacy); Analyzing lexico-semantic knowledge in WordNet (Python library: nltk’s wordnet interface); Topic modeling with latent Dirichlet allocation (Python library: gensim).
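
      For illustration, a minimal collocation-extraction sketch with nltk (using the small genesis sample corpus that ships with nltk):

      import nltk
      from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

      nltk.download("genesis", quiet=True)   # small corpus bundled with nltk
      words = nltk.corpus.genesis.words("english-web.txt")

      finder = BigramCollocationFinder.from_words(words)
      finder.apply_freq_filter(3)            # ignore very rare bigrams
      measures = BigramAssocMeasures()
      # Rank bigrams by pointwise mutual information, a lexical association measure
      print(finder.nbest(measures.pmi, 10))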

      Homework: Usage scenario – Topical exploration of the American “Lost Generation” literature (distant reading).

  • Part 9: Text and Language II (Natural Language Processing)

    • Lecture content: Language modeling and distributional semantics; Sparse and dense text representations for natural language processing; Information extraction – recognizing mentions of entities and events in text.

      Tutorial content: Sparse text representations (Python library: scikit-learn) and dense text representations (Python library: gensim); Extracting named entities from text in different languages (Python library: spacy); Text classification with a traditional model (logistic regression) (Python library: scikit-learn).
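
      A minimal named-entity-recognition sketch with spacy (assumes the en_core_web_sm model has already been downloaded; the example sentence is made up):

      import spacy

      # Requires: python -m spacy download en_core_web_sm
      nlp = spacy.load("en_core_web_sm")
      doc = nlp("Virginia Woolf published Mrs Dalloway in London in 1925.")
      for ent in doc.ents:
          print(ent.text, ent.label_)   # e.g. PERSON, GPE, DATE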

      Homework: Usage scenario – Detecting hate speech in social media posts. Interpreting a (simple linear) model: Which terms are strong indicators of a class?

  • Part 10: Guest lecture