Course Topics

  • General Information

    • What is the Course About?

      The aim of this course is to introduce students of the humanities and social sciences to fundamental concepts of data science and to equip them with practical data science skills. Students will be exposed to a broad range of data science concepts at an introductory level and at a gentle pace. The course will cover the theoretical aspects of fundamental data science methods as well as practical usage scenarios (i.e., research problems) for these methods in the humanities and social sciences. Students will analyze the strengths and limitations of the covered data science methods and critically reflect on their applicability and potential impact in concrete research problems in the humanities and social sciences.

      The practical part of the course will be based on the Data Science Stack of the Python programming language, the most widely used programming language among data scientists. Throughout the course, students will be familiarized with different data analysis building blocks (and the corresponding functionality in Python) and encouraged to creatively combine these individual building blocks to address (potentially complex) research questions from the humanities and social sciences.



    • Learning outcomes

      Upon successful completion of the course, students will be able to design and implement a quantitative, data-driven methodological approach for a given research problem in the humanities. More concretely, students will be able to:

      • Recognize the aspects of a research problem that can be addressed or answered with quantitative data science methods and distinguish them from the aspects that can only be subjected to qualitative (i.e., substantive) analysis;

      • Obtain existing data (e.g., by scraping public content from the web) or collect data from scratch (e.g., via surveys or crowdsourcing) for the specific research problem at hand, with privacy and intellectual property considerations in mind;

      • Identify the most suitable data analysis approach (or combination of approaches) for the research problem at hand (e.g., distinguish problems that require descriptive data analysis from those calling for predictive algorithms);

      • Within each family of data science methods (e.g., predictive approaches), select those that are most suitable given the characteristics of the available data;

      • Select the data cleaning and preparation methods that best correspond to the given type of data, and clean and preprocess the data so that it conforms to the type of input that the selected data analysis algorithms require;

      • Produce and visualize the results of the data analysis;

      • Scrutinize the data science methods, and the results obtained with them, in terms of significance, fairness, and interpretability.


  • Organizational matters


      Language of the Course

      The lectures will be held partly in English and partly in German.

      The material will be in English.

      Questions (in person and in the forum) can be asked in either language.

      Proof of performance

      During the course: active participation and weekly assignments.

      Final examination: As proof of performance, you will conduct a data science research project and document it in a report.

      Your project must cover the following mandatory stages:

      • Defining one or multiple research question(s)
      • Dataset assembly/annotation
      • Data cleaning/preprocessing
      • Dataset description/exploratory analysis
      • Experimental setup
      • Experiments
      • Analysis of results
      • Conclusion

      Your report should be 10-12 pages long (excluding images and plots) and outline all the steps you have taken and, of course, the results.

      If you have any questions, especially about the requirements, or if you have difficulties finding a dataset or topic, contact Lennart (lennart.keller@uni-wuerzburg.de) as soon as possible.

      Schedule and location

      To be announced.

      Additional information

      Since each session contains hands-on parts, students are required to bring their laptops. Unfortunately, the CIP pool machines do not include the necessary software, so they cannot be used as a substitute.

      Furthermore, a working Python stack (most easily set up using Anaconda) has to be installed on your machine.

      If Anaconda is not yet installed on your machine, please install it before our first meeting so that we can quickly resolve any problems together during the first lecture.

      Please also make sure to connect your machine to the university's official Wi-Fi (eduroam), since Bayern-WLAN is remarkably unreliable at times.

  • Part 1: Introduction to Data Science for Humanities

    • Lecture content: What is data science, and why should the humanities care about it? Qualitative vs. quantitative research; What is Computational Humanities research? How can “cultures and societies” be represented as “data”? Scope and content of the course; Organizational aspects of the course (schedule, grading, infrastructure).


      Tutorial content: Introduction to Python and its Data Science Stack; Installing and configuring interactive Python notebooks (Jupyter Notebooks).


      Homework: Programming exercises – the very basics of Python programming: basic data structures (int, float, str, list, dict, tuple) and control structures (if-else, while, for).
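
      For orientation, a minimal sketch of the constructs this homework covers (all values and names are illustrative):

      # Basic data structures
      year = 1922                      # int
      rating = 4.5                     # float
      title = "Ulysses"                # str
      authors = ["Joyce", "Woolf"]     # list
      pages = {"Ulysses": 730}         # dict
      point = (3, 4)                   # tuple

      # Control structures: for, if-else, while
      for author in authors:
          if author == "Joyce":
              print(author, "wrote", title)
          else:
              print(author, "is also in the corpus")

      count = 0
      while count < len(authors):
          count += 1
      print("Number of authors:", count)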

  • Part 2: (Re-)introduction to Python

    • Lecture content: Basic programming paradigms (OOP, functional)

      Tutorial content: Functions, classes, and inheritance; list comprehensions

      Homework: Programming exercises – Data processing with list comprehensions and data modeling with classes
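
      A small illustrative sketch of these concepts (the Document and Poem classes are made up for this example):

      # List comprehension: filter and transform in a single expression
      words = ["Data", "science", "for", "Humanities"]
      lengths = [len(w) for w in words if w[0].isupper()]   # [4, 10]

      # A minimal class hierarchy for modeling data
      class Document:
          def __init__(self, text):
              self.text = text

          def n_tokens(self):
              return len(self.text.split())

      class Poem(Document):   # inheritance: a Poem is a Document
          def n_lines(self):
              return len(self.text.splitlines())

      poem = Poem("Two roads diverged\nin a yellow wood")
      print(poem.n_tokens(), poem.n_lines())   # 7 2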




  • Part 3: Data modeling for data science

    • Lecture content: Vectorization of data – texts and images as vectors of numbers (bag-of-words model, RGB images as 3-dimensional arrays); vectorized operations.

      Tutorial content: A light introduction to NumPy arrays (motto: “From lists to arrays.”)
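
      A short illustration of that motto (the values are made up):

      import numpy as np

      heights = [1.71, 1.64, 1.88]   # a plain Python list
      arr = np.array(heights)        # ...turned into a NumPy array
      print(arr * 100)               # vectorized operation: metres to centimetres
      print(arr.mean())              # aggregation without an explicit loop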

      Homework: Implement a text vectorizer (similar to the CountVectorizer class from scikit-learn) that converts texts into a document-term matrix.
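
      One possible minimal sketch of such a vectorizer (whitespace tokenization and raw counts are simplifying assumptions; scikit-learn's CountVectorizer does considerably more):

      import numpy as np

      def build_document_term_matrix(texts):
          """Toy vectorizer: lowercasing, whitespace tokenization, raw counts."""
          vocabulary = sorted({token for text in texts
                               for token in text.lower().split()})
          index = {token: j for j, token in enumerate(vocabulary)}
          matrix = np.zeros((len(texts), len(vocabulary)), dtype=int)
          for i, text in enumerate(texts):
              for token in text.lower().split():
                  matrix[i, index[token]] += 1
          return matrix, vocabulary

      docs = ["the cat sat", "the cat and the dog"]
      X, vocab = build_document_term_matrix(docs)
      print(vocab)   # ['and', 'cat', 'dog', 'sat', 'the']
      print(X)       # [[0 1 0 1 1] [1 1 1 0 2]]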



  • Part 4: Data Acquisition and Preparation

    • Lecture content: Data collection and data acquisition: designing data collection surveys, collecting publicly available data from the web (scraping, public APIs), crowdsourcing; Types of data: structured, semi-structured, unstructured; Data preparation, preprocessing, and cleaning: error correction, deduplication, normalization, handling missing values; Data privacy and intellectual property rights.

      Tutorial content: Scraping and extracting public content from the web (Python libraries: scrapy and tweepy); Data loading, organization, preparation, formatting, and manipulation (Python library: pandas).
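
      A minimal pandas sketch of typical loading and cleaning steps (the file name and column names are hypothetical):

      import pandas as pd

      # Hypothetical CSV of scraped posts with common data-quality problems
      df = pd.read_csv("posts.csv")
      df = df.drop_duplicates()               # deduplication
      df["text"] = df["text"].str.strip()     # normalization
      df = df.dropna(subset=["author"])       # handling missing values
      print(df.describe(include="all"))       # quick overview of the result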

      Homework: Usage scenario – Correction of optical character recognition (OCR) errors


  • Part 5: Exploratory Analysis 1 – Descriptive Analysis and Visualization

    • Lecture content: Descriptive statistics: univariate data analysis (measures of central tendency, variability, skewness, and kurtosis), bivariate and multivariate data analysis (covariance, correlation); data visualization (scatter and box plots, histograms).

      Tutorial content: Computing means, variances, and correlations (Python libraries: numpy and scipy); Plotting data points and visualizing data analysis results (Python libraries: matplotlib and seaborn).
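
      For illustration, a small sketch combining these libraries on synthetic data:

      import numpy as np
      from scipy import stats
      import matplotlib.pyplot as plt

      rng = np.random.default_rng(42)
      x = rng.normal(loc=50, scale=10, size=200)   # synthetic variable
      y = x + rng.normal(scale=5, size=200)        # a correlated variable

      print("mean:", x.mean(), "variance:", x.var())
      print("Pearson r:", stats.pearsonr(x, y)[0])

      fig, (ax1, ax2) = plt.subplots(1, 2)
      ax1.scatter(x, y)      # bivariate relationship
      ax2.hist(x, bins=20)   # univariate distribution
      plt.show()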

      Homework: Basic (visual) exploratory description of a given dataset.

  • Part 6: Exploratory Analysis 2 – Clustering and Distance Functions

    • Lecture content: Basics of vector spaces (in a practical rather than strictly mathematical sense; e.g., vectors as points in an n-dimensional space) and distance functions; clustering algorithms (k-means, hierarchical clustering, potentially also DBSCAN).

      Tutorial content: Fundamentals and philosophy of the scikit-learn API. Transforming data and using the clustering algorithms.
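
      A minimal sketch of that fit/transform/predict pattern applied to clustering (the two-blob data is synthetic):

      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.preprocessing import StandardScaler

      rng = np.random.default_rng(0)
      # Two synthetic "blobs" of 2-dimensional points
      X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

      X_scaled = StandardScaler().fit_transform(X)   # transform step
      labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_scaled)
      print(labels[:5], labels[-5:])                 # cluster assignments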

      Homework: Usage scenario – Clustering medieval scripts using computer image analysis (and different clustering algorithms).


  • Part 7: Predictive Analysis (A Gentle Introduction to Machine Learning)

    • Lecture content: Basic concepts of (predictive) learning theory: inductive bias, generalization, overfitting; Traditional machine learning models: Logistic Regression, Naïve Bayes, k-Nearest Neighbours, Decision Trees, Support Vector Machines; Data splits and model selection; Evaluation metrics (accuracy, precision, recall) – strengths and weaknesses.

      Tutorial content: Training and evaluating concrete machine learning models on concrete datasets; Performing hyperparameter tuning and feature selection via cross-validation (Python library: scikit-learn).
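
      A compact sketch of this workflow on a built-in toy dataset (the hyperparameter grid is an arbitrary example):

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import accuracy_score
      from sklearn.model_selection import GridSearchCV, train_test_split

      X, y = load_iris(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.3, random_state=0)

      # Hyperparameter tuning via 5-fold cross-validation on the training split
      search = GridSearchCV(LogisticRegression(max_iter=1000),
                            {"C": [0.01, 0.1, 1, 10]}, cv=5)
      search.fit(X_train, y_train)

      pred = search.predict(X_test)   # evaluate on the held-out test split
      print("best params:", search.best_params_,
            "accuracy:", accuracy_score(y_test, pred))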

      Homework: Usage scenario – Music genre classification. Additionally, read a short paper on the implications of employing machine learning models in the humanities: Underwood, T. (2018). Algorithmic Modeling. Abingdon, Oxon; New York, NY: Routledge.

  • Part 8: Text and Language I (Computational Linguistics)

    • Lecture content: Corpus linguistics: frequency distributions and Zipf’s law, lexical association measures, collocation and terminology extraction; Lexico-semantic resources (WordNet); Basics of computational linguistics: morphological normalization (inflectional and derivational morphology), part-of-speech tagging, syntactic parsing; Topic modeling.

      Tutorial content: Extracting collocations (Python libraries: nltk and spacy); Analyzing lexico-semantic knowledge in WordNet (Python library: nltk’s wordnet interface); Topic modeling with latent Dirichlet allocation (Python library: gensim).
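
      For illustration, a minimal collocation-extraction sketch with nltk (using the small genesis sample corpus that ships with nltk):

      import nltk
      from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

      nltk.download("genesis", quiet=True)   # small corpus bundled with nltk
      words = nltk.corpus.genesis.words("english-web.txt")

      finder = BigramCollocationFinder.from_words(words)
      finder.apply_freq_filter(3)            # ignore very rare bigrams
      measures = BigramAssocMeasures()
      # Rank bigrams by pointwise mutual information, a lexical association measure
      print(finder.nbest(measures.pmi, 10))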

      Homework: Usage scenario – Topical exploration of the American “Lost Generation” literature (distant reading).

  • Part 9: Text and Language II (Natural Language Processing)

    • Lecture content: Language modeling and distributional semantics; Sparse and dense text representations for natural language processing; Information extraction – recognizing mentions of entities and events in text.

      Tutorial content: Sparse text representations (Python library: scikit-learn) and dense text representations (Python library: gensim); Extracting named entities from text in different languages (Python library: spacy); Text classification with a traditional model (logistic regression) (Python library: scikit-learn).
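
      A minimal named-entity-recognition sketch with spacy (assumes the en_core_web_sm model has already been downloaded; the example sentence is made up):

      import spacy

      # Requires: python -m spacy download en_core_web_sm
      nlp = spacy.load("en_core_web_sm")
      doc = nlp("Virginia Woolf published Mrs Dalloway in London in 1925.")
      for ent in doc.ents:
          print(ent.text, ent.label_)   # e.g. PERSON, GPE, DATE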

      Homework: Usage scenario – Detecting hate speech in social media posts. Interpreting a (simple linear) model: Which terms are strong indicators of a class?

  • Part 10: Guest lecture