Lecture content: Data collection and data acquisition: designing data collection surveys, collecting publicly available data from the web (scraping, public APIs), crowdsourcing;
Types of data: structured, semi-structured, unstructured;
Data preparation, preprocessing, and cleaning: error correction, deduplication, normalization, handling missing values;
Data privacy and intellectual property rights.
Tutorial content: Scraping and extracting public content from the Web (Python libraries: scrapy and tweepy); Data loading, organization, preparation, formatting, and manipulation (Python library: pandas).
Homework: Usage scenario – Correction of optical character recognition (OCR) errors.
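
To give a feel for the homework scenario, the following is a minimal sketch of one possible approach to OCR error correction: replacing each token with its closest match in a list of known words. It is an illustration only, not the prescribed solution. The vocabulary, the function names `correct_token` and `correct_text`, and the similarity cutoff of 0.8 are all assumptions made for this example; only `difflib.get_close_matches` comes from the Python standard library.

```python
import difflib

# Hypothetical vocabulary of known-good words (assumption for this sketch);
# a real task would use a full lexicon for the target language or domain.
VOCABULARY = ["data", "collection", "survey", "privacy", "normalization"]


def correct_token(token, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the token unchanged.

    The token is lowercased before matching, so a corrected token is
    always returned in the vocabulary's (lowercase) form. The cutoff of
    0.8 is an assumed value, not a recommended setting.
    """
    matches = difflib.get_close_matches(
        token.lower(), vocabulary, n=1, cutoff=cutoff
    )
    return matches[0] if matches else token


def correct_text(text):
    """Correct a whitespace-separated string token by token."""
    return " ".join(correct_token(token) for token in text.split())
```

Calling `correct_text` on a string with typical OCR confusions (for example a `0` read in place of an `o`) maps the damaged tokens back to vocabulary words and leaves tokens with no sufficiently close match untouched. A lower cutoff corrects more aggressively but also risks rewriting words that were already correct.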