After this part you'll know about:
Pandas
Structured data is any type of data that is available in a fixed shape and follows a rigorous data model (often in the form of tables).
The shape is determined by the nature of the data, and provides useful metadata.
Entries have a well-defined and fixed datatype
Transaction_ID: int | Product: str | Price: float | Quantity: int |
---|---|---|---|
0 | Beer | 0.89 | 6 |
0 | Chips | 1.99 | 1 |
1 | Milk | 1.20 | 3 |
2 | Bread | 2.55 | 1 |
... | ... | ... | ... |
Unstructured data is data that does not have a well-defined shape (or underlying data model).
Metadata or other characteristics are not explicitly marked.
The content might mix up different modalities (text, audio, images, videos, ...).
Data might be messy and contain errors, or erroneous artifacts.
<html>
<head>
...
</head>
<body>
<h1>Isn't HTML a well-defined data model?</h1>
<p>I mean it has: </p>
<ul>
<li>A <b>fixed</b> set of elements</li>
<li>Additional attributes to mark metadata</li>
</ul>
</body>
</html>
----------------
<html>
<head>
<script type="javascript">
$(document).ready(function() {loadContentForUser();});
</script>
</head>
<body>
<h1>Why HTML isn't a well defined data model:</h1>
<div class="bulletpoints">
<div class="bp align-left">HTML defines which elements to use, but not how to use them</div>
<div>A strict and coherent form is never enforced or even guaranteed.</div>
<div class="bp-last align-center italic">HTML is quickly evolving, and nowadays often used to define a template to render (multimedia) content dynamically</div>
</div>
<img src="../like-button" on-click="updateLikeButton()">
<embed class="ContentForUser videos "></embed>
</body>
</html>
We call everything between unstructured and structured data semi-structured data.
It does not follow a well-defined data model, but contains markers indicating semantic structures.
However, those markers do not enforce a particular datatype.
Structures might change.
messages = [
...
{
"type": "text-message",
"id": 1001,
"timestamp": "Mo 28 Nov 2022 10:56:35 CET",
"from": "Bob",
"to": "Alice",
"content": "Do you have any plans for dinner? 🧐 I found this cool recipe, wanna try it out?",
"attachment": "image_00001.png"
},
{
"type": "text-message",
"id": 1002,
"timestamp": "Mo 28 Nov 2022 10:59:46 CET",
"from": "Alice",
"to": "Bob",
"content": "No, not yet. But that looks delicious! I'm in!"
},
{
"type": "reaction",
"id": 1003,
"timestamp": "Mo 28 Nov 2022 11:01:00 CET",
"from": "Bob",
"to": 1002,
"reaction-type": "👍🏻"
}
...
]
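To see what "markers without enforced datatypes" means in practice, here is a minimal sketch that loads records like the ones above into a DataFrame. The shortened `messages` list is a hypothetical stand-in for the elided data, and the variable names are chosen only for illustration.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the message records shown above.
messages = [
    {"type": "text-message", "id": 1001, "from": "Bob", "to": "Alice",
     "content": "Do you have any plans for dinner?", "attachment": "image_00001.png"},
    {"type": "text-message", "id": 1002, "from": "Alice", "to": "Bob",
     "content": "No, not yet."},
    {"type": "reaction", "id": 1003, "from": "Bob", "to": 1002,
     "reaction-type": "👍🏻"},
]

# Each dict becomes one row; the columns are the union of all keys.
messages_df = pd.DataFrame(messages)

# Keys that are absent in a record show up as missing values.
missing_per_column = messages_df.isna().sum()

# Nothing enforces a datatype: the "to" field holds both names and message ids.
to_types = set(messages_df["to"].map(type))
```

The frame loads without complaint, but the missing keys and the mixed `to` column have to be handled before the data behaves like a proper table.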
DataFrame
Like NumPy's ndarray, there is one essential container class that forms the core of the library: the DataFrame.
Conceptually, you can think of a DataFrame as the Python version of a spreadsheet.
Consider the sales table from above:
Transaction_ID: int | Product: str | Price: float | Quantity: int |
---|---|---|---|
0 | Beer | 0.89 | 6 |
0 | Chips | 1.99 | 1 |
1 | Milk | 1.20 | 3 |
2 | Bread | 2.55 | 1 |
... | ... | ... | ... |
It can directly be represented as a DataFrame:
import pandas as pd
columns = ["Transaction", "Product", "Price", "Quantity"]
data = [
[0, "Beer", 0.89, 6],
[0, "Chips", 1.99, 1],
[1, "Milk", 1.20, 3],
[2, "Bread", 2.55, 1],
]
df = pd.DataFrame(data=data, columns=columns)
df
Transaction | Product | Price | Quantity | |
---|---|---|---|---|
0 | 0 | Beer | 0.89 | 6 |
1 | 0 | Chips | 1.99 | 1 |
2 | 1 | Milk | 1.20 | 3 |
3 | 2 | Bread | 2.55 | 1 |
A DataFrame consists of columns and rows.
Typically, each column represents a feature (or field) of your data, and each row represents a record (or instance) of your dataset.
df.shape, df.columns, df.index
((4, 4), Index(['Transaction', 'Product', 'Price', 'Quantity'], dtype='object'), RangeIndex(start=0, stop=4, step=1))
Pandas offers a wide variety of ways to access your data:
# Access columns
df["Price"]
0    0.89
1    1.99
2    1.20
3    2.55
Name: Price, dtype: float64
# Access rows
df.loc[0]
Transaction       0
Product        Beer
Price          0.89
Quantity          6
Name: 0, dtype: object
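Besides selecting a whole column or row, single cells and filtered subsets can be addressed as well. The following is a small sketch that rebuilds the sales table from above; the variable names `product`, `last_transaction` and `expensive` are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[0, "Beer", 0.89, 6], [0, "Chips", 1.99, 1], [1, "Milk", 1.20, 3], [2, "Bread", 2.55, 1]],
    columns=["Transaction", "Product", "Price", "Quantity"],
)

# Label-based access: row label 2, column label "Product".
product = df.loc[2, "Product"]

# Position-based access: last row, first column.
last_transaction = df.iloc[-1, 0]

# Boolean mask: keep only the rows whose price exceeds 1.50.
expensive = df[df["Price"] > 1.50]
```

`loc` works with labels and `iloc` with integer positions; with the default numerical index both look the same, but they diverge as soon as rows are sorted or dropped.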
There is also a wide variety of additional functions to manipulate your data or to create new aggregated views of it:
# Sorting based on specific columns
df.sort_values(by="Price", ascending=False)
Transaction | Product | Price | Quantity | |
---|---|---|---|---|
3 | 2 | Bread | 2.55 | 1 |
1 | 0 | Chips | 1.99 | 1 |
2 | 1 | Milk | 1.20 | 3 |
0 | 0 | Beer | 0.89 | 6 |
# Aggregating all rows with a specific value
df.groupby("Transaction")["Price"].sum()
Transaction
0    2.88
1    1.20
2    2.55
Name: Price, dtype: float64
df.groupby("Product")["Quantity"].mean()
Product
Beer     6.0
Bread    1.0
Chips    1.0
Milk     3.0
Name: Quantity, dtype: float64
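Aggregations are often combined with derived columns. As a sketch, the snippet below computes the revenue per transaction for the same sales table; the column name `Revenue` and the result names `items` and `revenue` are introduced here only for illustration.

```python
import pandas as pd

sales = pd.DataFrame(
    data=[[0, "Beer", 0.89, 6], [0, "Chips", 1.99, 1], [1, "Milk", 1.20, 3], [2, "Bread", 2.55, 1]],
    columns=["Transaction", "Product", "Price", "Quantity"],
)

# Derived column: element-wise product of two existing columns.
sales["Revenue"] = sales["Price"] * sales["Quantity"]

# Named aggregation: one output column per (input column, function) pair.
per_transaction = sales.groupby("Transaction").agg(
    items=("Quantity", "sum"),
    revenue=("Revenue", "sum"),
)
```

Summing `Price` alone, as in the cell above, ignores how many units were bought; multiplying by `Quantity` first gives the amount actually paid per transaction.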
# Counting unique values within columns
df.Transaction.value_counts()
0    2
1    1
2    1
Name: Transaction, dtype: int64
# Saving data in a variety of commonly used formats
df.to_csv("sales.csv", index=False)
df.to_excel("sales.xlsx", index=False)
# Loading data in various formats
df = pd.read_excel("sales.xlsx")
df
Transaction | Product | Price | Quantity | |
---|---|---|---|---|
0 | 0 | Beer | 0.89 | 6 |
1 | 0 | Chips | 1.99 | 1 |
2 | 1 | Milk | 1.20 | 3 |
3 | 2 | Bread | 2.55 | 1 |
And so much more!
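A quick way to convince yourself that saving and loading preserves the data is a round trip. The sketch below writes the sales table to an in-memory buffer instead of a file (an assumption made here so that the example leaves no files behind) and reads it back.

```python
import io

import pandas as pd

df = pd.DataFrame(
    data=[[0, "Beer", 0.89, 6], [0, "Chips", 1.99, 1], [1, "Milk", 1.20, 3], [2, "Bread", 2.55, 1]],
    columns=["Transaction", "Product", "Price", "Quantity"],
)

# Write the table as CSV text; index=False omits the row labels.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and load the CSV text again.
buffer.seek(0)
restored = pd.read_csv(buffer)

round_trip_ok = restored.equals(df)
```

Without `index=False`, the row labels would be written as an extra unnamed column and show up again when loading.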
Pandas
Learning Pandas (even more so than NumPy) is best done while you work on your own projects.
Going through all of its features in this course would be too extensive, so we'll introduce additional features as we go.
You are strongly encouraged to stay curious and keep googling "How to do xy in Pandas?" 😊
But there are some great resources to start digging deeper:
The vast majority of real-world data has to be preprocessed before you can start analyzing it.
Common preprocessing steps include, but are not limited to, operations like:
Let's go through them in detail.
We'll use a dataset containing the participation of countries in the Winter and Summer Olympics.
df = pd.read_csv("olympics.csv")
print(df.shape)
df
(150, 1)
0;1;2;3;4;5;6;7;8;9;10;11;12;13;14;15 | |
---|---|
0 | ;? Summer;01 !;02 !;03 !;Total;? Winter;01 !;0... |
1 | Afghanistan (AFG);13;0;0;2;;0;0;0;0;0;13;0;0;2;2 |
2 | Algeria (ALG);12;5;2;8;15;3;0;0;0;0;15;5;2;8;15 |
3 | Argentina (ARG);23;18;24;28;70;18;0;0;0;0;41;1... |
4 | Armenia (ARM);5;1;2;9;12;6;0;0;0;0;11;1;2;9;12 |
... | ... |
145 | Independent Olympic Participants (IOP) ;1;0;1;... |
146 | Zambia (ZAM) ;12;0;1;1;2;0;0;0;0;0;12;0;1;1;2 |
147 | Zimbabwe (ZIM) ;12;3;4;1;8;1;0;;0;0;13;3;4;1;8 |
148 | Mixed team (ZZX) ;3;8;5;4;17;0;0;0;0;0;3;8;5;4;17 |
149 | Totals;27;4809;4775;5130;14714;22;959;958;948;... |
150 rows × 1 columns
As you can see, even though the file is a CSV table (comma-separated values), it is not loaded correctly.
The reason is that the file's authors used semicolons instead of commas as separators.
Luckily, we can specify a custom separator in Pandas' read_csv function:
df = pd.read_csv("olympics.csv", sep=";")
df
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | ? Summer | 01 ! | 02 ! | 03 ! | Total | ? Winter | 01 ! | 02 ! | 03 ! | Total | ? Games | 01 ! | 02 ! | 03 ! | Combined total |
1 | Afghanistan (AFG) | 13 | 0 | 0 | 2 | NaN | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
2 | Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
3 | Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
4 | Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
145 | Independent Olympic Participants (IOP) | 1 | 0 | 1 | 2 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 3 |
146 | Zambia (ZAM) | 12 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 1 | 1 | 2 |
147 | Zimbabwe (ZIM) | 12 | 3 | 4 | 1 | 8 | 1 | 0 | NaN | 0 | 0 | 13 | 3 | 4 | 1 | 8 |
148 | Mixed team (ZZX) | 3 | 8 | 5 | 4 | 17 | 0 | 0 | 0 | 0 | 0 | 3 | 8 | 5 | 4 | 17 |
149 | Totals | 27 | 4809 | 4775 | 5130 | 14714 | 22 | 959 | 958 | 948 | 2865 | 49 | 5768 | 5733 | 6078 | 17579 |
150 rows × 16 columns
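Now it looks much better, but it is also apparent that the table's header wasn't correctly identified: the real column names ended up in the first data row.
Let's fix this by telling Pandas that the header is on the second line of the file (header=1):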
df = pd.read_csv("olympics.csv", sep=";", header=1)
df
Unnamed: 0 | ? Summer | 01 ! | 02 ! | 03 ! | Total | ? Winter | 01 !.1 | 02 !.1 | 03 !.1 | Total.1 | ? Games | 01 !.2 | 02 !.2 | 03 !.2 | Combined total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan (AFG) | 13 | 0 | 0 | 2 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0 | 13 | 0 | 0.0 | 2 | 2 |
1 | Algeria (ALG) | 12 | 5 | 2 | 8 | 15.0 | 3 | 0.0 | 0.0 | 0.0 | 0 | 15 | 5 | 2.0 | 8 | 15 |
2 | Argentina (ARG) | 23 | 18 | 24 | 28 | 70.0 | 18 | 0.0 | 0.0 | 0.0 | 0 | 41 | 18 | 24.0 | 28 | 70 |
3 | Armenia (ARM) | 5 | 1 | 2 | 9 | 12.0 | 6 | 0.0 | 0.0 | 0.0 | 0 | 11 | 1 | 2.0 | 9 | 12 |
4 | Australasia (ANZ) | 2 | 3 | 4 | 5 | 12.0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 2 | 3 | 4.0 | 5 | 12 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
144 | Independent Olympic Participants (IOP) | 1 | 0 | 1 | 2 | 3.0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 1.0 | 2 | 3 |
145 | Zambia (ZAM) | 12 | 0 | 1 | 1 | 2.0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 12 | 0 | 1.0 | 1 | 2 |
146 | Zimbabwe (ZIM) | 12 | 3 | 4 | 1 | 8.0 | 1 | 0.0 | NaN | 0.0 | 0 | 13 | 3 | 4.0 | 1 | 8 |
147 | Mixed team (ZZX) | 3 | 8 | 5 | 4 | 17.0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 3 | 8 | 5.0 | 4 | 17 |
148 | Totals | 27 | 4809 | 4775 | 5130 | 14714.0 | 22 | 959.0 | 958.0 | 948.0 | 2865 | 49 | 5768 | 5733.0 | 6078 | 17579 |
149 rows × 16 columns
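The effect of `sep` and `header` can be reproduced on a tiny example. The string `raw` below is a hypothetical miniature of the file layout (a numbered first line, the real header on the second line, semicolons as separators), not the actual file content.

```python
import io

import pandas as pd

# Hypothetical miniature of the file layout described above.
raw = """0;1;2;3
;01 !;02 !;01 !
Afghanistan (AFG);0;0;0
Algeria (ALG);5;2;0
"""

# sep=";" splits on semicolons, header=1 takes the second line as column names.
mini = pd.read_csv(io.StringIO(raw), sep=";", header=1)

column_names = list(mini.columns)
```

The empty first header cell is named `Unnamed: 0`, and the repeated name `01 !` receives a numeric suffix so that every column can be addressed unambiguously.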
df.columns
Index(['Unnamed: 0', '? Summer', '01 !', '02 !', '03 !', 'Total', '? Winter', '01 !.1', '02 !.1', '03 !.1', 'Total.1', '? Games', '01 !.2', '02 !.2', '03 !.2', 'Combined total'], dtype='object')
As you can see, Pandas automatically added suffixes (.1, .2) to some column names, because duplicate column names are made unique when the file is read.
This makes the already confusing names even less clear, so let's rename them:
column_rename_map = {
'01 !': 'Gold-Summer',
'01 !.1': 'Gold-Winter',
'01 !.2': 'Gold-Total',
'02 !': 'Silver-Summer',
'02 !.1': 'Silver-Winter',
'02 !.2': 'Silver-Total',
'03 !': 'Bronze-Summer',
'03 !.1': 'Bronze-Winter',
'03 !.2': 'Bronze-Total',
'? Games': 'Participation-Total',
'? Summer': 'Participation-Summer',
'? Winter': 'Participation-Winter',
'Combined total': 'Combined-Medals',
'Total': 'Combined-Medals-Summer',
'Total.1': 'Combined-Medals-Winter',
'Unnamed: 0': 'Country'
}
df = df.rename(column_rename_map, axis=1)
df
Country | Participation-Summer | Gold-Summer | Silver-Summer | Bronze-Summer | Combined-Medals-Summer | Participation-Winter | Gold-Winter | Silver-Winter | Bronze-Winter | Combined-Medals-Winter | Participation-Total | Gold-Total | Silver-Total | Bronze-Total | Combined-Medals | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan (AFG) | 13 | 0 | 0 | 2 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0 | 13 | 0 | 0.0 | 2 | 2 |
1 | Algeria (ALG) | 12 | 5 | 2 | 8 | 15.0 | 3 | 0.0 | 0.0 | 0.0 | 0 | 15 | 5 | 2.0 | 8 | 15 |
2 | Argentina (ARG) | 23 | 18 | 24 | 28 | 70.0 | 18 | 0.0 | 0.0 | 0.0 | 0 | 41 | 18 | 24.0 | 28 | 70 |
3 | Armenia (ARM) | 5 | 1 | 2 | 9 | 12.0 | 6 | 0.0 | 0.0 | 0.0 | 0 | 11 | 1 | 2.0 | 9 | 12 |
4 | Australasia (ANZ) | 2 | 3 | 4 | 5 | 12.0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 2 | 3 | 4.0 | 5 | 12 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
144 | Independent Olympic Participants (IOP) | 1 | 0 | 1 | 2 | 3.0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 1.0 | 2 | 3 |
145 | Zambia (ZAM) | 12 | 0 | 1 | 1 | 2.0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 12 | 0 | 1.0 | 1 | 2 |
146 | Zimbabwe (ZIM) | 12 | 3 | 4 | 1 | 8.0 | 1 | 0.0 | NaN | 0.0 | 0 | 13 | 3 | 4.0 | 1 | 8 |
147 | Mixed team (ZZX) | 3 | 8 | 5 | 4 | 17.0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 3 | 8 | 5.0 | 4 | 17 |
148 | Totals | 27 | 4809 | 4775 | 5130 | 14714.0 | 22 | 959.0 | 958.0 | 948.0 | 2865 | 49 | 5768 | 5733.0 | 6078 | 17579 |
149 rows × 16 columns
Now the columns' content is much less opaque.
Let's look at the details:
The .rename method receives a mapping from the old names to the new names as input and applies it along the given axis.
axis=0 would rename the index (not useful for us, because for now we just work with a simple numerical index).
axis=1 applies the renaming to the column names.
Also, note that the method returns a renamed copy of the DataFrame it was called on. You have to explicitly store the result (in the old or a new variable) to retain the changes to your data.
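The copy semantics are easy to verify on a small frame. In this sketch, `medals` is a hypothetical two-column table, and `columns=` is used as an equivalent spelling of passing the mapping with `axis=1`.

```python
import pandas as pd

# Hypothetical two-column table with the cryptic original names.
medals = pd.DataFrame({"01 !": [0, 5], "02 !": [0, 2]})

# rename returns a new DataFrame; `medals` itself keeps its old column names.
renamed = medals.rename(columns={"01 !": "Gold-Summer", "02 !": "Silver-Summer"})

old_names = list(medals.columns)
new_names = list(renamed.columns)
```

Names that do not appear in the mapping are left unchanged, so a partial mapping is fine.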
Another important step while cleaning your data is to atomize your features, meaning that each cell should contain only a single value.
For the numerical parts of our dataset this is already the case, but if we look at the Country column, we see that it contains both the country's full name and its country code.
Let's disentangle them:
df.Country
0 Afghanistan (AFG) 1 Algeria (ALG) 2 Argentina (ARG) 3 Armenia (ARM) 4 Australasia (ANZ) ... 144 Independent Olympic Participants (IOP) 145 Zambia (ZAM) 146 Zimbabwe (ZIM) 147 Mixed team (ZZX) 148 Totals Name: Country, Length: 149, dtype: object
Because the Country column contains only strings, Pandas offers us a convenient set of string operations via the .str accessor (e.g., the methods of Python's str that you are already familiar with), which we can apply directly to the values.
df["Code"] = df["Country"].str.findall(r"\(.+?\)").apply(lambda results: results[0].strip("()") if results else "")
df["Country"] = df["Country"].str.split(" ").apply(lambda parts: parts[0])
df
Country | Participation-Summer | Gold-Summer | Silver-Summer | Bronze-Summer | Combined-Medals-Summer | Participation-Winter | Gold-Winter | Silver-Winter | Bronze-Winter | Combined-Medals-Winter | Participation-Total | Gold-Total | Silver-Total | Bronze-Total | Combined-Medals | Code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 13 | 0 | 0 | 2 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0 | 13 | 0 | 0.0 | 2 | 2 | AFG |
1 | Algeria | 12 | 5 | 2 | 8 | 15.0 | 3 | 0.0 | 0.0 | 0.0 | 0 | 15 | 5 | 2.0 | 8 | 15 | ALG |
2 | Argentina | 23 | 18 | 24 | 28 | 70.0 | 18 | 0.0 | 0.0 | 0.0 | 0 | 41 | 18 | 24.0 | 28 | 70 | ARG |
3 | Armenia | 5 | 1 | 2 | 9 | 12.0 | 6 | 0.0 | 0.0 | 0.0 | 0 | 11 | 1 | 2.0 | 9 | 12 | ARM |
4 | Australasia | 2 | 3 | 4 | 5 | 12.0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 2 | 3 | 4.0 | 5 | 12 | ANZ |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
144 | Independent | 1 | 0 | 1 | 2 | 3.0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 1.0 | 2 | 3 | IOP |
145 | Zambia | 12 | 0 | 1 | 1 | 2.0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 12 | 0 | 1.0 | 1 | 2 | ZAM |
146 | Zimbabwe | 12 | 3 | 4 | 1 | 8.0 | 1 | 0.0 | NaN | 0.0 | 0 | 13 | 3 | 4.0 | 1 | 8 | ZIM |
147 | Mixed | 3 | 8 | 5 | 4 | 17.0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 3 | 8 | 5.0 | 4 | 17 | ZZX |
148 | Totals | 27 | 4809 | 4775 | 5130 | 14714.0 | 22 | 959.0 | 958.0 | 948.0 | 2865 | 49 | 5768 | 5733.0 | 6078 | 17579 |
149 rows × 17 columns
Let's break down what we just did here:
First, we created a new column called Code.
To fill this column, we used a regular expression to find all parentheses and their content in the Country column.
Because the findall method returns a list of matches, we had to extract the codes from these lists using the apply method of the column.
apply receives a function as an argument, applies it to each entry in the series, and returns a new series containing all the results.
We defined an anonymous lambda function here, which either returns the first entry of the list (with the parentheses stripped) or an empty string if the list is empty.
We needed this special case because the last entry in our dataset is a row with the total values and thus does not have a country code.
Second, we updated the original Country column by splitting it on spaces and keeping only the first part as the new value, thereby removing the country codes from this column.
Note that this also truncates names consisting of several words: "Mixed team" becomes "Mixed" and "Independent Olympic Participants" becomes "Independent".
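If multi-word names should survive, one possible alternative (a sketch, not the approach used in the cells above) is to cut off only the trailing code in parentheses instead of splitting on spaces. The series `countries` holds a few entries in the style of the dataset; the regular expressions are assumptions about that style.

```python
import pandas as pd

# A few entries in the style of the Country column, some with trailing spaces.
countries = pd.Series([
    "Afghanistan (AFG)",
    "Independent Olympic Participants (IOP) ",
    "Mixed team (ZZX) ",
    "Totals",
])

# Capture the text inside a trailing pair of parentheses; rows without one get "".
codes = countries.str.extract(r"\(([^()]+)\)\s*$", expand=False).fillna("")

# Remove that trailing "(CODE)" part and any surrounding whitespace from the name.
names = countries.str.replace(r"\s*\([^()]+\)\s*$", "", regex=True).str.strip()
```

Both operations are vectorized string methods, so no `apply` with a lambda is needed, and the row without a code is handled by `fillna`.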
Now that our dataset is correctly loaded and suitably named, let's check whether the content itself is clean.
First, we check whether there are any duplicates and, if so, drop them.
The identifying information of each row is the country, for which we use the Code column, because the numbers of participations or medals may legitimately contain duplicate values.
Whenever we remove entries from our data, it's always good advice to manually check what we are removing!
df.shape, df.drop_duplicates(subset=["Code"]).shape
((149, 17), (148, 17))
We see that one country appears twice in our dataset. Let's search for it:
df["Code"].value_counts()
UKR 2 AFG 1 POR 1 PAK 1 PAN 1 .. GBR 1 GRE 1 GRN 1 GUA 1 1 Name: Code, Length: 148, dtype: int64
Ukraine appears twice, so let's see whether the values are the same, too!
df.query("Code == 'UKR'")
Country | Participation-Summer | Gold-Summer | Silver-Summer | Bronze-Summer | Combined-Medals-Summer | Participation-Winter | Gold-Winter | Silver-Winter | Bronze-Winter | Combined-Medals-Winter | Participation-Total | Gold-Total | Silver-Total | Bronze-Total | Combined-Medals | Code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
134 | Ukraine | 5 | 33 | 27 | 55 | 115.0 | 6 | 2.0 | 1.0 | 4.0 | 7 | 11 | 35 | 28.0 | 59 | 122 | UKR |
135 | Ukraine | 5 | 33 | 27 | 55 | 115.0 | 6 | 2.0 | 1.0 | 4.0 | 7 | 11 | 35 | 28.0 | 59 | 122 | UKR |
Luckily this is the case, so we can just keep the first entry and remove the second. (This is exactly what the .drop_duplicates method does by default.)
df = df.drop_duplicates("Code")
df.query("Code == 'UKR'")
Country | Participation-Summer | Gold-Summer | Silver-Summer | Bronze-Summer | Combined-Medals-Summer | Participation-Winter | Gold-Winter | Silver-Winter | Bronze-Winter | Combined-Medals-Winter | Participation-Total | Gold-Total | Silver-Total | Bronze-Total | Combined-Medals | Code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
134 | Ukraine | 5 | 33 | 27 | 55 | 115.0 | 6 | 2.0 | 1.0 | 4.0 | 7 | 11 | 35 | 28.0 | 59 | 122 | UKR |
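The inspection step can also be done without knowing the duplicated code in advance. The following sketch uses a hypothetical four-row table `teams`; `duplicated(keep=False)` marks every member of a duplicate group so that all copies can be compared before anything is dropped.

```python
import pandas as pd

# Hypothetical miniature of the dataset with one repeated country.
teams = pd.DataFrame({
    "Country": ["Togo", "Ukraine", "Ukraine", "Zambia"],
    "Code": ["TOG", "UKR", "UKR", "ZAM"],
    "Gold-Total": [0, 35, 35, 0],
})

# keep=False marks every member of a duplicate group, not just the repeats.
suspects = teams[teams.duplicated(subset=["Code"], keep=False)]

# If the suspects collapse to a single row, they are identical in all columns.
identical = suspects.drop_duplicates().shape[0] == 1

# keep="first" (the default) retains the first occurrence of each code.
deduplicated = teams.drop_duplicates(subset=["Code"], keep="first")
```

If `identical` were `False`, the copies would disagree in at least one column and would need a manual decision instead of a blind drop.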
Often, your data will be incomplete.
Missing values have to be handled carefully, both to ensure the integrity of the data and to avoid throwing away large swaths of the dataset.
As a rule, with only a few valid exceptions, you should reconstruct only values that can be correctly deduced from other data.
Entries with other missing values should be removed or manually corrected.
In our case, we can relatively safely repair missing values in the Combined-* fields by recomputing them from the other fields.
We can also potentially fill in the other fields under certain conditions.
But first, let's check how many rows have missing values:
Pandas uses the special value np.nan (i.e., Not a Number) to indicate missing values, much like Python uses None.
In contrast to None, nan values can be used in computations without raising an exception.
But any computation of the form values <operator> np.nan will always result in another nan value.
So whenever you receive nan values as a result, go back to your data preprocessing stage and check whether you handled all missing values.
import numpy as np
values = np.arange(5)
values = values + np.nan
values
array([nan, nan, nan, nan, nan])
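Two further properties of nan are worth knowing. The sketch below uses a hypothetical three-element series `values` to show that nan cannot be found with `==` and that pandas reductions, unlike plain NumPy ones, skip missing values by default.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 2.0])

# nan never compares equal to anything, not even to itself ...
nan_equals_itself = np.nan == np.nan

# ... so missing values have to be detected with isna().
mask = values.isna()

# A plain NumPy sum propagates nan, a pandas sum skips it (skipna=True).
numpy_total = values.to_numpy().sum()
pandas_total = values.sum()
```

Because pandas skips missing values silently in reductions such as `sum` or `mean`, a plausible-looking result does not prove that the data is complete.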
Now let's check how many incomplete rows we have:
df.shape[0] - df.dropna().shape[0]
4
If dropna is called without any additional arguments, it removes every row in the DataFrame that contains one or more nan values.
As we can see, we have four rows with incomplete entries.
Let's dig a little deeper and look at those cases:
df[df.isna().any(axis=1)]
Country | Participation-Summer | Gold-Summer | Silver-Summer | Bronze-Summer | Combined-Medals-Summer | Participation-Winter | Gold-Winter | Silver-Winter | Bronze-Winter | Combined-Medals-Winter | Participation-Total | Gold-Total | Silver-Total | Bronze-Total | Combined-Medals | Code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 13 | 0 | 0 | 2 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0 | 13 | 0 | 0.0 | 2 | 2 | AFG |
110 | Senegal | 13 | 0 | 1 | 0 | 1.0 | 5 | 0.0 | 0.0 | NaN | 0 | 18 | 0 | 1.0 | 0 | 1 | SEN |
128 | Togo | 9 | 0 | 0 | 1 | 1.0 | 1 | NaN | 0.0 | 0.0 | 0 | 10 | 0 | NaN | 1 | 1 | TOG |
146 | Zimbabwe | 12 | 3 | 4 | 1 | 8.0 | 1 | 0.0 | NaN | 0.0 | 0 | 13 | 3 | 4.0 | 1 | 8 | ZIM |
To find those rows, we used two techniques that reveal Pandas' NumPy backend.
The isna method returns a mask of the dataset in which each cell that contains nan is marked with True and all other cells with False.
df.isna()
Country | Participation-Summer | Gold-Summer | Silver-Summer | Bronze-Summer | Combined-Medals-Summer | Participation-Winter | Gold-Winter | Silver-Winter | Bronze-Winter | Combined-Medals-Winter | Participation-Total | Gold-Total | Silver-Total | Bronze-Total | Combined-Medals | Code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
144 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
145 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
146 | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False |
147 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
148 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
148 rows × 17 columns
To find the rows that contain at least one nan value, we can use the any method to reduce the mask within each row (axis=1).
The any method simply checks whether any of the values along this dimension is True:
df.isna().any(axis=1), df.isna().any(axis=1).shape
(0 True 1 False 2 False 3 False 4 False ... 144 False 145 False 146 True 147 False 148 False Length: 148, dtype: bool, (148,))
Finally, we can use this boolean series as a row mask over our original dataset to select only those rows that contain at least one nan value:
df[df.isna().any(axis=1)]
Country | Participation-Summer | Gold-Summer | Silver-Summer | Bronze-Summer | Combined-Medals-Summer | Participation-Winter | Gold-Winter | Silver-Winter | Bronze-Winter | Combined-Medals-Winter | Participation-Total | Gold-Total | Silver-Total | Bronze-Total | Combined-Medals | Code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 13 | 0 | 0 | 2 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0 | 13 | 0 | 0.0 | 2 | 2 | AFG |
110 | Senegal | 13 | 0 | 1 | 0 | 1.0 | 5 | 0.0 | 0.0 | NaN | 0 | 18 | 0 | 1.0 | 0 | 1 | SEN |
128 | Togo | 9 | 0 | 0 | 1 | 1.0 | 1 | NaN | 0.0 | 0.0 | 0 | 10 | 0 | NaN | 1 | 1 | TOG |
146 | Zimbabwe | 12 | 3 | 4 | 1 | 8.0 | 1 | 0.0 | NaN | 0.0 | 0 | 13 | 3 | 4.0 | 1 | 8 | ZIM |
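The same mask can be reduced in the other direction to see which columns are affected. This sketch uses a hypothetical three-row excerpt `medals`; it also shows that `dropna` can be restricted to selected columns via `subset`.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt with one missing value in each numeric column.
medals = pd.DataFrame({
    "Country": ["Afghanistan", "Algeria", "Togo"],
    "Gold-Winter": [0.0, 0.0, np.nan],
    "Combined-Medals-Summer": [np.nan, 15.0, 1.0],
})

# Summing the boolean mask column-wise counts the missing cells per column.
missing_per_column = medals.isna().sum()

# Drop only the rows in which "Gold-Winter" is missing.
complete_gold = medals.dropna(subset=["Gold-Winter"])
```

Restricting `dropna` to the columns an analysis actually needs avoids discarding rows whose gaps are irrelevant for it.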
If we look carefully at the data, we see that we can luckily infer all the missing values using our domain knowledge and the contextual data.
To do so, we have to directly index each nan cell and fill in the repaired value.
We can do this by using a row label and a column label with .at:
df.at[0, "Combined-Medals-Summer"] = 2
df.at[110, "Bronze-Winter"] = 0
df.at[128, "Gold-Winter"] = 0
df.at[128, "Silver-Total"] = 0
df.at[146, "Silver-Winter"] = 0
df[df.isna().any(axis=1)]
Country | Participation-Summer | Gold-Summer | Silver-Summer | Bronze-Summer | Combined-Medals-Summer | Participation-Winter | Gold-Winter | Silver-Winter | Bronze-Winter | Combined-Medals-Winter | Participation-Total | Gold-Total | Silver-Total | Bronze-Total | Combined-Medals | Code |
---|
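For the combined fields, the repair can also be computed instead of typed in by hand. The sketch below assumes that a combined count is simply the sum of the gold, silver and bronze counts, and uses a hypothetical two-row excerpt `summer`.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt: the combined summer count of the first row is missing.
summer = pd.DataFrame({
    "Gold-Summer": [0, 5],
    "Silver-Summer": [0, 2],
    "Bronze-Summer": [2, 8],
    "Combined-Medals-Summer": [np.nan, 15.0],
})

# Recompute the combined count for every row from its three parts.
recomputed = summer[["Gold-Summer", "Silver-Summer", "Bronze-Summer"]].sum(axis=1)

# fillna only touches the missing cells; existing values stay as they are.
summer["Combined-Medals-Summer"] = summer["Combined-Medals-Summer"].fillna(recomputed)
```

This only works as long as the parts themselves are complete, since the row-wise sum would silently skip a missing medal count.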
Sometimes we can leverage our domain knowledge and also search for implausible entries.
In our example, there are at least two possible ways to look for implausible records:
Let's start by examining the countries:
Of course, we know that country codes are always made up of three characters, so let's see if there are any outliers:
df.query("Code.str.len() != 3")
Country | Participation-Summer | Gold-Summer | Silver-Summer | Bronze-Summer | Combined-Medals-Summer | Participation-Winter | Gold-Winter | Silver-Winter | Bronze-Winter | Combined-Medals-Winter | Participation-Total | Gold-Total | Silver-Total | Bronze-Total | Combined-Medals | Code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
109 | Atlantis | 1 | 1 | 1 | 1 | 1.0 | 1 | 1.0 | 1.0 | 1.0 | 1 | 1 | 1 | 1.0 | 1 | 1 | ATLA |
148 | Totals | 27 | 4809 | 4775 | 5130 | 14714.0 | 22 | 959.0 | 958.0 | 948.0 | 2865 | 49 | 5768 | 5733.0 | 6078 | 17579 |
The mythical city-state of Atlantis is a product of imagination and should be removed!
We employ the drop method to do this:
df = df.drop(109, axis="index")
df.query("Code == 'ATLA'")
Country | Participation-Summer | Gold-Summer | Silver-Summer | Bronze-Summer | Combined-Medals-Summer | Participation-Winter | Gold-Winter | Silver-Winter | Bronze-Winter | Combined-Medals-Winter | Participation-Total | Gold-Total | Silver-Total | Bronze-Total | Combined-Medals | Code |
---|
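Another plausibility check compares redundant fields with each other. Assuming that the combined medal count should equal the sum of gold, silver and bronze, the sketch below flags rows that violate this in a hypothetical two-row excerpt `medals`.

```python
import pandas as pd

# Hypothetical excerpt: one consistent row and the implausible Atlantis row.
medals = pd.DataFrame({
    "Country": ["Algeria", "Atlantis"],
    "Gold-Total": [5, 1],
    "Silver-Total": [2, 1],
    "Bronze-Total": [8, 1],
    "Combined-Medals": [15, 1],
})

# Sum the three medal columns for every row ...
parts = medals[["Gold-Total", "Silver-Total", "Bronze-Total"]].sum(axis=1)

# ... and keep the rows whose stated total disagrees with that sum.
inconsistent = medals[parts != medals["Combined-Medals"]]
```

Such a check would have flagged the Atlantis row even if its code had happened to consist of three characters.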
Now let's check whether there are any countries that did not participate in the games:
df[df.sum(axis=1, numeric_only=True) == 0]
Country | Participation-Summer | Gold-Summer | Silver-Summer | Bronze-Summer | Combined-Medals-Summer | Participation-Winter | Gold-Winter | Silver-Winter | Bronze-Winter | Combined-Medals-Winter | Participation-Total | Gold-Total | Silver-Total | Bronze-Total | Combined-Medals | Code |
---|
Luckily there are none, so our dataset should now be reasonably clean.
The data cleaning process is an iterative one: often, while analyzing your data, you'll find additional issues or errors introduced by your operations and have to go back to your cleaning stage.
So make sure not to overwrite the original data, and save your cleaning code!
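One way to follow this advice is to collect the cleaning steps in a function that works on a copy. This is only a sketch: the function name `clean`, the two steps it contains and the tiny `raw` table are chosen for illustration, and the length filter assumes that rows without a three-character code (such as a totals row) are not needed.

```python
import pandas as pd


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is left untouched."""
    cleaned = raw.copy()
    # Step 1: remove repeated countries, keeping the first occurrence.
    cleaned = cleaned.drop_duplicates(subset=["Code"])
    # Step 2: keep only rows with a plausible three-character code.
    cleaned = cleaned[cleaned["Code"].str.len() == 3]
    return cleaned.reset_index(drop=True)


# Hypothetical raw data with one duplicate and one implausible entry.
raw = pd.DataFrame({
    "Country": ["Ukraine", "Ukraine", "Atlantis", "Zambia"],
    "Code": ["UKR", "UKR", "ATLA", "ZAM"],
})

cleaned = clean(raw)
```

Because `raw` is never modified, the function can be extended and re-run from scratch whenever a new issue turns up during the analysis.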
One thing that we left out in this example is data normalization.
This is because this term can have multiple meanings:
[(walking, walked, walk) => walk] (e.g., stemming, or preferably lemmatisation)
We'll cover numerical normalization and text normalization in later sessions.
Often, and for various reasons, we can't rely on prebuilt datasets and have to create them ourselves.
To check whether a dataset that fits our needs already exists, there is a specialized search engine by Google, tailored to finding publicly available datasets:
Over the past few years, the Hugging Face Dataset Hub has also become a decent source for various text datasets:
Depending on your topic of interest, there are also other options at your disposal:
For prominent platforms, there is a high chance that you can rely on pre-built software to gather some data.
In this example, we'll scrape some articles from Wikipedia using the wikipedia Python package:
pip install wikipedia
Also, we need two additional libraries:
tqdm: Gives us a fancy progress bar
requests: Allows us to send any type of HTTP request anywhere
pip install tqdm requests
import wikipedia
import requests
page = wikipedia.page("Python (programming language)")
page.revision_id
print(page.url)
print(page.title)
print(page.content[:1000], "...\n\n")
print(page.categories)
https://en.wikipedia.org/wiki/Python_(programming_language) Python (programming language) Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.Python is dynamically-typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was released in 2000 and introduced new features such as list comprehensions, cycle-detecting garbage collection, reference counting, and Unicode support. Python 3.0, released in 2008, was a major revision that is not completely backward-compatible with earlier versions. Python 2 was discontinued with version 2.7.18 in 2020.Python consistently ranks as one of the most popular programming languag ... 
['All articles containing potentially dated statements', 'Articles containing potentially dated statements from November 2022', 'Articles containing potentially dated statements from October 2021', 'Articles with BNF identifiers', 'Articles with FAST identifiers', 'Articles with GND identifiers', 'Articles with J9U identifiers', 'Articles with LCCN identifiers', 'Articles with NKC identifiers', 'Articles with SUDOC identifiers', 'Articles with example Python (programming language) code', 'Articles with short description', 'CS1 maint: archived copy as title', 'Class-based programming languages', 'Computer science in the Netherlands', 'Concurrent programming languages', 'Cross-platform free software', 'Cross-platform software', 'Dutch inventions', 'Dynamically typed programming languages', 'Educational programming languages', 'Good articles', 'High-level programming languages', 'Information technology in the Netherlands', 'Multi-paradigm programming languages', 'Notebook interface', 'Object-oriented programming languages', 'Pages using Sister project links with hidden wikidata', 'Pages using Sister project links with wikidata namespace mismatch', 'Pattern matching programming languages', 'Programming languages', 'Programming languages created in 1991', 'Python (programming language)', 'Scripting languages', 'Short description matches Wikidata', 'Text-oriented programming languages', 'Use dmy dates from November 2021', 'Webarchive template wayback links']
import wikipedia
import requests
import pandas as pd
from tqdm.auto import tqdm
HEADERS = {'User-Agent': 'DataScienceBot/0.0 (https://example.org/DataScienceBot/; DataScienceBot@example.org)'}
def build_wiki_dataset(query: str, max_results: int = 100):
    data = []
    progress_bar = tqdm(wikipedia.search(query, results=max_results))
    for result in progress_bar:
        progress_bar.set_description(f"Downloading page {result}")
        try:
            page = wikipedia.page(result)
            # The last URL segment is the page name used by the REST APIs
            page_name = page.url.split("/")[-1]
            # Total number of edits of the page
            edit_request = requests.get(
                f"https://en.wikipedia.org/w/rest.php/v1/page/{page_name}/history/counts/edits?",
                headers=HEADERS
            )
            num_edits = edit_request.json()["count"]
            # Daily page views, summed over the requested time span
            num_views_request = requests.get(
                f"https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/all-agents/{page_name}/daily/2000010100/2022112700",
                headers=HEADERS
            )
            num_views = sum([entry["views"] for entry in num_views_request.json()["items"]])
        except Exception:
            # We discard the page and move on
            continue
        page_data = {
            "url": page.url,
            "title": page.title,
            "summary": page.summary,
            "text": page.content,
            "num_views": num_views,
            "num_edits": num_edits,
            "categories": ", ".join(page.categories),
            "revision_id": page.revision_id
        }
        data.append(page_data)
    # One row per successfully downloaded page
    df = pd.DataFrame.from_records(data)
    return df
df = build_wiki_dataset("History of ", max_results=50)
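One detail of the function above is worth a closer look: `page.url.split("/")[-1]` keeps only the text after the last slash, so a title that itself contains a slash would be cut off. The following is a minimal sketch of a more defensive variant that starts from the page title instead. The helpers `build_pageviews_url` and `sum_views` are hypothetical names introduced here for illustration and are not part of the `wikipedia` or `requests` libraries. The URL layout and the `items`/`views` keys are taken from the code above.

```python
from urllib.parse import quote

# Base path copied from the request in build_wiki_dataset
PAGEVIEWS_BASE = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "en.wikipedia.org/all-access/all-agents"
)


def build_pageviews_url(title: str, start: str = "2000010100", end: str = "2022112700") -> str:
    """Hypothetical helper: build the pageviews URL from a page title."""
    # Wikipedia page names use underscores instead of spaces
    page_name = title.replace(" ", "_")
    # safe="" also percent-encodes "/", so the title stays a single path segment
    return f"{PAGEVIEWS_BASE}/{quote(page_name, safe='')}/daily/{start}/{end}"


def sum_views(payload: dict) -> int:
    """Hypothetical helper: sum the daily views in a pageviews response body."""
    # A payload without "items" counts as zero views instead of raising KeyError
    return sum(entry["views"] for entry in payload.get("items", []))


print(build_pageviews_url("History of pizza"))
print(sum_views({"items": [{"views": 3}, {"views": 4}]}))
```

Whether a missing `items` key should count as zero views or lead to the page being discarded, as in the original function, is a design choice. The sketch only shows that the decision can be made explicit.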
df
 | url | title | summary | text | num_views | num_edits | categories | revision_id |
---|---|---|---|---|---|---|---|---|
0 | https://en.wikipedia.org/wiki/History | History | History (from Ancient Greek ἱστορία (historía... | History (from Ancient Greek ἱστορία (historía... | 8628272 | 6617 | All Wikipedia articles written in American Eng... | 1124009190 |
1 | https://en.wikipedia.org/wiki/History_of_Europe | History of Europe | The history of Europe is traditionally divided... | The history of Europe is traditionally divided... | 2764829 | 4533 | All articles needing additional references, Al... | 1123723906 |
2 | https://en.wikipedia.org/wiki/History_of_China | History of China | The earliest known written records of the hist... | The earliest known written records of the hist... | 6608376 | 7269 | All articles containing potentially dated stat... | 1124025086 |
3 | https://en.wikipedia.org/wiki/History_of_Islam | History of Islam | The history of Islam concerns the political, s... | The history of Islam concerns the political, s... | 5703155 | 6107 | All articles with unsourced statements, Articl... | 1123323843 |
4 | https://en.wikipedia.org/wiki/History_of_Japan | History of Japan | The first human inhabitants of the Japanese ar... | The first human inhabitants of the Japanese ar... | 5031470 | 7603 | All Wikipedia articles written in American Eng... | 1123804351 |
5 | https://en.wikipedia.org/wiki/History_of_telev... | History of television | The concept of television was the work of many... | The concept of television was the work of many... | 5074930 | 3906 | All articles to be expanded, All articles with... | 1123620788 |
6 | https://en.wikipedia.org/wiki/History_of_Chris... | History of Christianity | The history of Christianity concerns the Chris... | The history of Christianity concerns the Chris... | 3762801 | 5801 | All articles with bare URLs for citations, All... | 1123072552 |
7 | https://en.wikipedia.org/wiki/History_of_religion | History of religion | The history of religion refers to the written ... | The history of religion refers to the written ... | 1075565 | 654 | All articles with specifically marked weasel-w... | 1123342929 |
8 | https://en.wikipedia.org/wiki/History_of_Poland | History of Poland | The history of Poland spans over a thousand ye... | The history of Poland spans over a thousand ye... | 1542336 | 8199 | All articles lacking reliable references, Arti... | 1115808344 |
9 | https://en.wikipedia.org/wiki/End_of_history | End of history | The end of history is a political and philosop... | The end of history is a political and philosop... | 499923 | 105 | 1860s neologisms, All articles to be expanded,... | 1123918495 |
10 | https://en.wikipedia.org/wiki/History_of_Russia | History of Russia | The history of Russia begins with the historie... | The history of Russia begins with the historie... | 3289180 | 4944 | All articles with minor POV problems, All arti... | 1124023255 |
11 | https://en.wikipedia.org/wiki/History_of_banking | History of banking | The history of banking began with the first pr... | The history of banking began with the first pr... | 2333050 | 1619 | All articles with unsourced statements, All ar... | 1116409705 |
12 | https://en.wikipedia.org/wiki/History_of_Rome | History of Rome | The history of Rome includes the history of th... | The history of Rome includes the history of th... | 2079609 | 2796 | All articles containing potentially dated stat... | 1123947372 |
13 | https://en.wikipedia.org/wiki/History_of_Germany | History of Germany | The concept of Germany as a distinct region in... | The concept of Germany as a distinct region in... | 2741092 | 5078 | All articles lacking reliable references, Arti... | 1124396390 |
14 | https://en.wikipedia.org/wiki/History_of_Spain | History of Spain | The history of Spain dates to the pre-Roman pe... | The history of Spain dates to the pre-Roman pe... | 2355382 | 4119 | All accuracy disputes, All articles with dead ... | 1124373367 |
15 | https://en.wikipedia.org/wiki/History_of_Austr... | History of Australia | The history of Australia is the story of the l... | The history of Australia is the story of the l... | 3782171 | 5856 | AC with 0 elements, All Wikipedia articles wri... | 1123124583 |
16 | https://en.wikipedia.org/wiki/History_of_Arda | History of Arda | In J. R. R. Tolkien's legendarium, the history... | In J. R. R. Tolkien's legendarium, the history... | 605422 | 610 | Articles with short description, Fictional his... | 1123256525 |
17 | https://en.wikipedia.org/wiki/History_of_mathe... | History of mathematics | The history of mathematics deals with the orig... | The history of mathematics deals with the orig... | 3315598 | 4317 | All articles needing additional references, Al... | 1122788585 |
18 | https://en.wikipedia.org/wiki/History_of_Bangl... | History of Bangladesh | Civilisational history of Bangladesh previousl... | Civilisational history of Bangladesh previousl... | 1835490 | 2666 | All articles lacking reliable references, All ... | 1123069979 |
19 | https://en.wikipedia.org/wiki/History_of_Hinduism | History of Hinduism | The history of Hinduism covers a wide variety ... | The history of Hinduism covers a wide variety ... | 1797592 | 2935 | All articles lacking reliable references, All ... | 1124512840 |
20 | https://en.wikipedia.org/wiki/History_of_Ireland | History of Ireland | The first evidence of human presence in Irelan... | The first evidence of human presence in Irelan... | 2701679 | 3338 | All accuracy disputes, All articles with style... | 1123697361 |
21 | https://en.wikipedia.org/wiki/History_of_agric... | History of agriculture | Agriculture began independently in different p... | Agriculture began independently in different p... | 2516191 | 1977 | All articles with dead external links, Article... | 1123034506 |
22 | https://en.wikipedia.org/wiki/History_of_pharmacy | History of pharmacy | The history of pharmacy as an independent scie... | The history of pharmacy as an independent scie... | 450258 | 174 | All articles with unsourced statements, Articl... | 1113261032 |
23 | https://en.wikipedia.org/wiki/History_of_democ... | History of democracy | A democracy is a political system, or a system... | A democracy is a political system, or a system... | 1972822 | 2383 | All Wikipedia articles needing clarification, ... | 1122481158 |
24 | https://en.wikipedia.org/wiki/History_of_pizza | History of pizza | The history of pizza begins in antiquity, as v... | The history of pizza begins in antiquity, as v... | 4086441 | 4146 | All articles with failed verification, All art... | 1124192347 |
25 | https://en.wikipedia.org/wiki/History_of_syphilis | History of syphilis | The first recorded outbreak of syphilis in Eur... | The first recorded outbreak of syphilis in Eur... | 1548048 | 479 | All articles with dead external links, All art... | 1123246612 |
26 | https://en.wikipedia.org/wiki/History_of_Italy | History of Italy | The history of Italy covers the ancient period... | The history of Italy covers the ancient period... | 1824793 | 3518 | All articles with incomplete citations, Articl... | 1124364753 |
27 | https://en.wikipedia.org/wiki/History_of_medicine | History of medicine | The history of medicine is both a study of med... | The history of medicine is both a study of med... | 2008806 | 2819 | All articles containing potentially dated stat... | 1123053788 |
28 | https://en.wikipedia.org/wiki/History_of_calen... | History of calendars | The history of calendars, that is, of people c... | The history of calendars, that is, of people c... | 1411516 | 680 | All articles with incomplete citations, All ar... | 1123446945 |
29 | https://en.wikipedia.org/wiki/History_of_Facebook | History of Facebook | Facebook is a social networking service origin... | Facebook is a social networking service origin... | 4916523 | 1735 | All Wikipedia articles in need of updating, Al... | 1124469003 |
30 | https://en.wikipedia.org/wiki/History_of_cricket | History of cricket | The sport of cricket has a known history begin... | The sport of cricket has a known history begin... | 3234448 | 2040 | All articles needing additional references, Al... | 1123571727 |
31 | https://en.wikipedia.org/wiki/History_of_Earth | History of Earth | The history of Earth concerns the development ... | The history of Earth concerns the development ... | 4602476 | 4317 | AC with 0 elements, All articles with unsource... | 1124371678 |
32 | https://en.wikipedia.org/wiki/History_of_the_U... | History of the United States | The history of the lands that became the Unite... | The history of the lands that became the Unite... | 6916785 | 11165 | All Wikipedia articles needing clarification, ... | 1124535526 |
33 | https://en.wikipedia.org/wiki/History_of_Afgha... | History of Afghanistan | The history of Afghanistan as a state began in... | The history of Afghanistan as a state began in... | 2155957 | 2971 | All articles with dead external links, Article... | 1121424232 |
34 | https://en.wikipedia.org/wiki/History_of_Korea | History of Korea | The Lower Paleolithic era in the Korean Penins... | The Lower Paleolithic era in the Korean Penins... | 2743246 | 3470 | Accuracy disputes from December 2017, All accu... | 1124522331 |
35 | https://en.wikipedia.org/wiki/History_of_baske... | History of basketball | Basketball began with its invention in 1891 in... | Basketball began with its invention in 1891 in... | 4434485 | 3010 | All articles to be expanded, All articles with... | 1117994066 |
36 | https://en.wikipedia.org/wiki/History_of_commu... | History of communication | The history of communication technologies (med... | The history of communication technologies (med... | 1620064 | 957 | All articles needing additional references, Al... | 1123372214 |
37 | https://en.wikipedia.org/wiki/History_of_marke... | History of marketing | The study of the history of marketing, as a di... | The study of the history of marketing, as a di... | 931211 | 785 | Articles with short description, EngvarB from ... | 1117474637 |
38 | https://en.wikipedia.org/wiki/History_of_Lambo... | History of Lamborghini | Automobili Lamborghini S.p.A. is an Italian br... | Automobili Lamborghini S.p.A. is an Italian br... | 960020 | 271 | All articles with unsourced statements, Articl... | 1108125867 |
39 | https://en.wikipedia.org/wiki/History_of_Pakistan | History of Pakistan | The history of the lands of Pakistan for the p... | The history of the lands of Pakistan for the p... | 1806158 | 5054 | All Wikipedia articles written in Pakistani En... | 1123802420 |
40 | https://en.wikipedia.org/wiki/History_of_cotton | History of cotton | The history of cotton can be traced to domesti... | The history of cotton can be traced to domesti... | 716500 | 343 | All articles with unsourced statements, Articl... | 1118655355 |
41 | https://en.wikipedia.org/wiki/History_of_cheese | History of cheese | The production of cheese predates recorded his... | The production of cheese predates recorded his... | 653140 | 632 | All articles with vague or ambiguous time, Art... | 1123553575 |
42 | https://en.wikipedia.org/wiki/History_of_Belgium | History of Belgium | The history of Belgium extends before the foun... | The history of Belgium extends before the foun... | 1119414 | 2301 | All articles covered by WikiProject Wikify, Al... | 1124232962 |
43 | https://en.wikipedia.org/wiki/History_of_corsets | History of corsets | The corset has been an indispensable supportiv... | The corset has been an indispensable supportiv... | 706954 | 553 | 16th-century fashion, 17th-century fashion, 18... | 1083739897 |
44 | https://en.wikipedia.org/wiki/History_of_Ghana | History of Ghana | The Republic of Ghana is named after the medie... | The Republic of Ghana is named after the medie... | 1305934 | 1479 | All Wikipedia articles written in Ghanaian Eng... | 1123589131 |
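Note that the `categories` column stores a list as a single comma-separated string. The sketch below illustrates why this can lose information: if a category name contains `", "` itself, splitting the string does not give back the original list. The category names used here are made up for illustration and do not come from the dataset. Serializing the list as JSON is shown as one possible alternative, not as the approach used in this notebook.

```python
import json

# Hypothetical category names; the second one contains a comma
categories = ["History", "Education, training and research", "Good articles"]

# The approach used in build_wiki_dataset
joined = ", ".join(categories)
recovered = joined.split(", ")

print(len(categories))  # 3
print(len(recovered))   # 4, the second category was split in two

# One possible alternative: serialize the list as JSON, which round-trips exactly
as_json = json.dumps(categories)
print(json.loads(as_json) == categories)  # True
```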
df.sort_values("num_views")
 | url | title | summary | text | num_views | num_edits | categories | revision_id |
---|---|---|---|---|---|---|---|---|
22 | https://en.wikipedia.org/wiki/History_of_pharmacy | History of pharmacy | The history of pharmacy as an independent scie... | The history of pharmacy as an independent scie... | 450258 | 174 | All articles with unsourced statements, Articl... | 1113261032 |
9 | https://en.wikipedia.org/wiki/End_of_history | End of history | The end of history is a political and philosop... | The end of history is a political and philosop... | 499923 | 105 | 1860s neologisms, All articles to be expanded,... | 1123918495 |
16 | https://en.wikipedia.org/wiki/History_of_Arda | History of Arda | In J. R. R. Tolkien's legendarium, the history... | In J. R. R. Tolkien's legendarium, the history... | 605422 | 610 | Articles with short description, Fictional his... | 1123256525 |
41 | https://en.wikipedia.org/wiki/History_of_cheese | History of cheese | The production of cheese predates recorded his... | The production of cheese predates recorded his... | 653140 | 632 | All articles with vague or ambiguous time, Art... | 1123553575 |
43 | https://en.wikipedia.org/wiki/History_of_corsets | History of corsets | The corset has been an indispensable supportiv... | The corset has been an indispensable supportiv... | 706954 | 553 | 16th-century fashion, 17th-century fashion, 18... | 1083739897 |
40 | https://en.wikipedia.org/wiki/History_of_cotton | History of cotton | The history of cotton can be traced to domesti... | The history of cotton can be traced to domesti... | 716500 | 343 | All articles with unsourced statements, Articl... | 1118655355 |
37 | https://en.wikipedia.org/wiki/History_of_marke... | History of marketing | The study of the history of marketing, as a di... | The study of the history of marketing, as a di... | 931211 | 785 | Articles with short description, EngvarB from ... | 1117474637 |
38 | https://en.wikipedia.org/wiki/History_of_Lambo... | History of Lamborghini | Automobili Lamborghini S.p.A. is an Italian br... | Automobili Lamborghini S.p.A. is an Italian br... | 960020 | 271 | All articles with unsourced statements, Articl... | 1108125867 |
7 | https://en.wikipedia.org/wiki/History_of_religion | History of religion | The history of religion refers to the written ... | The history of religion refers to the written ... | 1075565 | 654 | All articles with specifically marked weasel-w... | 1123342929 |
42 | https://en.wikipedia.org/wiki/History_of_Belgium | History of Belgium | The history of Belgium extends before the foun... | The history of Belgium extends before the foun... | 1119414 | 2301 | All articles covered by WikiProject Wikify, Al... | 1124232962 |
44 | https://en.wikipedia.org/wiki/History_of_Ghana | History of Ghana | The Republic of Ghana is named after the medie... | The Republic of Ghana is named after the medie... | 1305934 | 1479 | All Wikipedia articles written in Ghanaian Eng... | 1123589131 |
28 | https://en.wikipedia.org/wiki/History_of_calen... | History of calendars | The history of calendars, that is, of people c... | The history of calendars, that is, of people c... | 1411516 | 680 | All articles with incomplete citations, All ar... | 1123446945 |
8 | https://en.wikipedia.org/wiki/History_of_Poland | History of Poland | The history of Poland spans over a thousand ye... | The history of Poland spans over a thousand ye... | 1542336 | 8199 | All articles lacking reliable references, Arti... | 1115808344 |
25 | https://en.wikipedia.org/wiki/History_of_syphilis | History of syphilis | The first recorded outbreak of syphilis in Eur... | The first recorded outbreak of syphilis in Eur... | 1548048 | 479 | All articles with dead external links, All art... | 1123246612 |
36 | https://en.wikipedia.org/wiki/History_of_commu... | History of communication | The history of communication technologies (med... | The history of communication technologies (med... | 1620064 | 957 | All articles needing additional references, Al... | 1123372214 |
19 | https://en.wikipedia.org/wiki/History_of_Hinduism | History of Hinduism | The history of Hinduism covers a wide variety ... | The history of Hinduism covers a wide variety ... | 1797592 | 2935 | All articles lacking reliable references, All ... | 1124512840 |
39 | https://en.wikipedia.org/wiki/History_of_Pakistan | History of Pakistan | The history of the lands of Pakistan for the p... | The history of the lands of Pakistan for the p... | 1806158 | 5054 | All Wikipedia articles written in Pakistani En... | 1123802420 |
26 | https://en.wikipedia.org/wiki/History_of_Italy | History of Italy | The history of Italy covers the ancient period... | The history of Italy covers the ancient period... | 1824793 | 3518 | All articles with incomplete citations, Articl... | 1124364753 |
18 | https://en.wikipedia.org/wiki/History_of_Bangl... | History of Bangladesh | Civilisational history of Bangladesh previousl... | Civilisational history of Bangladesh previousl... | 1835490 | 2666 | All articles lacking reliable references, All ... | 1123069979 |
23 | https://en.wikipedia.org/wiki/History_of_democ... | History of democracy | A democracy is a political system, or a system... | A democracy is a political system, or a system... | 1972822 | 2383 | All Wikipedia articles needing clarification, ... | 1122481158 |
27 | https://en.wikipedia.org/wiki/History_of_medicine | History of medicine | The history of medicine is both a study of med... | The history of medicine is both a study of med... | 2008806 | 2819 | All articles containing potentially dated stat... | 1123053788 |
12 | https://en.wikipedia.org/wiki/History_of_Rome | History of Rome | The history of Rome includes the history of th... | The history of Rome includes the history of th... | 2079609 | 2796 | All articles containing potentially dated stat... | 1123947372 |
33 | https://en.wikipedia.org/wiki/History_of_Afgha... | History of Afghanistan | The history of Afghanistan as a state began in... | The history of Afghanistan as a state began in... | 2155957 | 2971 | All articles with dead external links, Article... | 1121424232 |
11 | https://en.wikipedia.org/wiki/History_of_banking | History of banking | The history of banking began with the first pr... | The history of banking began with the first pr... | 2333050 | 1619 | All articles with unsourced statements, All ar... | 1116409705 |
14 | https://en.wikipedia.org/wiki/History_of_Spain | History of Spain | The history of Spain dates to the pre-Roman pe... | The history of Spain dates to the pre-Roman pe... | 2355382 | 4119 | All accuracy disputes, All articles with dead ... | 1124373367 |
21 | https://en.wikipedia.org/wiki/History_of_agric... | History of agriculture | Agriculture began independently in different p... | Agriculture began independently in different p... | 2516191 | 1977 | All articles with dead external links, Article... | 1123034506 |
20 | https://en.wikipedia.org/wiki/History_of_Ireland | History of Ireland | The first evidence of human presence in Irelan... | The first evidence of human presence in Irelan... | 2701679 | 3338 | All accuracy disputes, All articles with style... | 1123697361 |
13 | https://en.wikipedia.org/wiki/History_of_Germany | History of Germany | The concept of Germany as a distinct region in... | The concept of Germany as a distinct region in... | 2741092 | 5078 | All articles lacking reliable references, Arti... | 1124396390 |
34 | https://en.wikipedia.org/wiki/History_of_Korea | History of Korea | The Lower Paleolithic era in the Korean Penins... | The Lower Paleolithic era in the Korean Penins... | 2743246 | 3470 | Accuracy disputes from December 2017, All accu... | 1124522331 |
1 | https://en.wikipedia.org/wiki/History_of_Europe | History of Europe | The history of Europe is traditionally divided... | The history of Europe is traditionally divided... | 2764829 | 4533 | All articles needing additional references, Al... | 1123723906 |
30 | https://en.wikipedia.org/wiki/History_of_cricket | History of cricket | The sport of cricket has a known history begin... | The sport of cricket has a known history begin... | 3234448 | 2040 | All articles needing additional references, Al... | 1123571727 |
10 | https://en.wikipedia.org/wiki/History_of_Russia | History of Russia | The history of Russia begins with the historie... | The history of Russia begins with the historie... | 3289180 | 4944 | All articles with minor POV problems, All arti... | 1124023255 |
17 | https://en.wikipedia.org/wiki/History_of_mathe... | History of mathematics | The history of mathematics deals with the orig... | The history of mathematics deals with the orig... | 3315598 | 4317 | All articles needing additional references, Al... | 1122788585 |
6 | https://en.wikipedia.org/wiki/History_of_Chris... | History of Christianity | The history of Christianity concerns the Chris... | The history of Christianity concerns the Chris... | 3762801 | 5801 | All articles with bare URLs for citations, All... | 1123072552 |
15 | https://en.wikipedia.org/wiki/History_of_Austr... | History of Australia | The history of Australia is the story of the l... | The history of Australia is the story of the l... | 3782171 | 5856 | AC with 0 elements, All Wikipedia articles wri... | 1123124583 |
24 | https://en.wikipedia.org/wiki/History_of_pizza | History of pizza | The history of pizza begins in antiquity, as v... | The history of pizza begins in antiquity, as v... | 4086441 | 4146 | All articles with failed verification, All art... | 1124192347 |
35 | https://en.wikipedia.org/wiki/History_of_baske... | History of basketball | Basketball began with its invention in 1891 in... | Basketball began with its invention in 1891 in... | 4434485 | 3010 | All articles to be expanded, All articles with... | 1117994066 |
31 | https://en.wikipedia.org/wiki/History_of_Earth | History of Earth | The history of Earth concerns the development ... | The history of Earth concerns the development ... | 4602476 | 4317 | AC with 0 elements, All articles with unsource... | 1124371678 |
29 | https://en.wikipedia.org/wiki/History_of_Facebook | History of Facebook | Facebook is a social networking service origin... | Facebook is a social networking service origin... | 4916523 | 1735 | All Wikipedia articles in need of updating, Al... | 1124469003 |
4 | https://en.wikipedia.org/wiki/History_of_Japan | History of Japan | The first human inhabitants of the Japanese ar... | The first human inhabitants of the Japanese ar... | 5031470 | 7603 | All Wikipedia articles written in American Eng... | 1123804351 |
5 | https://en.wikipedia.org/wiki/History_of_telev... | History of television | The concept of television was the work of many... | The concept of television was the work of many... | 5074930 | 3906 | All articles to be expanded, All articles with... | 1123620788 |
3 | https://en.wikipedia.org/wiki/History_of_Islam | History of Islam | The history of Islam concerns the political, s... | The history of Islam concerns the political, s... | 5703155 | 6107 | All articles with unsourced statements, Articl... | 1123323843 |
2 | https://en.wikipedia.org/wiki/History_of_China | History of China | The earliest known written records of the hist... | The earliest known written records of the hist... | 6608376 | 7269 | All articles containing potentially dated stat... | 1124025086 |
32 | https://en.wikipedia.org/wiki/History_of_the_U... | History of the United States | The history of the lands that became the Unite... | The history of the lands that became the Unite... | 6916785 | 11165 | All Wikipedia articles needing clarification, ... | 1124535526 |
0 | https://en.wikipedia.org/wiki/History | History | History (from Ancient Greek ἱστορία (historía... | History (from Ancient Greek ἱστορία (historía... | 8628272 | 6617 | All Wikipedia articles written in American Eng... | 1124009190 |
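`sort_values` sorts in ascending order by default, so the most viewed page ends up in the last row. As a small sketch, the example below uses a stand-in DataFrame with four rows copied from the table above (the variable name `sample` is an assumption for illustration) to show how to get the largest values first.

```python
import pandas as pd

# Four rows copied from the output above, reduced to three columns
sample = pd.DataFrame({
    "title": ["History", "History of pharmacy", "End of history", "History of China"],
    "num_views": [8628272, 450258, 499923, 6608376],
    "num_edits": [6617, 174, 105, 7269],
})

# Descending order: most viewed pages first
most_viewed = sample.sort_values("num_views", ascending=False)
print(most_viewed)

# nlargest returns only the top n rows for a column
top_two = sample.nlargest(2, "num_views")
print(top_two)
```

Both calls return new DataFrames and leave `sample` unchanged, and the original index labels are kept, which is why the index column in the sorted tables above is no longer in order.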
df.sort_values("num_edits")
 | url | title | summary | text | num_views | num_edits | categories | revision_id |
---|---|---|---|---|---|---|---|---|
9 | https://en.wikipedia.org/wiki/End_of_history | End of history | The end of history is a political and philosop... | The end of history is a political and philosop... | 499923 | 105 | 1860s neologisms, All articles to be expanded,... | 1123918495 |
22 | https://en.wikipedia.org/wiki/History_of_pharmacy | History of pharmacy | The history of pharmacy as an independent scie... | The history of pharmacy as an independent scie... | 450258 | 174 | All articles with unsourced statements, Articl... | 1113261032 |
38 | https://en.wikipedia.org/wiki/History_of_Lambo... | History of Lamborghini | Automobili Lamborghini S.p.A. is an Italian br... | Automobili Lamborghini S.p.A. is an Italian br... | 960020 | 271 | All articles with unsourced statements, Articl... | 1108125867 |
40 | https://en.wikipedia.org/wiki/History_of_cotton | History of cotton | The history of cotton can be traced to domesti... | The history of cotton can be traced to domesti... | 716500 | 343 | All articles with unsourced statements, Articl... | 1118655355 |
25 | https://en.wikipedia.org/wiki/History_of_syphilis | History of syphilis | The first recorded outbreak of syphilis in Eur... | The first recorded outbreak of syphilis in Eur... | 1548048 | 479 | All articles with dead external links, All art... | 1123246612 |
43 | https://en.wikipedia.org/wiki/History_of_corsets | History of corsets | The corset has been an indispensable supportiv... | The corset has been an indispensable supportiv... | 706954 | 553 | 16th-century fashion, 17th-century fashion, 18... | 1083739897 |
16 | https://en.wikipedia.org/wiki/History_of_Arda | History of Arda | In J. R. R. Tolkien's legendarium, the history... | In J. R. R. Tolkien's legendarium, the history... | 605422 | 610 | Articles with short description, Fictional his... | 1123256525 |
41 | https://en.wikipedia.org/wiki/History_of_cheese | History of cheese | The production of cheese predates recorded his... | The production of cheese predates recorded his... | 653140 | 632 | All articles with vague or ambiguous time, Art... | 1123553575 |
7 | https://en.wikipedia.org/wiki/History_of_religion | History of religion | The history of religion refers to the written ... | The history of religion refers to the written ... | 1075565 | 654 | All articles with specifically marked weasel-w... | 1123342929 |
28 | https://en.wikipedia.org/wiki/History_of_calen... | History of calendars | The history of calendars, that is, of people c... | The history of calendars, that is, of people c... | 1411516 | 680 | All articles with incomplete citations, All ar... | 1123446945 |
37 | https://en.wikipedia.org/wiki/History_of_marke... | History of marketing | The study of the history of marketing, as a di... | The study of the history of marketing, as a di... | 931211 | 785 | Articles with short description, EngvarB from ... | 1117474637 |
36 | https://en.wikipedia.org/wiki/History_of_commu... | History of communication | The history of communication technologies (med... | The history of communication technologies (med... | 1620064 | 957 | All articles needing additional references, Al... | 1123372214 |
44 | https://en.wikipedia.org/wiki/History_of_Ghana | History of Ghana | The Republic of Ghana is named after the medie... | The Republic of Ghana is named after the medie... | 1305934 | 1479 | All Wikipedia articles written in Ghanaian Eng... | 1123589131 |
11 | https://en.wikipedia.org/wiki/History_of_banking | History of banking | The history of banking began with the first pr... | The history of banking began with the first pr... | 2333050 | 1619 | All articles with unsourced statements, All ar... | 1116409705 |
29 | https://en.wikipedia.org/wiki/History_of_Facebook | History of Facebook | Facebook is a social networking service origin... | Facebook is a social networking service origin... | 4916523 | 1735 | All Wikipedia articles in need of updating, Al... | 1124469003 |
21 | https://en.wikipedia.org/wiki/History_of_agric... | History of agriculture | Agriculture began independently in different p... | Agriculture began independently in different p... | 2516191 | 1977 | All articles with dead external links, Article... | 1123034506 |
30 | https://en.wikipedia.org/wiki/History_of_cricket | History of cricket | The sport of cricket has a known history begin... | The sport of cricket has a known history begin... | 3234448 | 2040 | All articles needing additional references, Al... | 1123571727 |
42 | https://en.wikipedia.org/wiki/History_of_Belgium | History of Belgium | The history of Belgium extends before the foun... | The history of Belgium extends before the foun... | 1119414 | 2301 | All articles covered by WikiProject Wikify, Al... | 1124232962 |
23 | https://en.wikipedia.org/wiki/History_of_democ... | History of democracy | A democracy is a political system, or a system... | A democracy is a political system, or a system... | 1972822 | 2383 | All Wikipedia articles needing clarification, ... | 1122481158 |
18 | https://en.wikipedia.org/wiki/History_of_Bangl... | History of Bangladesh | Civilisational history of Bangladesh previousl... | Civilisational history of Bangladesh previousl... | 1835490 | 2666 | All articles lacking reliable references, All ... | 1123069979 |
12 | https://en.wikipedia.org/wiki/History_of_Rome | History of Rome | The history of Rome includes the history of th... | The history of Rome includes the history of th... | 2079609 | 2796 | All articles containing potentially dated stat... | 1123947372 |
27 | https://en.wikipedia.org/wiki/History_of_medicine | History of medicine | The history of medicine is both a study of med... | The history of medicine is both a study of med... | 2008806 | 2819 | All articles containing potentially dated stat... | 1123053788 |
19 | https://en.wikipedia.org/wiki/History_of_Hinduism | History of Hinduism | The history of Hinduism covers a wide variety ... | The history of Hinduism covers a wide variety ... | 1797592 | 2935 | All articles lacking reliable references, All ... | 1124512840 |
33 | https://en.wikipedia.org/wiki/History_of_Afgha... | History of Afghanistan | The history of Afghanistan as a state began in... | The history of Afghanistan as a state began in... | 2155957 | 2971 | All articles with dead external links, Article... | 1121424232 |
35 | https://en.wikipedia.org/wiki/History_of_baske... | History of basketball | Basketball began with its invention in 1891 in... | Basketball began with its invention in 1891 in... | 4434485 | 3010 | All articles to be expanded, All articles with... | 1117994066 |
20 | https://en.wikipedia.org/wiki/History_of_Ireland | History of Ireland | The first evidence of human presence in Irelan... | The first evidence of human presence in Irelan... | 2701679 | 3338 | All accuracy disputes, All articles with style... | 1123697361 |
34 | https://en.wikipedia.org/wiki/History_of_Korea | History of Korea | The Lower Paleolithic era in the Korean Penins... | The Lower Paleolithic era in the Korean Penins... | 2743246 | 3470 | Accuracy disputes from December 2017, All accu... | 1124522331 |
26 | https://en.wikipedia.org/wiki/History_of_Italy | History of Italy | The history of Italy covers the ancient period... | The history of Italy covers the ancient period... | 1824793 | 3518 | All articles with incomplete citations, Articl... | 1124364753 |
5 | https://en.wikipedia.org/wiki/History_of_telev... | History of television | The concept of television was the work of many... | The concept of television was the work of many... | 5074930 | 3906 | All articles to be expanded, All articles with... | 1123620788 |
14 | https://en.wikipedia.org/wiki/History_of_Spain | History of Spain | The history of Spain dates to the pre-Roman pe... | The history of Spain dates to the pre-Roman pe... | 2355382 | 4119 | All accuracy disputes, All articles with dead ... | 1124373367 |
24 | https://en.wikipedia.org/wiki/History_of_pizza | History of pizza | The history of pizza begins in antiquity, as v... | The history of pizza begins in antiquity, as v... | 4086441 | 4146 | All articles with failed verification, All art... | 1124192347 |
31 | https://en.wikipedia.org/wiki/History_of_Earth | History of Earth | The history of Earth concerns the development ... | The history of Earth concerns the development ... | 4602476 | 4317 | AC with 0 elements, All articles with unsource... | 1124371678 |
17 | https://en.wikipedia.org/wiki/History_of_mathe... | History of mathematics | The history of mathematics deals with the orig... | The history of mathematics deals with the orig... | 3315598 | 4317 | All articles needing additional references, Al... | 1122788585 |
1 | https://en.wikipedia.org/wiki/History_of_Europe | History of Europe | The history of Europe is traditionally divided... | The history of Europe is traditionally divided... | 2764829 | 4533 | All articles needing additional references, Al... | 1123723906 |
10 | https://en.wikipedia.org/wiki/History_of_Russia | History of Russia | The history of Russia begins with the historie... | The history of Russia begins with the historie... | 3289180 | 4944 | All articles with minor POV problems, All arti... | 1124023255 |
39 | https://en.wikipedia.org/wiki/History_of_Pakistan | History of Pakistan | The history of the lands of Pakistan for the p... | The history of the lands of Pakistan for the p... | 1806158 | 5054 | All Wikipedia articles written in Pakistani En... | 1123802420 |
13 | https://en.wikipedia.org/wiki/History_of_Germany | History of Germany | The concept of Germany as a distinct region in... | The concept of Germany as a distinct region in... | 2741092 | 5078 | All articles lacking reliable references, Arti... | 1124396390 |
6 | https://en.wikipedia.org/wiki/History_of_Chris... | History of Christianity | The history of Christianity concerns the Chris... | The history of Christianity concerns the Chris... | 3762801 | 5801 | All articles with bare URLs for citations, All... | 1123072552 |
15 | https://en.wikipedia.org/wiki/History_of_Austr... | History of Australia | The history of Australia is the story of the l... | The history of Australia is the story of the l... | 3782171 | 5856 | AC with 0 elements, All Wikipedia articles wri... | 1123124583 |
3 | https://en.wikipedia.org/wiki/History_of_Islam | History of Islam | The history of Islam concerns the political, s... | The history of Islam concerns the political, s... | 5703155 | 6107 | All articles with unsourced statements, Articl... | 1123323843 |
0 | https://en.wikipedia.org/wiki/History | History | History (from Ancient Greek ἱστορία (historía... | History (from Ancient Greek ἱστορία (historía... | 8628272 | 6617 | All Wikipedia articles written in American Eng... | 1124009190 |
2 | https://en.wikipedia.org/wiki/History_of_China | History of China | The earliest known written records of the hist... | The earliest known written records of the hist... | 6608376 | 7269 | All articles containing potentially dated stat... | 1124025086 |
4 | https://en.wikipedia.org/wiki/History_of_Japan | History of Japan | The first human inhabitants of the Japanese ar... | The first human inhabitants of the Japanese ar... | 5031470 | 7603 | All Wikipedia articles written in American Eng... | 1123804351 |
8 | https://en.wikipedia.org/wiki/History_of_Poland | History of Poland | The history of Poland spans over a thousand ye... | The history of Poland spans over a thousand ye... | 1542336 | 8199 | All articles lacking reliable references, Arti... | 1115808344 |
32 | https://en.wikipedia.org/wiki/History_of_the_U... | History of the United States | The history of the lands that became the Unite... | The history of the lands that became the Unite... | 6916785 | 11165 | All Wikipedia articles needing clarification, ... | 1124535526 |
df.shape
(45, 8)
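# Approximate each article's length by its number of whitespace-separated tokens
df["text_length"] = df.text.str.split().apply(len)
# Show the articles ordered from shortest to longest
df.sort_values("text_length")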
 | url | title | summary | text | num_views | num_edits | categories | revision_id | text_length |
---|---|---|---|---|---|---|---|---|---|
9 | https://en.wikipedia.org/wiki/End_of_history | End of history | The end of history is a political and philosop... | The end of history is a political and philosop... | 499923 | 105 | 1860s neologisms, All articles to be expanded,... | 1123918495 | 814 |
41 | https://en.wikipedia.org/wiki/History_of_cheese | History of cheese | The production of cheese predates recorded his... | The production of cheese predates recorded his... | 653140 | 632 | All articles with vague or ambiguous time, Art... | 1123553575 | 1453 |
7 | https://en.wikipedia.org/wiki/History_of_religion | History of religion | The history of religion refers to the written ... | The history of religion refers to the written ... | 1075565 | 654 | All articles with specifically marked weasel-w... | 1123342929 | 1652 |
43 | https://en.wikipedia.org/wiki/History_of_corsets | History of corsets | The corset has been an indispensable supportiv... | The corset has been an indispensable supportiv... | 706954 | 553 | 16th-century fashion, 17th-century fashion, 18... | 1083739897 | 1671 |
22 | https://en.wikipedia.org/wiki/History_of_pharmacy | History of pharmacy | The history of pharmacy as an independent scie... | The history of pharmacy as an independent scie... | 450258 | 174 | All articles with unsourced statements, Articl... | 1113261032 | 1700 |
24 | https://en.wikipedia.org/wiki/History_of_pizza | History of pizza | The history of pizza begins in antiquity, as v... | The history of pizza begins in antiquity, as v... | 4086441 | 4146 | All articles with failed verification, All art... | 1124192347 | 2640 |
36 | https://en.wikipedia.org/wiki/History_of_commu... | History of communication | The history of communication technologies (med... | The history of communication technologies (med... | 1620064 | 957 | All articles needing additional references, Al... | 1123372214 | 2983 |
29 | https://en.wikipedia.org/wiki/History_of_Facebook | History of Facebook | Facebook is a social networking service origin... | Facebook is a social networking service origin... | 4916523 | 1735 | All Wikipedia articles in need of updating, Al... | 1124469003 | 3215 |
40 | https://en.wikipedia.org/wiki/History_of_cotton | History of cotton | The history of cotton can be traced to domesti... | The history of cotton can be traced to domesti... | 716500 | 343 | All articles with unsourced statements, Articl... | 1118655355 | 4067 |
30 | https://en.wikipedia.org/wiki/History_of_cricket | History of cricket | The sport of cricket has a known history begin... | The sport of cricket has a known history begin... | 3234448 | 2040 | All articles needing additional references, Al... | 1123571727 | 4203 |
35 | https://en.wikipedia.org/wiki/History_of_baske... | History of basketball | Basketball began with its invention in 1891 in... | Basketball began with its invention in 1891 in... | 4434485 | 3010 | All articles to be expanded, All articles with... | 1117994066 | 4218 |
25 | https://en.wikipedia.org/wiki/History_of_syphilis | History of syphilis | The first recorded outbreak of syphilis in Eur... | The first recorded outbreak of syphilis in Eur... | 1548048 | 479 | All articles with dead external links, All art... | 1123246612 | 4314 |
28 | https://en.wikipedia.org/wiki/History_of_calen... | History of calendars | The history of calendars, that is, of people c... | The history of calendars, that is, of people c... | 1411516 | 680 | All articles with incomplete citations, All ar... | 1123446945 | 4801 |
16 | https://en.wikipedia.org/wiki/History_of_Arda | History of Arda | In J. R. R. Tolkien's legendarium, the history... | In J. R. R. Tolkien's legendarium, the history... | 605422 | 610 | Articles with short description, Fictional his... | 1123256525 | 5398 |
0 | https://en.wikipedia.org/wiki/History | History | History (from Ancient Greek ἱστορία (historía... | History (from Ancient Greek ἱστορία (historía... | 8628272 | 6617 | All Wikipedia articles written in American Eng... | 1124009190 | 6912 |
38 | https://en.wikipedia.org/wiki/History_of_Lambo... | History of Lamborghini | Automobili Lamborghini S.p.A. is an Italian br... | Automobili Lamborghini S.p.A. is an Italian br... | 960020 | 271 | All articles with unsourced statements, Articl... | 1108125867 | 6995 |
37 | https://en.wikipedia.org/wiki/History_of_marke... | History of marketing | The study of the history of marketing, as a di... | The study of the history of marketing, as a di... | 931211 | 785 | Articles with short description, EngvarB from ... | 1117474637 | 7300 |
21 | https://en.wikipedia.org/wiki/History_of_agric... | History of agriculture | Agriculture began independently in different p... | Agriculture began independently in different p... | 2516191 | 1977 | All articles with dead external links, Article... | 1123034506 | 7546 |
34 | https://en.wikipedia.org/wiki/History_of_Korea | History of Korea | The Lower Paleolithic era in the Korean Penins... | The Lower Paleolithic era in the Korean Penins... | 2743246 | 3470 | Accuracy disputes from December 2017, All accu... | 1124522331 | 9137 |
31 | https://en.wikipedia.org/wiki/History_of_Earth | History of Earth | The history of Earth concerns the development ... | The history of Earth concerns the development ... | 4602476 | 4317 | AC with 0 elements, All articles with unsource... | 1124371678 | 9435 |
23 | https://en.wikipedia.org/wiki/History_of_democ... | History of democracy | A democracy is a political system, or a system... | A democracy is a political system, or a system... | 1972822 | 2383 | All Wikipedia articles needing clarification, ... | 1122481158 | 10521 |
11 | https://en.wikipedia.org/wiki/History_of_banking | History of banking | The history of banking began with the first pr... | The history of banking began with the first pr... | 2333050 | 1619 | All articles with unsourced statements, All ar... | 1116409705 | 10707 |
17 | https://en.wikipedia.org/wiki/History_of_mathe... | History of mathematics | The history of mathematics deals with the orig... | The history of mathematics deals with the orig... | 3315598 | 4317 | All articles needing additional references, Al... | 1122788585 | 10765 |
19 | https://en.wikipedia.org/wiki/History_of_Hinduism | History of Hinduism | The history of Hinduism covers a wide variety ... | The history of Hinduism covers a wide variety ... | 1797592 | 2935 | All articles lacking reliable references, All ... | 1124512840 | 11124 |
2 | https://en.wikipedia.org/wiki/History_of_China | History of China | The earliest known written records of the hist... | The earliest known written records of the hist... | 6608376 | 7269 | All articles containing potentially dated stat... | 1124025086 | 11454 |
4 | https://en.wikipedia.org/wiki/History_of_Japan | History of Japan | The first human inhabitants of the Japanese ar... | The first human inhabitants of the Japanese ar... | 5031470 | 7603 | All Wikipedia articles written in American Eng... | 1123804351 | 11506 |
6 | https://en.wikipedia.org/wiki/History_of_Chris... | History of Christianity | The history of Christianity concerns the Chris... | The history of Christianity concerns the Chris... | 3762801 | 5801 | All articles with bare URLs for citations, All... | 1123072552 | 12372 |
20 | https://en.wikipedia.org/wiki/History_of_Ireland | History of Ireland | The first evidence of human presence in Irelan... | The first evidence of human presence in Irelan... | 2701679 | 3338 | All accuracy disputes, All articles with style... | 1123697361 | 12433 |
33 | https://en.wikipedia.org/wiki/History_of_Afgha... | History of Afghanistan | The history of Afghanistan as a state began in... | The history of Afghanistan as a state began in... | 2155957 | 2971 | All articles with dead external links, Article... | 1121424232 | 12743 |
39 | https://en.wikipedia.org/wiki/History_of_Pakistan | History of Pakistan | The history of the lands of Pakistan for the p... | The history of the lands of Pakistan for the p... | 1806158 | 5054 | All Wikipedia articles written in Pakistani En... | 1123802420 | 12999 |
12 | https://en.wikipedia.org/wiki/History_of_Rome | History of Rome | The history of Rome includes the history of th... | The history of Rome includes the history of th... | 2079609 | 2796 | All articles containing potentially dated stat... | 1123947372 | 14297 |
27 | https://en.wikipedia.org/wiki/History_of_medicine | History of medicine | The history of medicine is both a study of med... | The history of medicine is both a study of med... | 2008806 | 2819 | All articles containing potentially dated stat... | 1123053788 | 14577 |
5 | https://en.wikipedia.org/wiki/History_of_telev... | History of television | The concept of television was the work of many... | The concept of television was the work of many... | 5074930 | 3906 | All articles to be expanded, All articles with... | 1123620788 | 15321 |
18 | https://en.wikipedia.org/wiki/History_of_Bangl... | History of Bangladesh | Civilisational history of Bangladesh previousl... | Civilisational history of Bangladesh previousl... | 1835490 | 2666 | All articles lacking reliable references, All ... | 1123069979 | 16505 |
10 | https://en.wikipedia.org/wiki/History_of_Russia | History of Russia | The history of Russia begins with the historie... | The history of Russia begins with the historie... | 3289180 | 4944 | All articles with minor POV problems, All arti... | 1124023255 | 16810 |
3 | https://en.wikipedia.org/wiki/History_of_Islam | History of Islam | The history of Islam concerns the political, s... | The history of Islam concerns the political, s... | 5703155 | 6107 | All articles with unsourced statements, Articl... | 1123323843 | 17921 |
14 | https://en.wikipedia.org/wiki/History_of_Spain | History of Spain | The history of Spain dates to the pre-Roman pe... | The history of Spain dates to the pre-Roman pe... | 2355382 | 4119 | All accuracy disputes, All articles with dead ... | 1124373367 | 17955 |
44 | https://en.wikipedia.org/wiki/History_of_Ghana | History of Ghana | The Republic of Ghana is named after the medie... | The Republic of Ghana is named after the medie... | 1305934 | 1479 | All Wikipedia articles written in Ghanaian Eng... | 1123589131 | 18356 |
42 | https://en.wikipedia.org/wiki/History_of_Belgium | History of Belgium | The history of Belgium extends before the foun... | The history of Belgium extends before the foun... | 1119414 | 2301 | All articles covered by WikiProject Wikify, Al... | 1124232962 | 19192 |
8 | https://en.wikipedia.org/wiki/History_of_Poland | History of Poland | The history of Poland spans over a thousand ye... | The history of Poland spans over a thousand ye... | 1542336 | 8199 | All articles lacking reliable references, Arti... | 1115808344 | 19658 |
26 | https://en.wikipedia.org/wiki/History_of_Italy | History of Italy | The history of Italy covers the ancient period... | The history of Italy covers the ancient period... | 1824793 | 3518 | All articles with incomplete citations, Articl... | 1124364753 | 20851 |
32 | https://en.wikipedia.org/wiki/History_of_the_U... | History of the United States | The history of the lands that became the Unite... | The history of the lands that became the Unite... | 6916785 | 11165 | All Wikipedia articles needing clarification, ... | 1124535526 | 21447 |
1 | https://en.wikipedia.org/wiki/History_of_Europe | History of Europe | The history of Europe is traditionally divided... | The history of Europe is traditionally divided... | 2764829 | 4533 | All articles needing additional references, Al... | 1123723906 | 25151 |
13 | https://en.wikipedia.org/wiki/History_of_Germany | History of Germany | The concept of Germany as a distinct region in... | The concept of Germany as a distinct region in... | 2741092 | 5078 | All articles lacking reliable references, Arti... | 1124396390 | 32621 |
15 | https://en.wikipedia.org/wiki/History_of_Austr... | History of Australia | The history of Australia is the story of the l... | The history of Australia is the story of the l... | 3782171 | 5856 | AC with 0 elements, All Wikipedia articles wri... | 1123124583 | 35046 |
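The word-count column computed above can be tried out in isolation on a small frame. The following is a minimal sketch, assuming a hypothetical toy DataFrame named `toy` with made-up `title` and `text` values; it uses the vectorized `.str.len()` accessor in place of `.apply(len)`, which gives the same counts here.

```python
import pandas as pd

# Hypothetical toy data standing in for the Wikipedia articles (not from the real dataset)
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "text": ["one two three", "one", "one two"],
    }
)

# Split each text on whitespace, then count the resulting tokens per row
toy["text_length"] = toy["text"].str.split().str.len()

# sort_values returns a new, sorted DataFrame; `toy` itself keeps its row order
shortest_first = toy.sort_values("text_length")
titles_in_order = shortest_first["title"].tolist()

print(shortest_first)
```

Note that this counts whitespace-separated tokens, so punctuation attached to a word is counted as part of that word.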
import seaborn as sns
# Pairwise scatter plots of the three numeric columns, with a histogram per column on the diagonal
sns.pairplot(data=df[["text_length", "num_views", "num_edits"]])
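A pair plot shows the relationships visually; a correlation matrix can serve as a numerical companion to it. The following is a minimal sketch, assuming a small frame named `sample` built from five rows copied from the table above (End of history, History of cheese, History of pizza, History, History of Australia). Five rows are far too few to draw conclusions from, so the code illustrates only the call itself.

```python
import pandas as pd

# Five rows copied from the sorted table above; `sample` is a name chosen for this sketch
sample = pd.DataFrame(
    {
        "text_length": [814, 1453, 2640, 6912, 35046],
        "num_views": [499923, 653140, 4086441, 8628272, 3782171],
        "num_edits": [105, 632, 4146, 6617, 5856],
    }
)

# Pairwise Pearson correlation (the default method) between the three numeric columns
corr = sample.corr()

print(corr.round(2))
```

On the full dataset the same call would be `df[["text_length", "num_views", "num_edits"]].corr()`.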