6. Representing data as tables¶
In this lecture we demonstrate:
- how to represent tables using lists;
- how to represent tables using a more efficient model provided by the pandas library;
- how to visualize data represented as tables; and
- how to read tables from local files and remote resources.
6.1. Representing tables by lists¶
The most efficient way to organize large amounts of data is to put them in a table. For example, this table contains some basic facts about a group of students:
Name | Sex | Age (yrs) | Weight (kg) | Height (cm) |
---|---|---|---|---|
Anne | f | 13 | 46 | 160 |
Ben | m | 14 | 52 | 165 |
Colin | m | 13 | 47 | 157 |
Diana | f | 15 | 54 | 165 |
Ethan | m | 15 | 56 | 163 |
Fred | m | 13 | 45 | 159 |
Gloria | f | 14 | 49 | 161 |
Hellen | f | 15 | 52 | 164 |
Ian | m | 15 | 57 | 167 |
Jane | f | 13 | 45 | 158 |
Kate | f | 14 | 51 | 162 |
We shall represent each row in this table as a list, and then put all the these lists into a new list:
students = [["Anne", "f", 13, 46, 160],
["Ben", "m", 14, 52, 165],
["Colin", "m", 13, 47, 157],
["Diana", "f", 15, 54, 165],
["Ethan", "m", 15, 56, 163],
["Fred", "m", 13, 45, 159],
["Gloria", "f", 14, 49, 161],
["Hellen", "f", 15, 52, 164],
["Ian", "m", 15, 57, 167],
["Jane", "f", 13, 45, 158],
["Kate", "f", 14, 51, 162]]
We can now easily retrieve the information about any student in the group. For example, to see the information about Ethan we simply type students[4]
(Important! The first element of the list in indexed by 0, so Ethan's list has index 4):
students[4]
On the other hand, this way of representing tables in inconvenient because it is not easy to work with the columns of the table. For example, if we wanted to compute the average height of the students in the group we would have to write a program:
sum = 0
for s in students:
sum += s[4]
float(sum) / len(students)
This how the code does the job:
- we first initialize the variable
sum
to 0 (eventually it will contain the sum of the heights of the students); - the code then loops through each student in the group and adds its height to
sum
(the height of a student is the numer at the index 4 of the list); - finally, we compute the average.
This is not hard but also not very convenient. We need a more flexible data structure.
6.2. pandas, DataFrame and working with the columns of the table¶
The pandas library has been developed with efficient manupulation of tables in mind. Let's import the library and give it a nickname:
import pandas as pd
This library provides a powerful data structure called DataFrame. Let us start by converting the data on the group of students into a DataFrame:
students_df = pd.DataFrame(students)
This command simply rearranges the data from the list students
into a DataFrame. Let's see what a DataFrame looks like:
students_df
All the information is there, arranged in a neat table. To make the table more readable we shall name the columns of the table using the option columns
:
students_df.columns=["Name", "Sex", "Age", "Weight", "Height"]
students_df
Now that each column has its name we can access the individual colums like this:
students_df["Name"]
students_df["Height"]
You can always get the name of all the columns in the table:
students_df.columns
The following functions make it easy for us to compute the elementary statistics on columns of a DataFrame:
sum
-- the sum of the elements in a column;mean
-- the average value (arithmetic mean) of the elements in a column;median
-- the median value of the elements in a column;min
-- the minimal value of the elements in a column;max
-- the maximal value of the elements in a column.
For example, the height of the shortest student is:
students_df["Height"].min()
The age of the oldest student:
students_df["Age"].max()
The average height of students:
students_df["Height"].mean()
The medial height:
students_df["Height"].median()
Can the entire group fit in an elevator which holds at most 600 kg?
if students_df["Weight"].sum() <= 600:
print("Yes")
else:
print("No")
6.3. Visualizing data represented by DataFrames¶
This is easy: just select the columns you would like to visualize. To see a few examples we start by loading our favourite library:
import matplotlib.pyplot as plt
To visualize the height of students in the group we simply plot a bar chart which displays names and heights of students:
plt.figure(figsize=(10,5))
plt.bar(students_df["Name"], students_df["Height"])
plt.title("The height of students")
plt.show()
plt.close()
We can also visualize the age and weight of the students:
plt.figure(figsize=(10,5))
plt.bar(students_df["Name"], students_df["Weight"], color="r", label="Weight")
plt.bar(students_df["Name"], students_df["Age"], color="g", label="Age")
plt.title("Age and weight of the students")
plt.legend()
plt.show()
plt.close()
6.4. Loading tables from local CSV files¶
The most time-consumig and the least appealing job in data analysis is turning raw data into tables. You may not have noticed that because the tables we have been using for demonstration were small, but in real life situations we usually get huge tables with thousands of rows and columns. (Have you ever heard of a part of data science called Big Data?)
Today we use machines to collected data (automatic meteorogical stations, computerized polling stations etc). The machines then package the data they have collected into some very simple format that we can later load and process. CSV stands for "comma separated values" and is one of the simplest (and most popular) formats to represent tables. CSV files are text files whose rows correspond to the rows of the table, and the values within the same row are separated by commas ("comma separated values").
For example, there is a file Top 25 YouTubers.csv in folder data that looks like this:
RANK,GRADE,NAME,VIDEOS,SUBSCRIBERS,VIEWES
1,A++,T-Series,13629,105783888,76945588449
2,A,PewDiePie,3898,97853589,22298927681
3,A+,5-Minute Crafts,3341,58629572,14860695079
4,A++,Cocomelon - Nursery Rhymes,441,53163816,33519273951
...
25,A,TheEllenShow,10542,33362512,16519572219
This table contains information about top 25 YouTube channels according to the number of subscribers on July 1st, 2019. The first row of the file is the header row which tells us that the table has six columns (RANK, GRADE, NAME, VIDEOS, SUBSCRIBERS, VIEWES). The row
4,A++,Cocomelon - Nursery Rhymes,441,53163816,33519273951
means that the 4th most popular YouTube channel (on July 1st, 2019) has YouTube rank A++, its name is "Cocomelon - Nursery Rhymes", has published 441 videos on YouTube, has 53,163,816 subscribers and 33,519,273,951 views.
The function read_csv
from pandas
reads a CSV file and turns it into a DataFrame. For example:
import pandas as pd
Top25 = pd.read_csv("data/Top 25 YouTubers.csv")
We can peek into the table using the function head(N)
which displays the first N rows of a table. If we invoke the function without the argument it displays the first five rows:
Top25.head()
Top25.head(10)
The function tail(N)
displays the last N rows of the table (or only 5 if no arguemtn is provided):
Top25.tail()
Top25.tail(7)
Let us visualize the number of subsrcibers of the top 25 YouTube channels:
plt.figure(figsize=(15,10))
plt.bar(Top25["NAME"], Top25["SUBSCRIBERS"])
plt.title("Top 25 YouTube channels according to the number of subscribers (July 1st, 2019)")
plt.show()
plt.close()
It's a mess because the names of the channels are too long. In these situations using barh
instead of bar
can produce better results:
plt.figure(figsize=(10,10))
plt.barh(Top25["NAME"], Top25["SUBSCRIBERS"])
plt.title("Top 25 YouTube channels according to the number of subscribers (July 1st, 2019)")
plt.show()
plt.close()
6.5. Loading tables from remote resources¶
You can also load data from a remote resource directly, without the need to download the the data to your machine. In order to load a table from a remote resource you need an Internet connection (of course!) and you have to know the exact location of the table you want to load. The exact location of anu Internet resouce is provided by its URL (which stands for Universal Resource Locator).
The "Internet address"
https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv
contains a publicly available list of all the countries in the world. We can read this table directly using read_csv
:
countries = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv")
countries.head(10)
The tables available on the Internet are usually not conveniently pre-packaged into a CSV file. They are most often embedded into a web page and surounded by some extra HTML code (HTML stands for "hyper-text markup language" and denotes the language all the web-pages on Earth are coded into).
Fortunately, the function read_html
can take a web-page and extract a table from it. For example, the following code reads the list of all the US states into a DataFrame directly from a Wikipedia web-page:
US = pd.read_html("https://simple.wikipedia.org/wiki/List_of_U.S._states", header=[0,1])[0]
Web pages are complex and can contain several tables. This is why read_html
returns a list of tables. In this case the table we need is the first in the list, hence the indexing [0]
. The option header=[0,1]
means that the first two rows of the table make up the header. Let's peek:
US.head()
6.6. Exercises¶
Exercise 1. Look at the code and answer the questions below:
import pandas as pd
import matplotlib.pyplot as plt
students = [["Anne", "f", 13, 46, 160],
["Ben", "m", 14, 52, 165],
["Colin", "m", 13, 47, 157],
["Diana", "f", 15, 54, 165],
["Ethan", "m", 15, 56, 163],
["Fred", "m", 13, 45, 159]]
table = pd.DataFrame(students)
table.columns=["Name", "Sex", "Age", "Weight", "Height"]
print(table["Height"].min(), table["Height"].mean(), table["Height"].median())
plt.figure(figsize=(10,5))
plt.bar(table["Name"], table["Height"])
plt.show()
plt.close()
temp_anomalies = pd.read_csv("data/TempAnomalies.csv", header=None)
countries = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv")
US = pd.read_html("https://simple.wikipedia.org/wiki/List_of_U.S._states", header=[0,1])[0]
- Why do we have to import two libraries?
- Why do we represent the tables by DataFrames, and not lists?
- What are the names of the columns in the DataFrame
table
? - What is the difference between
min
,mean
andmedian
? - What do you have to change to make the chart depict the weight of students?
- How would you load the file
TempAnomalies.csv
if it had a header row? - Is the DataFrame
countries
created from the information stored locally on your computer? - What is the format of the table the DataFrame
US
was created from? - What happens if you type
https://simple.wikipedia.org/wiki/List_of_U.S._states
into the browser?
Exercise 2. The table below depicts the production of corn a wheat in an agricultural region of Europe (the numbers are in 1000 tons):
Year | Corn | Wheat |
---|---|---|
2008 | 6,158 | 2,095 |
2009 | 6,396 | 2,067 |
2010 | 7,207 | 1,631 |
2011 | 6,480 | 2,076 |
2012 | 3,532 | 1,911 |
(a) Turn this table into a list and create a DataFrame from the list. Then compute the maximal yield of corn and the minimal yield of wheat.
(b) Visualize the data using a bar chart.
(c) Compute the average yield of corn; then compute the difference between the maximal and the minimal yield of wheat.
Exercise 3. This is an overview of spendings of a family over a year (in the local currency):
Item | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Rent | 8,251 | 8,436 | 8,524 | 8,388 | 8,241 | 8,196 | 8,004 | 7,996 | 7,991 | 8,015 | 8,353 | 8,456 |
Electricity | 4,321 | 4,530 | 4,115 | 3,990 | 3,985 | 3,726 | 3,351 | 3,289 | 3,295 | 3,485 | 3,826 | 3,834 |
Phone (landline) | 1,425 | 1,538 | 1,623 | 1,489 | 1,521 | 1,485 | 1,491 | 1,399 | 1,467 | 1,531 | 1,410 | 1,385 |
Phone (cell) | 2,181 | 2,235 | 2,073 | 1,951 | 1,989 | 1,945 | 3,017 | 2,638 | 2,171 | 1,831 | 1,926 | 1,833 |
TV and Internet | 2,399 | 2,399 | 2,399 | 2,399 | 2,399 | 2,399 | 2,399 | 2,399 | 2,399 | 2,399 | 2,399 | 2,399 |
Transport | 1,830 | 1,830 | 1,830 | 1,830 | 1,950 | 1,950 | 1,450 | 1,450 | 1,950 | 1,950 | 2,050 | 2,050 |
Food | 23,250 | 23,780 | 24,019 | 24,117 | 24,389 | 24,571 | 24,736 | 24,951 | 25,111 | 25,389 | 25,531 | 25,923 |
Rest | 4,500 | 3,700 | 5,100 | 3,500 | 2,750 | 4,250 | 7,320 | 8,250 | 3,270 | 4,290 | 3,200 | 8,390 |
This table represented as a list looks like this:
spendings = [
["Rent", 8251, 8436, 8524, 8388, 8241, 8196, 8004, 7996, 7991, 8015, 8353, 8456],
["Electricity", 4321, 4530, 4115, 3990, 3985, 3726, 3351, 3289, 3295, 3485, 3826, 3834],
["Landline", 1425, 1538, 1623, 1489, 1521, 1485, 1491, 1399, 1467, 1531, 1410, 1385],
["Cell", 2181, 2235, 2073, 1951, 1989, 1945, 3017, 2638, 2171, 1831, 1926, 1833],
["TV and Internet", 2399, 2399, 2399, 2399, 2399, 2399, 2399, 2399, 2399, 2399, 2399, 2399 ],
["Transport", 1830, 1830, 1830, 1830, 1950, 1950, 1450, 1450, 1950, 1950, 2050, 2050],
["Food", 23250, 23780, 24019, 24117, 24389, 24571, 24736, 24951, 25111, 25389, 25531, 25923],
["Rest", 4500, 3700, 5100, 3500, 2750, 4250, 7320, 8250, 3270, 4290, 3200, 8390]
]
(a) Turn this into a DataFrame.
(b) Compute the total spendings per month (how much in January, how much in February, etc).
Exercise 4. The file data/LEB.csv contains the overall Life Expectancy at Birth of humans computed in five-year periods.
(a) Load this file into a DataFrame.
(b) Visualize the data.
Exercise 5. At a PE lecture the students were practising long jumps. Each student had three tries and the data is collected in LongJump.csv available in the folder data. The first row of the file is the header.
(a) Load this file into a DataFrame.
(b) Visualize the data by a bar chart. (Hint: Be creative.)
Exercise 6. At the following URL
https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv
you can find a publicly available table with the list of countries in the world and some parameters of economic developemnt of the states summarized in five-year intervals. The table has the following columns:
- country
- year
- pop (= population)
- continent
- lifeExp (= life expextancy; how many years, on average, is a citizen of that country expected to live)
- gdpPercap (= GDP per capitem; how much money, on average is US dollars, do citizens of that country earn)
Load this table into a DataFrame and display the first 20 rows, as well as the last 10 rows of the table.
Exercise 7. Load the table from the following URL
https://www.worldometers.info/world-population/population-by-country/
into a DataFrame and display the first few rows of the table.