DataQuest 3: Step 2 - Data Visualization (Exploratory & Stotytelling)

Posted on 10/10/2018, in Data Science, Python.

This note is used for my notes about the Data Scientist path on dataquest. I take this note after I have some basics on python with other notes, that’s why I just write down some new-for-me things.

Go back to Dataquest 2: Step 2 - Pandas and Numpy fundamentals.

See many other plot types.

In this post

Mission 142: Line Charts
Mission 143: Multiple plots
Mission 144: Bar Plots And Scatter Plots
Mission 145: Histograms And Box Plots
Mission 146: Guided Project: Visualizing Earnings Based On College Majors
Mission 147: Improving Plot Aesthetics
Mission 148: Color, Layout, and Annotations
Mission 149: Guided Project: Visualizing The Gender Gap In College Degrees
Mission 152: Conditional Plots (Titanic)
Mission 150: Visualizing Geographic Data
More tools

Mission 142: Line Charts

Download mission 142.

How the Government Measures Unemployment.

matplotlib ref.

import pandas as pd
import matplotlib.pyplot as plt

pd.to_datetime(<series>): converts strings inside a series to a datetime objects
- Extract month: unrate['DATE'].dt.month returns series containing int
plot nothing then default axus range is [-0.06, 0.06]
Plot by plt.plot(x,y) and the show it by plt.show()
Change, rotate,… ticks: plt.xticks(rotation=90)
Plot info: plt.xlabel(), plt.ylabel(), plt.title() (title of the plot)

Mission 143: Multiple plots

Download mission 143.

Add subplot (top-left to bottom-right) (multi plots in the same figure)

import matplotlib.pyplot as plt
fig = plt.figure()
ax1 = fig.add_subplot(2,1,1) # 2rows, 1cols, plot_number 1
ax2 = fig.add_subplot(2,1,2)
plt.show()

Use ax1.plt(x, y) as usual plt.plot

Change size of figure fig: plt.figure(figsize=(width_inch, height_inch))
Use many times plt.plot() to draw multi function on the same plot.
Use plt.plot(x, y, c="red") to choose a color for plot. See more here.
plt.plot(x, y, label="<label>") legend and then plt.legend(loc='upper left'), read more here.

Note: there is diff between

x_axis_48 = unrate.loc[0:12,["MONTH"]] # wrong
x_axis_48 = unrate[0:12]["MONTH"] # right

Mission 144: Bar Plots And Scatter Plots

Download mission 144.

fig, ax = plt.subplots()

Barplot positioning

# Positions of the left sides of the 5 bars. [0.75, 1.75, 2.75, 3.75, 4.75]
from numpy import arange
bar_positions = arange(5) + 0.75

# Heights of the bars.  In our case, the average rating for the first movie in the dataset.
num_cols = ['RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm', 'Fandango_Ratingvalue', 'Fandango_Stars']
bar_heights = norm_reviews[num_cols].iloc[0].values

ax.bar(bar_positions, bar_heights)

the width of each bar is set to 0.8 by default.

plt.subplot(): generate single subplot + returns both figure and axes

Bar plot example

num_cols = ['RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm', 'Fandango_Ratingvalue', 'Fandango_Stars']
bar_heights = norm_reviews[num_cols].iloc[0].values
bar_positions = arange(5) + 0.75
tick_positions = range(1,6)
width = 0.5;
fig, ax = plt.subplots()

ax.bar(bar_positions, bar_heights, width)
ax.set_xticks(tick_positions)
ax.set_xticklabels(num_cols, rotation=90)

ax.set_xlabel('Rating Source')
ax.set_ylabel('Average Rating')
ax.set_title('Average User Rating For Avengers: Age of Ultron (2015)')
plt.show()

ax.bar: verticle bar plot.
ax.barh(bar_positions, bar_widths): create a horizontal bar plot
- bar_positions : specify the y coordinate for the bottom sides
- bar_widths: like bar_heights
Note that: there is diff between pos of ax.barh or ax.bar before or after ax.set_xticks or ax.set_yticks. It should be before!
ax.set_xticklabels() takes in a list.

Scatter plot example

ax.scatter(x, y): scatter plot (plot with dot)

fig, ax = plt.subplots()
ax.scatter(norm_reviews["Fandango_Ratingvalue"], norm_reviews["RT_user_norm"])
ax.set_xlabel("Fandango")
ax.set_ylabel("Rotten Tomatoes")
plt.show()

ax.set_xlim(0, 5) and .set_ylim to set the data limits for both axes. We also zoom in the plot with these commands.
By manually setting the data limits ranges to specific ranges for all plots, we're ensuring that we can accurately compare.

Mission 145: Histograms And Box Plots

Download mission 145.

Using s.value_counts() to display the frequency distribution.
Don’t forget to s.sort_index() to sort the indexes.

Histogram example

Diff between histogram and bar plots
- hist describes continuous values while bar plots descibes discrete
- there is no space between each bin
- location of bars on x-axis matter in hist but not in bar plot.
```
# Either of these will work.
ax.hist(norm_reviews['Fandango_Ratingvalue'], 20, range=(0, 5))
ax.hist(norm_reviews['Fandango_Ratingvalue'], bins=20, range=(0, 5))
```

Boxplot intro

Boxplot illustration

quartile is diff from histogram because it divides the range of values into four regions where each region contains 1/4th of the total values.
- While histograms allow us to visually estimate the percentage of ratings that fall into a range of bins.
- To visualize quartiles, we need to use a box plot (box-and-whisker plot)
- quartile is a case of quantile which divides the range of values into many equal value regions.
```
ax.boxplot(norm_reviews["RT_user_norm"])
# multi boxplots
num_cols = ['RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm', 'Fandango_Ratingvalue', 'Fandango_Stars']
ax.boxplot(norm_reviews[num_cols].values)
```

Mission 146: Guided Project: Visualizing Earnings Based On College Majors

Ref solution of project 146.

Do students in more popular majors make more money? $\Rightarrow$ Using scatter plots
How many majors are predominantly male? Predominantly female? $\Rightarrow$ Using histograms
Which category of majors have the most students? $\Rightarrow$ Using bar plots
Jupyter’s magic % matplotlib inline so that plots are displayed inline
Use df.iloc[] to return the first row formatted as a table.
Use df.head() and DataFrame.tail() to become familiar with how the data is structured.
Use df.describe() to generate summary statistics for all of the numeric columns.
Number of rows : df.shape[0] or number of columns df.shape[1]

Pandas has it own a plot method, see here.

recent_grads.plot(x='Sample_size', y='Employed', kind='scatter')
# 'Sample_size' and 'Employed' are columns of recent_grads

We can use one-line code because of % matplotlib inline (of Jupyter)

We can use s.plot()

We can control the number of bins in series by s.hist

recent_grads['Sample_size'].plot(kind="hist", rot=40) # rot = rotation
recent_grads['Sample_size'].hist(bins=25, range=(0,5000))

scatter_matrix() can plot scatter and hist an the same time for 2 columns (ref)

  from pandas.plotting import scatter_matrix
  scatter_matrix(recent_grads[['Women', 'Men']], figsize=(10,10))

Boxplot intro

Normally, when using bar plot, we need to specify the locations, labels, lenghts and widths of the bars but pandas helps us do that

# first 5 values in 'Women' column
recent_grads[:5]['Women'].plot(kind='bar')

# or better with labels
recent_grads[:5].plot.bar(x='Major', y='Women')

See many other plot types.

Mission 147: Improving Plot Aesthetics

Download mission 147.

2 line plot in the same figure

fig, ax = plt.subplots()
ax.plot(women_degrees['Year'], women_degrees['Biology'], label='Women')
ax.plot(women_degrees['Year'], 100-women_degrees['Biology'], label='Men')

ax.tick_params(bottom="off", top="off", left="off", right="off")
ax.set_title('Percentage of Biology Degrees Awarded By Gender')
ax.legend(loc="upper right")

The data-ink ratio is the proportion of Ink that is used to present actual data compared to the total amount of ink (or pixels) used in the entire display. (Ratio of Data-Ink to non-Data-Ink). The goal is to design a display with the highest possible data-ink ratio.

Modify/Remove ticks on axes & remove axes spine:

ax.tick_params(bottom="off", top="off", left="off", right="off", labelbottom='off')
# Add your code here
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)

or shorter for spines

for key,spine in ax.spines.items():
  spine.set_visible(False)

Mission 148: Color, Layout, and Annotations

Download mission 148.

The Color Blind 10 palette contains ten colors that are colorblind friendly.

Matplotlib expects each value to be scaled down and to **range between 0 and 1 **(not 0 and 255 of RGB color).

cb_dark_blue = (0/255,107/255,164/255)
ax.plot(women_degrees['Year'], women_degrees['Biology'], label='Women', c=cb_dark_blue)

Line width: ax.plot(linewidth=2)
Annotation: ax.text(<x>, <y>, <text>)

Big figure fig and then inside there are multiple plot ax

fig = plt.figure(figsize=(18, 3))
ax = fig.add_subplot(1,6,sp+1)

Mission 149: Guided Project: Visualizing The Gender Gap In College Degrees

Ref solution of project 149.

ax.set_yticks([0,100]): just keep 0 and 100 on the y ticks, or xticks
transparency: ax.plot(c="red", alpha=0.3)
Horizontal line in the figure: ax.axhline(<startpoing>)
Note that, all coordinate in the plot is the coordinate of the axes (depends on the data)

Save the plots (must be used before plt.show())

plt.plot(women_degrees['Year'], women_degrees['Biology'])
plt.savefig('biology_degrees.png')
plt.show()

Mission 152: Conditional Plots (Titanic)

Download mission 152.

What you need to know for now is that the resulting line is a smoother version of the histogram, called a kernel density plot. Kernel density plots are especially helpful when we’re comparing distributions.

Plot kernel density and histogram together.

import seaborn as sns
import matplotlib.pyplot as plt
sns.distplot(titanic["Fare"])
plt.show()

Plot only the kernel density plot
```
sns.kdeplot(titanic["Age"])
```
Style of seaborn (default is darkgrid),
Remove axes spines in seaborn: sns.despine(), default, it removes top right, if one wants to remove bottom left, use sns.despine(left=True, bottom=True)
```
sns.set_style("white")
sns.kdeplot(titanic["Age"], shade=True)
sns.despine(bottom=True, left=True)
plt.xlabel("Age")
plt.show()
```

Small multiple plots (automaticallt with seaborn)

# Condition on unique values of the "Survived" column.
g = sns.FacetGrid(titanic, col="Survived", size=6)
# For each subset of values, generate a kernel density plot of the "Age" columns.
g.map(sns.kdeplot, "Age", shade=True)

“facet” likes “subset”
col="Survived", number of subplots depend on the number of unique values in this column
size=6, 6 inch height for each subplot.

Two conditions (col for condition “Survived”, row for condition “Pclass”)

g = sns.FacetGrid(titanic, col="Survived", row="Pclass")
g.map(sns.kdeplot, "Age", shade=True)
sns.despine(left=True, bottom=True)
plt.show()

Three conditions (use hue) (overlap plot with the previous ones)

g = sns.FacetGrid(titanic, col="Survived", row="Pclass", hue="Sex")
g.map(sns.kdeplot, "Age", shade=True)
g.add_legend() # add legend for seaborn sns
sns.despine(left=True, bottom=True)
plt.show()

3 conditions

Mission 150: Visualizing Geographic Data

Download mission 150.

basemap toolkit: Basemap is an extension to Matplotlib that makes it easier to work with geographic data.
```
# install
conda install basemap
# or
conda install -c conda-forge basemap
```
You’ll want to import matplotlib.pyplot into your environment when you use Basemap

Create a new basemap instance

import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
m = Basemap(projection='merc',llcrnrlat=-80,urcrnrlat=80,llcrnrlon=-180,urcrnrlon=180)

Convert the longitude values from spherical to Cartesian

longitudes = airports["longitude"].tolist()
latitudes = airports["latitude"].tolist()
x, y = m(longitudes, latitudes)

A scatter plot in basemap: m.scatter() with the same parameters as in plt.scatter().

m = Basemap(projection='merc', llcrnrlat=-80, urcrnrlat=80, llcrnrlon=-180, urcrnrlon=180)
longitudes = airports["longitude"].tolist()
latitudes = airports["latitude"].tolist()
x, y = m(longitudes, latitudes)
m.scatter(x, y, s=1) # size of dot is 1
m.drawcoastlines() # show the coastlines
plt.show()

A great circle between 2 points on a map. In 2D, it’s a line.

fig, ax = plt.subplots(figsize=(15,20))
m = Basemap(projection='merc', llcrnrlat=-80, urcrnrlat=80, llcrnrlon=-180, urcrnrlon=180)
m.drawcoastlines()

def create_great_circles(df):
    for idx, s in df.iterrows():
        if abs(s["end_lat"] - s["start_lat"] < 180) and abs(s["end_lon"] - s["start_lon"] < 180):
            m.drawgreatcircle(s["start_lon"], s["start_lat"], s["end_lon"], s["end_lat"])
            
dfw = geo_routes.loc[geo_routes["source"]=="DFM"]
create_great_circles(dfw)
plt.show()

More tools

Creating 3D plots using Plotly
Creating interactive visualizations using bokeh
Creating interactive map visualizations using folium

Go to Dataquest 4: Step 2 - Data Cleaning.