DataQuest 2: Step 2 - Pandas and NumPy fundamentals
Posted on 25/09/2018, in Data Science, Python. This post contains my notes on the Data Scientist path on DataQuest. I took these notes after already having some Python basics from other notes, which is why I only write down the things that are new to me.
Mission 289 - Introduction to NumPy
- `import numpy as np`: start using NumPy
- `<np.ndarray> = np.array(<list-of-lists>)`
- `<np.ndarray>.shape` gives the dimensions of an array as a `tuple` (like a `list` but immutable), e.g. `(2, 3)`
- NumPy displays `...` to summarize an array with a lot of data
- `ndarray[row, column]` or `ndarray[row]`
- Selecting:

| | list of lists | ndarray | note |
| --- | --- | --- | --- |
| an element | `list[1][3]` | `ndarray[1, 3]` | |
| rows | `list[:3]` | `ndarray[:3]` | the same |
| columns | `[row[4] for row in list]` | `ndarray[:, 4]` | |
| multiple columns | `[[row[1], row[3]] for row in a]` | `ndarray[:, [1, 3]]` | |

- `%timeit -r 1 -n 1 python_subset()`: see the time for a single run
- NumPy is about 30 times quicker than a list of lists
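The equivalences in the table above can be checked on a tiny made-up array (the names `data`, `nd`, and the values are just illustrative):

```python
import numpy as np

# Hypothetical small dataset as a list of lists
data = [[1, 2, 3, 4, 5],
        [6, 7, 8, 9, 10],
        [11, 12, 13, 14, 15]]
nd = np.array(data)

# Same selections, both ways
element_list = data[1][3]            # 9
element_nd = nd[1, 3]                # 9
col_list = [row[4] for row in data]  # [5, 10, 15]
col_nd = nd[:, 4]                    # array([ 5, 10, 15])
multi_nd = nd[:, [1, 3]]             # columns 1 and 3, shape (3, 2)
```

The ndarray versions stay much shorter (and faster) as soon as more than one index axis is involved.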
- Vectorized operators (elementwise): `+`, `-`, `*`, `/`, `%` (remainder when a is divided by b), `**` (power), `//` (floor division, rounding down to the nearest integer)
- An alternative for `/` is `np.divide(<ndarray1>, <ndarray2>)`
- Max/Min/Mean/Median/Sum:
  - Whole array: `<ndarray>.min()` or `np.min(<ndarray>)`. The same for `max`, `mean`, `median`, `sum`.
  - Each row: `<ndarray>.max(axis=1)`
  - Each column: `<ndarray>.max(axis=0)`
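A small runnable sketch of the elementwise operators and the `axis` behaviour (the arrays here are made up):

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
b = np.array([[10, 20, 30],
              [40, 50, 60]])

summed = a + b           # elementwise: [[11 22 33] [44 55 66]]
floored = b // a         # elementwise floor division

total = a.sum()          # whole array: 21
row_max = a.max(axis=1)  # one value per row: [3 6]
col_max = a.max(axis=0)  # one value per column: [4 5 6]
```

A useful mnemonic: `axis=1` collapses the columns (leaving one value per row), `axis=0` collapses the rows (leaving one value per column).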
- Functions vs methods:
  - Functions act as stand-alone segments of code that usually take an input
  - Methods are special functions that belong to a specific type of object
- `np.expand_dims(<ndarray>, axis=...)`: adds a dimension

```python
zeros = np.array([0, 0, 0])               # zeros.shape is (3,)
zeros_2d = np.expand_dims(zeros, axis=0)  # zeros_2d.shape is (1, 3)
# axis=0 w.r.t. row
# axis=1 w.r.t. column
```
- `np.concatenate([<ndarray1>, <ndarray2>], axis=0)`: combines 2 ndarrays

```python
ones = np.array([[1, 1, 1],
                 [1, 1, 1]])              # shape is (2, 3)
zeros = np.array([0, 0, 0])               # 1-D
zeros_2d = np.expand_dims(zeros, axis=0)  # to (1, 3)
combined = np.concatenate([ones, zeros_2d], axis=0)
# result:
# [[1 1 1]
#  [1 1 1]
#  [0 0 0]]
```
- Sort:
  - `np.argsort(<ndarray>)` gives the indices that would sort `<ndarray>` (the sort follows the elements, but the result is given as indices):

```python
fruit = np.array(['o', 'b', 'a', 'g', 'c'])
sorted_order = np.argsort(fruit)  # gives [2, 1, 4, 3, 0]
fruit[sorted_order]               # gives np.array(['a', 'b', 'c', 'g', 'o'])
```

  - NumPy only supports sorting in ascending order
  - Can sort by one column of an array and then apply the order to the whole array with `ndarray[sorted_order]`
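Sorting a whole 2-D array by one column, as described above, looks like this on a made-up array:

```python
import numpy as np

# Hypothetical 2-D array: sort all rows by the values in column 0
data = np.array([[3, 30],
                 [1, 10],
                 [2, 20]])
sorted_order = np.argsort(data[:, 0])  # [1, 2, 0]
sorted_data = data[sorted_order]       # rows reordered by column 0
# sorted_data is [[1 10] [2 20] [3 30]]
```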
Mission 290 - Boolean Indexing with NumPy
- `taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',', skip_header=1)`: reads a text file into a NumPy ndarray
- `<ndarray>.dtype`: see the internal datatype that has been used
- `taxi = taxi[1:]` removes the first row
- `np.array([2, 4, 6, 8]) < 5` gives a boolean array
- `ndarray[<bool array>]` gives a new ndarray containing only the rows/columns at the `True` indexes of the boolean array (the `False` indexes are dropped):

```python
# ndarray is 4x3
bool1 = [True, False, True, True]
bool2 = [True, False, True]
ndarray[bool1]          # takes the rows
ndarray[:, bool2]       # takes the columns
c[c[:, 1] > 2, 1] = 99  # assignment through a boolean mask
```

- `ndarray.shape` gives a tuple `(#rows, #columns)`; `ndarray.shape[0]` gives the number of rows
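A runnable sketch of the boolean indexing and masked assignment above (values are made up):

```python
import numpy as np

c = np.array([[1, 5, 2],
              [3, 1, 4],
              [6, 7, 8]])

mask = c[:, 1] > 2      # boolean array: [ True False  True]
rows = c[mask]          # keeps rows 0 and 2 only
c[c[:, 1] > 2, 1] = 99  # set column 1 to 99 wherever it was > 2
# c is now [[1 99 2] [3 1 4] [6 99 8]]
```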
Mission 291 - Introduction to Pandas
- `import pandas as pd`
- `f500 = pd.read_csv("<file-name>", index_col=0)`
- `type(f500)` gives `pandas.core.frame.DataFrame`
- `f500.shape` gives a tuple `(500, 16)`
- Series is the pandas type for one-dimensional objects (columns; the first column you see when printing a pandas object is always its index labels)
- 1D pd object $\Rightarrow$ series
- 2D pd object $\Rightarrow$ dataframe
- Unlike NumPy, pandas does not use the same type for 1D and 2D arrays.
- `f500.dtypes` gives the type of each column
- `f500.head()` returns the first 5 rows; `f500.head(10)` returns the first 10 rows
- `f500.tail()` returns the last rows
- Select a single element: `df.loc["Sinopec Group", "revenues"]`
- Select rows:
  - Single: `df.loc["a"]`
  - List: `df.loc[["a", "b"]]`
  - Slice: `df["a":"c"]` or `df.loc["a":"c"]`
- Select columns:
  - Single: `df["a"]` or `df.loc[:, "a"]` (both give a series; `df[["a"]]` gives a dataframe)
  - List: `df[["a", "b"]]` or `df.loc[:, ["a", "b"]]`
  - Slice: `df.loc[:, "a":"c"]`
- Select rows in a series object:
  - Single: `s["a"]` or `s.loc["a"]`
  - List: `s[["a", "b"]]` or `s.loc[["a", "b"]]`
  - Slice: `s["a":"c"]` or `s.loc["a":"c"]`
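All of these selections can be tried on a tiny made-up dataframe (names and values are illustrative):

```python
import pandas as pd

# Hypothetical dataframe with string row labels
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]},
                  index=["a", "b", "c"])

one_row = df.loc["a"]           # series (one row)
some_rows = df.loc[["a", "b"]]  # dataframe
row_slice = df.loc["a":"c"]     # label slices include the endpoint: 3 rows
one_col = df["x"]               # series (one column)
some_cols = df[["x", "y"]]      # dataframe
element = df.loc["b", "y"]      # 5
```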
- `s.describe()`: gives count, mean, std, min, 25%, 50%, 75%, max for a numeric series (other types get other statistics); supports method chaining:

```python
revs = f500["revenues"]
print(revs.describe())
```

- There is also `dataframe.describe()`:
  - for all columns: `all_desc = f500.describe(include='all')`
  - for only object (string) columns: `print(f500.describe(include=['O']))`
- Methods: `.max()`, `.min()`, `.mean()`, `.median()`, `.mode()`, `.sum()` apply to both series and dataframes.
  - `df.<method>(axis=0)` or `df.<method>(axis="index")` calculates along the row axis (the default)
  - `df.<method>(axis=1)` or `df.<method>(axis="columns")` calculates along the column axis

```python
medians = f500[["revenues", "profits"]].median(axis=0)
# we could also use .median(axis="index")
# or omit axis=0 because it's the default
```
- `s.value_counts()`: displays each unique non-null value from a series, with a count of the number of times that value appears
  - `s.value_counts(dropna=True)` excludes null values (the default)
  - `s.value_counts(dropna=False)` also includes the nulls
  - `s.value_counts(normalize=True)` gives proportions (fractions of the total) instead of counts
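The three `value_counts` variants on a small made-up series with one missing value:

```python
import pandas as pd
import numpy as np

s = pd.Series(["USA", "China", "USA", np.nan, "USA"])

counts = s.value_counts()                 # nulls excluded: USA 3, China 1
with_null = s.value_counts(dropna=False)  # also counts the NaN
shares = s.value_counts(normalize=True)   # USA 0.75, China 0.25
```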
- Top 5 most common values of a column:

```python
top5_countries = f500["country"].value_counts().head()
```

- `df.max(numeric_only=True)` only considers the numeric columns
- We can assign values in pandas using the selection tools above (like in NumPy):

```python
top5_rank_revenue["revenues"] = 0                         # whole column set to 0
top5_rank_revenue.loc["Sinopec Group", "revenues"] = 999  # one item
```
- Boolean indexing:

```python
s_bool = df["<column>"] == 8  # gives a boolean series
result = df[s_bool]           # apply to the whole df, DON'T use .loc
result_name = df.loc[s_bool, "<column>"]  # to also pick a column, USE .loc[]
```

- Coupling boolean indexing with assignment (selecting the column to change):

```python
f500.loc[f500["sector"] == "abc", "sector"] = "ABC"
```

- `np.nan` = NaN
Mission 292 - Exploring Data with pandas
```python
import pandas as pd
import numpy as np
```

- pandas uses NumPy objects behind the scenes to store the data.
- `.loc[]` vs `.iloc[]`:
  - loc: label-based selection
  - iloc: integer-position-based selection
- Select:
  - an element: `df.iloc[2, 0]`
  - a row: `df.iloc[1]`
  - a column: `df.iloc[:, 1]`
  - slices: `df.iloc[1:5]`, `df.iloc[[1, 3], 1:5]`
  - a list: `df.iloc[:, [1, 2, 5]]`
- Slicing:
  - with `.loc[]`, the end of the slice is included
  - with `.iloc[]`, the end of the slice is not included
- The same holds for series, but don't forget that a series is 1-D
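The slice-endpoint difference between `.loc[]` and `.iloc[]` in a runnable form (dataframe is made up):

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_label = df.loc["a":"c"]  # endpoint "c" IS included -> 3 rows
by_pos = df.iloc[0:2]       # endpoint 2 is NOT included -> 2 rows
```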
- Import data (cf.):

```python
# we want to use the 1st column as the row labels
f500 = pd.read_csv("f500.csv", index_col=0)
# remove the index name (the text in the first line, first column)
f500.index.name = None
```
- Sort the rows of `f500` by the column `employees` (this returns another df and does not change the df itself):

```python
sorted_emp = f500.sort_values(by=["employees"], ascending=False)
```

- `s.str.contains("<str>")`: checks whether each value contains `<str>`; use `s.str.contains("<str>", regex=False)` to treat `<str>` as a plain string
- `s.str.endswith("<str>")`: checks whether each value ends with `<str>`; likewise `s.str.startswith("<str>")`
- `s.isnull()` or `s.notnull()`: check whether `s` contains NaN/null values:

```python
rev_change_null = f500[f500["revenue_change"].isnull()]
print(rev_change_null[["company", "country", "sector"]])
```
- Important points about selecting:

```python
previously_ranked["rank", "revenues"]           # error: no such single column
previously_ranked[["rank", "revenues"]]         # two columns (dataframe)
previously_ranked.loc["rank", "revenues"]       # row "rank", column "revenues" (single element)
previously_ranked.loc[:, ["rank", "revenues"]]  # two columns (dataframe)
previously_ranked.loc["rank"]                   # the row labelled "rank" (series)
previously_ranked["rank"]                       # the column "rank" (series)
previously_ranked["rank"] - previously_ranked["prev_rank"]  # elementwise difference of two columns (series)
```
- Using boolean operators:

```python
over_265 = f500_sel["revenues"] > 265000
china = f500_sel["country"] == "China"
combined = over_265 & china
final_cols = ["company", "revenues"]
result = f500_sel.loc[combined, final_cols]
```

or in just one line:

```python
result = f500_sel.loc[(f500_sel["revenues"] > 265000) & (f500_sel["country"] == "China"), final_cols]
```

- Comparison operators: `==` (equal), `!=` or `~(a == b)` (not equal)
- Pandas index alignment: if a dataframe `food` and a series `colors` have the same index (but in a different order), they can be coupled with `food["color"] = colors`
  - Any items whose index doesn't match the dataframe are discarded
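Index alignment in a runnable form (the `food`/`colors` data here is made up to match the description):

```python
import pandas as pd

food = pd.DataFrame({"kind": ["fruit", "fruit", "veg"]},
                    index=["apple", "banana", "carrot"])
# Same labels in a different order, plus one label not in the dataframe
colors = pd.Series({"carrot": "orange", "apple": "red",
                    "banana": "yellow", "kiwi": "green"})

food["color"] = colors  # values matched by index label; "kiwi" is discarded
```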
- Loops in a df: looping over a dataframe returns the column index labels, rather than the rows as we might expect.
- `s.unique()` returns an array of the unique values from any series
- Find the average revenue for each unique country in `f500`:

```python
# Create an empty dictionary to store the results
avg_rev_by_country = {}

# Create an array of unique countries
countries = f500["country"].unique()

# Use a for loop to iterate over the countries
for c in countries:
    # Use boolean comparison to select only rows that
    # correspond to a specific country
    selected_rows = f500[f500["country"] == c]
    # Calculate the mean revenue for just those rows
    mean = selected_rows["revenues"].mean()
    # Assign the mean value to the dictionary, using the
    # country name as the key
    avg_rev_by_country[c] = mean
```
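As a common alternative (not covered in the mission itself), `groupby` computes the same per-country averages without an explicit loop; the tiny `f500` stand-in below is made up:

```python
import pandas as pd

# Hypothetical stand-in for the f500 data
f500 = pd.DataFrame({
    "country": ["USA", "China", "USA", "China"],
    "revenues": [100.0, 200.0, 300.0, 400.0],
})

# Loop version from the notes
avg_rev_by_country = {}
for c in f500["country"].unique():
    avg_rev_by_country[c] = f500[f500["country"] == c]["revenues"].mean()

# groupby gives the same result in one line
avg_by_groupby = f500.groupby("country")["revenues"].mean()
```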
Mission 294 - Guided Project: Exploring Ebay Car Sales Data
- `s.sort_index(ascending=False)`: sorts a series by its index
- `s.sort_values()`: sorts a series by its values
- See the first 10 characters of each row of a column: `autos["ad_created"].str[:10]`
- `df[(df["col"] >= x) & (df["col"] <= y)]` = `df[df["col"].between(x, y)]` to take the values of column "col" between `x` and `y` (`between` includes both endpoints by default)
- `s.value_counts().sort_index(ascending=True)`: sort the counts of `s` by index for easier examination
- `autos = autos[autos["price"].between(1, 351000)]`: only keep values within a range
- Element-wise logical/comparison operators: `&` (and), `|` (or), `~` (not), `==` (equal)
- Number of currently remaining rows: `autos.shape[0]`
- Find what fraction of rows falls outside a range:

```python
(~autos["registration_year"].between(1900, 2016)).sum() / autos.shape[0]
```

- There are many ways to select the rows in a dataframe that fall within a value range for a column, e.g.:

```python
autos = autos[autos["registration_year"].between(1900, 2016)]
```

- A series has a (hidden) `index` column and a `values` column; `s.index` shows the indexes of the series
- Combine the data from several series objects into a single dataframe
- Quickly create a pandas series or dataframe:

```python
# pandas series:
s = pd.Series([True, True, False, True])
# or from a dictionary abc
s = pd.Series(abc)  # keys in the dict become the index of the series

# pandas dataframe:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# or from a series
df = pd.DataFrame(s, columns=["<name>"])  # without columns=, the column name defaults to 0
```

- Add many series into a df: convert one series to a df, then add the other series to this df as new columns
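The last point — building a dataframe from several series — in a runnable form; the series names and values here are made-up stand-ins for aggregation results (using `Series.to_frame` as one way to do the conversion):

```python
import pandas as pd

# Two hypothetical aggregated series sharing the same index
mean_price = pd.Series({"audi": 9000, "bmw": 8000})
mean_mileage = pd.Series({"audi": 120000, "bmw": 130000})

# Convert one series to a dataframe, then add the other as a new column
df = mean_price.to_frame(name="mean_price")
df["mean_mileage"] = mean_mileage  # aligned by index
```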