Python Pandas

Posted on 05/09/2018, in Data Science, Python.

This note covers only the pandas package in Python. See also: python note, data note or machine learning note.

In this post

Documentation
Installation
Input
- Input data
Access : loc, iloc
Filtering Pandas DataFrame
.apply()

Documentation

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language
Official pandas doc (use the search function)
Built on top of NumPy
10 minutes to pandas

Installation

Recommended: install it with the Anaconda package manager
Usage: import pandas as pd

Input

train.head() (cf) : shows the first 5 rows (the default) of the data set, for a quick look at the data. Use .head(n=<number>) to show <number> rows instead of 5.
train.tail() : shows the last 5 rows (the default).
train.info() : shows the dtype of each column, along with non-null counts and memory usage.
train.describe() (cf) : shows summary statistics (count, mean, standard deviation, min, quartiles, max) of the numeric columns, giving a general picture of the distribution and spread of the data set.
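A minimal sketch of these inspection calls, assuming a small hypothetical DataFrame named train (the column names and values are made up for illustration only):

```python
import pandas as pd

# Hypothetical data standing in for a real training set
train = pd.DataFrame({
    "age": [22, 38, 26, 35, 54, 2, 27],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13],
})

first_rows = train.head()       # first 5 rows by default
first_three = train.head(n=3)   # first 3 rows
last_rows = train.tail()        # last 5 rows by default
train.info()                    # prints dtypes, non-null counts, memory usage
summary = train.describe()      # count, mean, std, min, quartiles, max
print(summary)
```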

Input data

From a dictionary variable, use pd.DataFrame (cf)

# from dictionary
import pandas as pd

names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

# each key becomes a column name, each list holds that column's values
my_dict = {'country': names, 'drives_right': dr, 'cars_per_cap': cpc}

cars = pd.DataFrame(data=my_dict)

cars.index = row_labels : sets the row index to row_labels (a list with one label per row) instead of the automatic integer numbering.
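As a rough illustration of setting row labels, here is a shortened version of the cars table. The labels in row_labels are hypothetical, chosen only for this sketch:

```python
import pandas as pd

cars = pd.DataFrame({
    "country": ["United States", "Australia", "Japan"],
    "drives_right": [True, False, False],
    "cars_per_cap": [809, 731, 588],
})

# Hypothetical row labels; the list needs exactly one entry per row
row_labels = ["US", "AUS", "JPN"]
cars.index = row_labels

print(cars)
```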

From a csv file, use train = pd.read_csv(<file>)
- Pass index_col=0 to use the first column of the file as the row index instead of the automatic integer index.
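To see what index_col=0 changes, the sketch below reads the same CSV text twice. The CSV content is hypothetical and is read from an in-memory string (io.StringIO) only so the example runs without a file on disk:

```python
import io
import pandas as pd

# Hypothetical CSV content; the first (unnamed) column holds the row labels
csv_text = """,country,cars_per_cap
US,United States,809
AUS,Australia,731
"""

# Without index_col, the label column is read as an ordinary data column
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index
labeled = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.columns.tolist())
print(labeled.index.tolist())
```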

Access : `loc`, `iloc`

Square brackets
- Column access: cars[['country','capital']]
- Row access: only through slicing, e.g. cars[1:4]
loc (label based) (cf)
- Row access: cars.loc[['RU', 'USA']]
- Column access: cars.loc[:,'country']
- Row & column: cars.loc[['RU'],['country']]
With single brackets, the result is a Pandas Series (pandas.core.series.Series); with double brackets, it is a Pandas DataFrame (pandas.core.frame.DataFrame)!
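The access patterns above can be sketched on a small table. The row labels (USA, RU, JPN) and the shortened data are assumptions made for this example:

```python
import pandas as pd

cars = pd.DataFrame(
    {
        "country": ["United States", "Russia", "Japan"],
        "cars_per_cap": [809, 200, 588],
    },
    index=["USA", "RU", "JPN"],  # hypothetical row labels
)

col_series = cars["country"]                # single brackets -> Series
col_frame = cars[["country"]]               # double brackets -> DataFrame
row_slice = cars[1:3]                       # square brackets on rows: slicing only
rows = cars.loc[["RU", "USA"]]              # rows by label, in the order requested
col_by_label = cars.loc[:, "country"]       # all rows, one column -> Series
cell_frame = cars.loc[["RU"], ["country"]]  # 1x1 DataFrame

print(type(col_series).__name__, type(col_frame).__name__)
```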

iloc: select rows and columns by number (integer-location based indexing) [see more]

# Rows:
data.iloc[0] # first row of data frame (Aleshia Tomkiewicz) - Note a Series data type output.
data.iloc[1] # second row of data frame (Evan Zigomalas)
data.iloc[-1] # last row of data frame (Mi Richan)
  
# Columns:
data.iloc[:,0] # first column of data frame (first_name)
data.iloc[:,1] # second column of data frame (last_name)
data.iloc[:,-1] # last column of data frame (id)
  
# Multiple row and column selections using iloc and DataFrame
data.iloc[0:5] # first five rows of dataframe
data.iloc[:, 0:2] # first two columns of data frame with all rows
data.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns.
data.iloc[0:5, 5:8] # first 5 rows and 6th, 7th, 8th columns of data frame (positions 5, 6, 7; county -> phone1).

dataset.iloc[:,:-1].values: selects the values of all rows (:) and all columns except the last one (:-1)
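One place this pattern shows up is splitting feature columns from a final target column. The sketch below assumes a hypothetical dataset whose last column is the target; the column names are illustrative:

```python
import pandas as pd

# Hypothetical dataset: two feature columns followed by a target column
dataset = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0],
    "x2": [10.0, 20.0, 30.0],
    "y": [0, 1, 0],
})

X = dataset.iloc[:, :-1].values  # all rows, every column except the last -> 2-D array
y = dataset.iloc[:, -1].values   # all rows, last column only -> 1-D array

print(X.shape, y.shape)
```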

Filtering Pandas DataFrame

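Goal: select the rows of a DataFrame that satisfy a condition on a column
- Build the condition from a Pandas Series (single brackets), not a Pandas DataFrame!
- Comparison: brics["area"] > 8 gives a boolean Series
- Full filter: brics[brics['area'] > 8] keeps the rows where the comparison is True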

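Boolean operators: Python's and / or do not work element-wise on a Series, so use np.logical_and, np.logical_or, np.logical_not (or the &, |, ~ operators with parentheses around each comparison)!

brics[np.logical_and(brics['area'] > 8, brics['area'] < 10)]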
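A small runnable sketch of the filtering steps. The area values are hypothetical, picked only so that the filters return different rows:

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only to make the filters visible
brics = pd.DataFrame({
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "area": [8.5, 17.1, 3.3, 9.6, 1.2],
})

mask = brics["area"] > 8   # boolean Series, one True/False per row
large = brics[mask]        # rows where the mask is True

# Two conditions combined element-wise
between = brics[np.logical_and(brics["area"] > 8, brics["area"] < 10)]

# Same filter with the & operator; the parentheses are required
between_alt = brics[(brics["area"] > 8) & (brics["area"] < 10)]

print(between)
```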

`.apply()`

To create a new column that holds the length of each element in another column, we can use the .apply function

brics['name_length'] = brics['country'].apply(len)

That means we apply the len function to each element of the country column.
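A self-contained version of this, assuming a brics table with only a country column. The .str.len() line is an alternative added here for comparison; it is not part of the original note:

```python
import pandas as pd

brics = pd.DataFrame({
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
})

# Call len once per element of the country column
brics["name_length"] = brics["country"].apply(len)

# Alternative for string columns: the built-in string accessor
alt_length = brics["country"].str.len()

print(brics)
```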