Python Pandas
Posted on 05/09/2018, in Data Science, Python. This note only collects remarks about the pandas package in Python. See also: python note, data note or machine learning note.
Documentation
- pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language
- Official pandas doc (use the search function)
- Built on top of numpy
- 10 minutes to pandas
Installation
- Install it with the Anaconda package manager
- Usage:
import pandas as pd
Input
- train.head() (cf): first 5 rows (default) of the data set, for a quick look at the data. Use .head(n=<number>) to show <number> rows instead of 5.
- train.tail(): last 5 rows.
- train.info(): show the dtypes of the dataframe.
- train.describe() (cf): summary of the distribution, dispersion and shape of the dataset, to get a general understanding of it.
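As a quick illustration of these inspection methods, here is a minimal sketch on a tiny hand-made DataFrame. The name train follows the notes above; the columns and values are made up for the example.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
train = pd.DataFrame({
    "age": [22, 35, 58, 41, 29, 63, 47],
    "fare": [7.25, 71.28, 26.55, 8.05, 13.0, 30.0, 52.0],
})

first_rows = train.head()       # first 5 rows by default
first_three = train.head(n=3)   # first 3 rows instead of 5
last_rows = train.tail()        # last 5 rows
train.info()                    # prints dtypes and non-null counts
summary = train.describe()      # count, mean, std, min, quartiles, max
print(summary)
```

Note that describe() only covers numeric columns by default.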
Input data
-
From a dictionary variable, use pd.DataFrame (cf):
    # from dictionary
    names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
    dr = [True, False, False, False, True, True, True]
    cpc = [809, 731, 588, 18, 200, 70, 45]
    my_dict = {'country': names, 'drives_right': dr, 'cars_per_cap': cpc}
    cars = pd.DataFrame(data=my_dict)
cars.index = row_labels: set the row index to the labels in the list row_labels, instead of the automatic integers.
-
From a csv file, use train = pd.read_csv(<file>)
- Use index_col=0 to take the first column of the file as the row index, so that pandas does not add an automatic integer index.
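The two ways of loading data can be sketched as follows. This is only an illustration: the row labels and the CSV text are hypothetical, and io.StringIO stands in for a real file on disk.

```python
import io
import pandas as pd

# From a dictionary: the keys become the column names
my_dict = {
    "country": ["United States", "Australia", "Japan"],
    "drives_right": [True, False, False],
    "cars_per_cap": [809, 731, 588],
}
cars = pd.DataFrame(data=my_dict)
cars.index = ["US", "AUS", "JAP"]  # hypothetical row labels

# From CSV text whose first column holds the row labels
csv_text = ",country,cars_per_cap\nUS,United States,809\nAUS,Australia,731\n"

# Without index_col, the label column is read as ordinary data
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index
labelled = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.columns.tolist())
print(labelled.index.tolist())
```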
Access: loc, iloc
- Square brackets
  - Column access: cars[['country','capital']]
  - Row access: only through slicing, e.g. cars[1:4]
- loc (label based) (cf)
  - Row access: cars.loc[['RU', 'USA']]
  - Column access: cars.loc[:,'country']
  - Row & column: cars.loc[['RU'],['country']]
- With single brackets the result is a Pandas Series (pandas.core.series.Series); with double brackets it is a Pandas DataFrame (pandas.core.frame.DataFrame)!
-
iloc: select rows and columns by number (integer-location based indexing) [see more]
    # Rows:
    data.iloc[0]   # first row of data frame (Aleshia Tomkiewicz) - note a Series data type output
    data.iloc[1]   # second row of data frame (Evan Zigomalas)
    data.iloc[-1]  # last row of data frame (Mi Richan)
    # Columns:
    data.iloc[:, 0]   # first column of data frame (first_name)
    data.iloc[:, 1]   # second column of data frame (last_name)
    data.iloc[:, -1]  # last column of data frame (id)
    # Multiple row and column selections using iloc and DataFrame
    data.iloc[0:5]                     # first five rows of dataframe
    data.iloc[:, 0:2]                  # first two columns of data frame with all rows
    data.iloc[[0,3,6,24], [0,5,6]]     # 1st, 4th, 7th, 25th rows + 1st, 6th, 7th columns
    data.iloc[0:5, 5:8]                # first 5 rows and the columns at positions 5, 6, 7 (county -> phone1)
dataset.iloc[:,:-1].values: select the values of all rows (:) and all columns except the last one (:-1).
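To compare the three access styles side by side, here is a small sketch. The cars DataFrame and its row labels are hypothetical stand-ins for the ones used in the notes above.

```python
import pandas as pd

# Hypothetical data with string row labels
cars = pd.DataFrame(
    {
        "country": ["United States", "Russia", "Japan", "India"],
        "capital": ["Washington, D.C.", "Moscow", "Tokyo", "New Delhi"],
        "cars_per_cap": [809, 200, 588, 18],
    },
    index=["USA", "RU", "JAP", "IN"],
)

# Square brackets
as_series = cars["country"]               # single brackets: Series
as_frame = cars[["country", "capital"]]   # double brackets: DataFrame
row_slice = cars[1:4]                     # rows at positions 1, 2, 3

# loc: label based
by_label = cars.loc[["RU", "USA"]]
one_cell = cars.loc[["RU"], ["country"]]  # 1x1 DataFrame

# iloc: integer-position based
first_row = cars.iloc[0]                  # Series
features = cars.iloc[:, :-1].values       # NumPy array without the last column

print(type(as_series).__name__, type(as_frame).__name__)
print(features.shape)
```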
Filtering Pandas DataFrame
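- Goal: select the rows of a DataFrame that satisfy a condition on a column.
- Build the condition from a Pandas Series (single brackets), not a Pandas DataFrame!
  - Comparison: brics["area"] > 8 (gives a boolean Series)
  - Total: brics[brics['area'] > 8]
- To combine conditions with boolean operators, use np.logical_and, np.logical_or or np.logical_not (the element-wise operators &, | and ~ also work if each comparison is wrapped in parentheses):
brics[np.logical_and(brics['area'] > 8, brics['area'] < 10)]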
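The filtering steps above can be sketched as follows. The brics DataFrame here is an assumed example with approximate areas in million km²; only the column name area comes from the notes.

```python
import numpy as np
import pandas as pd

# Assumed example data (approximate areas, million km^2)
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Step 1: the comparison on a Series gives a boolean Series
mask = brics["area"] > 8

# Step 2: use the boolean Series to keep only the matching rows
large = brics[mask]

# Combining two conditions element-wise
between = brics[np.logical_and(brics["area"] > 8, brics["area"] < 10)]

# Equivalent form with the & operator; the parentheses are required
between_op = brics[(brics["area"] > 8) & (brics["area"] < 10)]

print(large)
print(between)
```

Python's plain and / or do not work here because they try to reduce a whole Series to a single truth value.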
.apply()
To create a new column holding the length of each element of another column, we can use the .apply function:
brics['name_length'] = brics['country'].apply(len)
This applies the len function to every element of the country column.
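A minimal runnable sketch of the same idea, on an assumed brics DataFrame that only has a country column:

```python
import pandas as pd

# Assumed example data
brics = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India", "China", "South Africa"]}
)

# apply calls len once per element of the column
brics["name_length"] = brics["country"].apply(len)

# For string columns, the vectorised .str accessor gives the same result
same_lengths = brics["country"].str.len()

print(brics)
```

Any function that takes one value and returns one value can be passed to .apply in the same way.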