Anh-Thi DINH

Python Pandas

Posted on 05/09/2018, in Data Science, Python.

This note collects tips on the pandas package in Python. See also: python note, data note or machine learning note.

Documentation

  • pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language
  • Official pandas doc (use the search function)
  • Built on top of NumPy
  • 10 minutes to pandas

Installation

  • Install via the Anaconda distribution, which bundles pandas
  • Usage: import pandas as pd

Input

  • train.head() (cf) : shows the first 5 rows (the default) of the data set for a quick look at the data. Use .head(n=<number>) to show <number> rows instead.
  • train.tail() : shows the last 5 rows
  • train.info() : shows the column names, non-null counts, dtypes and memory usage of the dataframe
  • train.describe() (cf) : summary statistics (count, mean, standard deviation, min, quartiles, max) of the numeric columns, giving a general picture of each column's distribution and dispersion
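
The inspection methods above can be tried on a small table. This is a minimal sketch: the `train` data frame and its `age` and `fare` columns are made up for illustration and are not from any real data set.

```python
import pandas as pd

# Hypothetical toy data standing in for `train`
train = pd.DataFrame({
    "age": [22, 38, 26, 35, 54, 2, 27],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13],
})

print(train.head(3))   # first 3 rows instead of the default 5
print(train.tail(2))   # last 2 rows
train.info()           # prints dtypes and non-null counts itself

# describe() returns a DataFrame, so single statistics can be looked up
summary = train.describe()
print(summary.loc[["count", "mean", "min", "max"]])
```

Since describe() returns a regular DataFrame, its rows (count, mean, std, ...) can be selected with loc like any other label.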

Input data

  • From a dictionary variable, use pd.DataFrame (cf)

    # from dictionary
    import pandas as pd

    # one list per column, all of the same length
    names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
    dr =  [True, False, False, False, True, True, True]
    cpc = [809, 731, 588, 18, 200, 70, 45]

    # keys become column names, values become column data
    my_dict = {'country':names, 'drives_right':dr, 'cars_per_cap':cpc}

    cars = pd.DataFrame(data = my_dict)
    
    • cars.index = row_labels : replaces the automatic integer row index with the labels in row_labels (a list with one label per row).
  • From a CSV file, use train = pd.read_csv(<file>)

    • Use index_col = 0 to take the first column of the file as the row index instead of adding an automatic integer index.
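
To see what index_col = 0 changes, here is a small sketch that reads the same CSV text twice. The CSV content and the `code` column are invented for the example, and io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real call would pass a file path instead
csv_text = """code,country,cars_per_cap
US,United States,809
JPN,Japan,588
IN,India,18
"""

# Default: pandas adds an integer index 0, 1, 2 and keeps `code` as a column
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column (`code`) becomes the row index
labeled = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index)
print(labeled)
```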

Access : loc, iloc

  • Square brackets
    • Column access: cars[['country','drives_right']]
    • Row access: only through slicing, e.g. cars[1:4]
  • loc (label based) (cf)
    • Row access: cars.loc[['RU', 'USA']]
    • Column access: cars.loc[:,'country']
    • Row & column: cars.loc[['RU'],['country']]
  • Single brackets return a pandas Series (pandas.core.series.Series); double brackets return a pandas DataFrame (pandas.core.frame.DataFrame)!
  • iloc: select rows and columns by integer position (integer-location based indexing) [see more]

    # Rows:
    data.iloc[0] # first row of data frame (Aleshia Tomkiewicz) - Note a Series data type output.
    data.iloc[1] # second row of data frame (Evan Zigomalas)
    data.iloc[-1] # last row of data frame (Mi Richan)
      
    # Columns:
    data.iloc[:,0] # first column of data frame (first_name)
    data.iloc[:,1] # second column of data frame (last_name)
    data.iloc[:,-1] # last column of data frame (id)
      
    # Multiple row and column selections using iloc and DataFrame
    data.iloc[0:5] # first five rows of dataframe
    data.iloc[:, 0:2] # first two columns of data frame with all rows
    data.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns.
    data.iloc[0:5, 5:8] # first 5 rows and the columns at positions 5, 6, 7, i.e. the 6th, 7th, 8th columns (county -> phone1).
    
  • dataset.iloc[:,:-1].values: selects the values of all rows (:) and of all columns except the last one (:-1), as a NumPy array
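
The access patterns above can be compared side by side on a three-row version of the cars table. This is only a sketch: the row labels `US`, `RU`, `JPN` are assumed for illustration.

```python
import pandas as pd

cars = pd.DataFrame({
    "country": ["United States", "Russia", "Japan"],
    "drives_right": [True, True, False],
    "cars_per_cap": [809, 200, 588],
})
cars.index = ["US", "RU", "JPN"]  # hypothetical row labels

# Single vs. double brackets
as_series = cars["country"]     # pandas Series
as_frame = cars[["country"]]    # pandas DataFrame with one column

# Label-based selection with loc
ru_us = cars.loc[["RU", "US"]]            # two rows, in the order requested
one_cell = cars.loc[["RU"], ["country"]]  # 1x1 DataFrame

# Position-based selection with iloc
first_row = cars.iloc[0]                  # first row as a Series
first_two_cols = cars.iloc[:, 0:2]        # all rows, columns at positions 0 and 1
features = cars.iloc[:, :-1].values       # NumPy array without the last column

print(type(as_series).__name__, type(as_frame).__name__)
print(features.shape)
```

Note that the end of an iloc slice is excluded, so 0:2 covers positions 0 and 1 only.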

Filtering Pandas DataFrame

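  • Goal: select the rows of a data frame that satisfy a condition on a column
    • Build the condition from a pandas Series (single brackets), not a pandas DataFrame!
    • Comparison: brics["area"] > 8 gives a boolean Series
    • Full selection: brics[brics['area'] > 8]
  • To combine conditions, Python's and / or do not work on a Series; use np.logical_and, np.logical_or, np.logical_not (or the element-wise operators &, |, ~ with parentheses around each condition)!
    brics[ np.logical_and(brics['area'] > 8, brics['area'] < 10)]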
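
A runnable sketch of the filtering steps above. The `area` values are illustrative figures in million km² and are an assumption of this example, not data taken from the note.

```python
import numpy as np
import pandas as pd

# Illustrative data; area is assumed to be in million km^2
brics = pd.DataFrame({
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
})

# Step 1: the comparison yields a boolean Series
mask = brics["area"] > 8

# Step 2: use it inside square brackets to keep the matching rows
large = brics[mask]

# Combining two conditions with NumPy ...
between = brics[np.logical_and(brics["area"] > 8, brics["area"] < 10)]

# ... or with the element-wise & operator (parentheses are required)
between_ops = brics[(brics["area"] > 8) & (brics["area"] < 10)]

print(large)
print(between)
```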
    

.apply()

To create a new column holding the length of each element in another column, we can use the .apply method:

brics['name_length'] = brics['country'].apply(len)

This applies the len function to every value of the country column.
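
Here is the same idea as a self-contained sketch, with a small `brics` frame made up for the example. It also shows `.str.len()`, a built-in pandas alternative that gives the same result for string columns.

```python
import pandas as pd

# Hypothetical minimal frame with only the column we need
brics = pd.DataFrame({
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
})

# apply() calls len on each element of the Series, one at a time
brics["name_length"] = brics["country"].apply(len)

# For strings, the .str accessor offers the same operation
lengths_str = brics["country"].str.len()

print(brics)
```

Any function taking one value and returning one value can be passed to apply in place of len.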
