Anh-Thi DINH

DataQuest 1: Step 1 - Python Introduction

Posted on 09/09/2018, in Data Science, Python.

These are my notes on the Data Scientist path on Dataquest. I started them after learning some Python basics (covered in other notes), so I only write down the things that are new to me.

Mission 1 - Python basics

Mission 2 - Files and loops

  • Add an element to a list: a.append(<element>)
  • Open a file: f = open("test.txt", "r"), where "r" opens the file for reading.
  • Read a file content: a = f.read()
  • Split a string: a.split("<symbol>")
  • Procedure:

    f = open("<file-name>", "r")
    data = f.read()
    data_split = data.split("\n")
    list_of_list = []
    for row in data_split:
        list_of_list.append(row.split(","))
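
The procedure above can be sketched end to end as follows. This is only a minimal sketch: the file name test.csv and its contents are made up for illustration, and a with block is used (instead of a bare open) so that the file is closed automatically.

```python
import os
import tempfile

# Hypothetical sample file, created here only so the sketch is runnable.
path = os.path.join(tempfile.mkdtemp(), "test.csv")
with open(path, "w") as f:
    f.write("year,births\n1994,8096\n1995,7772")

# Read the whole file as one string.
with open(path, "r") as f:
    data = f.read()

# Split the string into rows, then split each row into fields.
list_of_list = []
for row in data.split("\n"):
    list_of_list.append(row.split(","))

print(list_of_list)
```

Every field is still a string at this point, so numbers have to be converted with int() or float() before doing arithmetic.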
    

Mission 80 - Booleans and If statements

Mission 3 - List operators

Mission 70 - Dictionaries

  • Key can be a number.
  • Search key in dict: <key> in <dict>
  • Create an empty dictionary: a = {}
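
A small sketch of the three points above. The variable names and values are hypothetical, chosen only for illustration.

```python
# Keys can be numbers as well as strings.
births = {1994: 8096, 1995: 7772}

# `in` looks at the keys, not at the values.
has_1994 = 1994 in births
has_8096 = 8096 in births

# Start from an empty dictionary and count occurrences.
counts = {}
for day in ["Mon", "Tue", "Mon"]:
    if day in counts:
        counts[day] += 1
    else:
        counts[day] = 1

print(has_1994, has_8096, counts)
```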

Mission 4 - Introduction to functions

  • Structure

    def <function-name>(<input>):
      <statements>
      return <output>
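
As a concrete instance of this structure, here is a minimal sketch. The function name average and its input are hypothetical, not taken from the course.

```python
def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    total = sum(values)
    return total / len(values)

result = average([2, 4, 6])
print(result)
```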
    

Mission 28 - Debugging errors

  • SyntaxError: the code is not valid Python, e.g. a missing closing " or de instead of def.
  • IndentationError: lines in the same block are indented differently.
  • See also the full lists of exceptions and errors.
  • Runtime errors
    • TypeError: an operation is applied to a value of the wrong type.
    • Traceback: the report printed with an error, showing the chain of code lines that led to it.
    • ValueError: a value has the right type but invalid content, for example converting the string "abc" to a float.
  • IndexError: trying to access an index that doesn’t exist in a list.
  • AttributeError: trying to use a method or attribute that the object doesn’t have.
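
To see the runtime errors above in action, here is a minimal sketch that triggers each one on purpose and records the name of the exception. The helper describe_error is a hypothetical name introduced only for this example.

```python
def describe_error(action):
    """Run `action` and return the name of the exception it raises, if any."""
    try:
        action()
    except Exception as exc:
        return type(exc).__name__
    return None

names = [
    describe_error(lambda: 1 + "1"),           # wrong operand types
    describe_error(lambda: float("abc")),      # right type, invalid content
    describe_error(lambda: [1, 2, 3][10]),     # index outside the list
    describe_error(lambda: "text".append(1)),  # str has no append method
]
print(names)
```

A SyntaxError or IndentationError is different: Python reports it while reading the file, before any line runs, so it cannot be caught this way from inside the same file.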

Mission 9 - Guided project: Explore U.S. Births

Download Reference solution.

Sample code

def month_births(input_lstlst):
    births_per_month = {}
    for lst in input_lstlst:
        if lst[1] in births_per_month:
            births_per_month[lst[1]] += lst[4]
        else:
            births_per_month[lst[1]] = lst[4]
    return births_per_month

cdc_month_births = month_births(cdc_list)
cdc_month_births

Mission 5 - Modules

A module is a collection of functions and variables that have been bundled together in a single file.

  • Read the list with csv

    import csv
    f = open("<data>.csv")
    csvreader = csv.reader(f)
    data = list(csvreader)
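
A runnable variant of the snippet above, as a sketch: io.StringIO stands in for an opened file, and the CSV content is made up for illustration.

```python
import csv
import io

# Hypothetical CSV content; io.StringIO behaves like an opened text file.
f = io.StringIO("year,month,births\n1994,1,8096\n1994,2,7772\n")
data = list(csv.reader(f))

header = data[0]
rows = data[1:]

# csv.reader returns every field as a string, so convert numbers explicitly.
births = [int(row[2]) for row in rows]
print(header, births)
```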
    

Mission 66 - Classes and Objects

  • A class name should be in PascalCase (capitalize the first letter of each word).
  • Functions defined inside a class are called methods.
  • The __init__() method must take self as its first parameter; self refers to the instance being created.
  • Without self, the class wouldn’t know where to store the internal data you want to keep.
  • The .type attribute set inside the class is an ordinary instance attribute; it is independent of the built-in type() function, which returns the class of an object.

    class Dataset:
        def __init__(self):
            self.type = "csv"
    
  • enumerate: loop over a list and get each index together with the value at that index

    for idx, value in enumerate(['foo', 'bar']):
      print(idx, value)
    
  • set(<list>) gives the unique values of the list (it converts the list to a set)
  • Python special methods, like __init__
    • When we implement __init__(), we tell the Python interpreter that the code in that method should run to initialize each new object we create.
    • __str__: returns the string shown when the object is printed or passed to str()
  • Sample code of Dataset class in this mission

    class Dataset:
        def __init__(self, data):
            # The first row is the header; the rest are data rows.
            self.header = data[0]
            self.data = data[1:]

        # Special method used by print() and str()
        def __str__(self):
            return str(self.data[0:10])

        def column(self, label):
            # Return None when the label is not a column name.
            if label not in self.header:
                return None

            # Find the position of the column in the header.
            index = 0
            for idx, element in enumerate(self.header):
                if label == element:
                    index = idx

            column = []
            for row in self.data:
                column.append(row[index])
            return column

        def count_unique(self, label):
            # A set keeps only the distinct values.
            unique_results = set(self.column(label))
            count = len(unique_results)
            return count
    

    nfl_dataset = Dataset(nfl_data)
    print(nfl_dataset)
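
To try the class without the NFL file, here is a compact sketch with made-up rows. The sample data is hypothetical, and column() uses list.index plus a list comprehension in place of the enumerate loop, which gives the same result for unique header names.

```python
class Dataset:
    def __init__(self, data):
        self.header = data[0]
        self.data = data[1:]

    def __str__(self):
        # Show at most the first 10 rows.
        return str(self.data[0:10])

    def column(self, label):
        if label not in self.header:
            return None
        index = self.header.index(label)
        return [row[index] for row in self.data]

    def count_unique(self, label):
        return len(set(self.column(label)))

# Hypothetical rows standing in for nfl_data.
sample = [["year", "team"], ["2014", "SEA"], ["2014", "DEN"], ["2015", "SEA"]]
dataset = Dataset(sample)

print(dataset)                       # calls __str__
teams = dataset.column("team")
unique_teams = dataset.count_unique("team")
missing = dataset.column("coach")    # label not in the header
```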

Mission 7 - Error Handling

  • If you surround the code that causes an error with a try/except block, the error will be handled, and the code will continue to run:

    try:
      int('')
    except Exception:
      print("There was an error")
    
  • When the Python interpreter generates an exception, it actually creates an instance of the Exception class
  • except Exception as exc: assigns the instance of the Exception class to the variable exc.
  • We can put the pass keyword in the except block when we want to ignore the error and do nothing:

    try:
      int('')
    except Exception:
      pass
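
A short sketch of except Exception as exc, showing what the exception instance contains. The variable names error_type and message are only for illustration.

```python
try:
    int("")
except Exception as exc:
    # exc is the exception instance; it only exists inside this block,
    # so anything needed later is copied into other variables.
    error_type = type(exc).__name__
    message = str(exc)

print(error_type)
print(message)
```

Catching a specific class (for example except ValueError) instead of Exception keeps unrelated errors visible.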
    

Mission 16 - List Comprehensions

  • enumerate() lets a for loop use 2 variables: the index and the value.
  • enumerate on a list of lists returns idx as the index of each row.
  • List comprehension: a for loop written inside a list:

    animal_lengths = [len(animal) for animal in animals]
    teams = [row[1] for row in nfl_suspensions]
    
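  • The None object (of type NoneType): use var is None to check for it.
  • is checks object identity (whether two names refer to the same object), whereas == compares values; == may give unexpected results or errors when comparing some objects with None, so use is for None checks.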
  • <dict>.items() method, which allows us to iterate through keys and values at the same time.

    for key, val in <dict>.items():
        print(key, val)
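
The points of this mission can be combined in one minimal sketch. All names and values here are hypothetical.

```python
animals = ["cat", "giraffe", None]

# List comprehension with a condition: skip missing values with `is not None`.
animal_lengths = [len(animal) for animal in animals if animal is not None]

# enumerate() yields (index, value) pairs, also inside a comprehension.
missing_rows = [idx for idx, animal in enumerate(animals) if animal is None]

# `==` compares values, `is` compares identity.
a = [1, 2]
b = [1, 2]
same_value = a == b     # equal contents
same_object = a is b    # two distinct list objects

# items() yields (key, value) pairs.
counts = {"cat": 2, "dog": 1}
pairs = [(key, val) for key, val in counts.items()]

print(animal_lengths, missing_rows, same_value, same_object, pairs)
```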
    

Challenge 17 - Variable Scopes

  • Once we overwrite the built-in name sum with a value (e.g. sum = 5), we can’t call the sum() function through that name anymore.
  • A local variable with the same name as a global one doesn’t lead to an error; it simply hides the global one inside the function.
  • If a local variable doesn’t exist, Python looks up the global variable with the same name and uses it, but the function cannot reassign it without declaring it global.
  • Lookup order: local > global > built-in functions/variables
  • Declare global <var> on a separate line inside the function, then use the variable afterwards.
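
The scope rules above, as a minimal sketch. The function and variable names are hypothetical.

```python
total = 10  # global variable

def read_only():
    # No local `total` exists, so Python falls back to the global one.
    return total + 1

def shadow():
    total = 0  # a new local variable; the global one is untouched
    return total

def update():
    global total  # declared on its own line, before the variable is used
    total = total + 5
    return total

read_result = read_only()
shadow_result = shadow()
total_before_update = total
update_result = update()

print(read_result, shadow_result, total_before_update, update_result, total)
```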

Mission 82 - Regular expressions

  • What is it? (figure: "Regular expression")

  • Another example, which works like a wildcard (using . as a placeholder for any single character) (figure: "Regular expression 2")

  • Some special cases:
    • . matches any single character,
    • ^a matches all strings that start with a,
    • a$ matches all strings that end with a
    • [bcr]at: any one of the characters within [] can fill that position (bat, cat, rat)
    • Use \ to escape special characters
    • "cat|dog" matches strings containing "cat" or "dog", such as "catfish" and "hotdog"
    • "[0-9]" matches any single digit from 0 to 9
    • "[a-z]" matches any lowercase letter
    • "[0-9]{4}" repeats the pattern "[0-9]" four times, i.e. it matches four digits in a row
  • Use the re module for regular expressions.
  • re.search(regex, string): checks whether string contains a match for regex (returns None if it doesn’t)

    if re.search(r"^[\[\(][Ss]erious[\]\)]", post[0]) is not None:
        serious_start_count += 1
    
  • re.sub("yo", "hello", "yo world") gives "hello world"
  • re.findall("[a-z]", "abc123") would return ["a", "b", "c"], because those are the substrings that match the regex.
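
The special cases above can be checked with a short sketch. The word list and the sample sentence are made up for illustration.

```python
import re

words = ["bat", "cat", "rat", "hat", "catfish", "hotdog"]

# [bcr]at: one of b, c or r, followed by "at".
bcr = [w for w in words if re.search("[bcr]at", w) is not None]

# ^ and $ anchor the pattern to the start and the end of the string.
starts_with_c = [w for w in words if re.search("^c", w) is not None]
ends_with_g = [w for w in words if re.search("g$", w) is not None]

# | means "or".
cat_or_dog = [w for w in words if re.search("cat|dog", w) is not None]

# {4} repeats the preceding pattern four times.
years = re.findall("[0-9]{4}", "Posted in 2018, updated in 2019")

# A backslash escapes a special character; a raw string keeps it readable.
has_dot = re.search(r"\.", "test.txt") is not None

greeting = re.sub("yo", "hello", "yo world")
print(bcr, starts_with_c, ends_with_g, cat_or_dog, years, has_dot, greeting)
```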

Mission 84 - Dates in Python

time module

  • import time
  • The time module works with Unix timestamps (seconds since the epoch, 1970-01-01 00:00:00 UTC)
  • time.time() gives the current timestamp (the number of seconds since the epoch)
  • time.gmtime(<time-stamp>) returns a human-readable struct_time object (in UTC)
    • <struct_time>.tm_year gives year
    • <struct_time>.tm_mon gives month (1-12)
    • <struct_time>.tm_mday gives day (1-31)
    • <struct_time>.tm_hour gives hour (0-23)
    • <struct_time>.tm_min gives minute (0-59)
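
A minimal sketch of these calls, using fixed timestamps so that the result does not depend on when it is run. The timestamps are chosen only for illustration.

```python
import time

# Timestamp 0 is the Unix epoch itself.
epoch = time.gmtime(0)
print(epoch.tm_year, epoch.tm_mon, epoch.tm_mday, epoch.tm_hour, epoch.tm_min)

# One day (86400 seconds) plus 90 minutes after the epoch.
later = time.gmtime(86400 + 90 * 60)
print(later.tm_mday, later.tm_hour, later.tm_min)

# The current timestamp is a float number of seconds.
now = time.time()
```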

datetime module

  • import datetime: lets us perform arithmetic on dates and times.
  • These datetime instances appear similar to struct_time instances. Attributes: year, month, day, hour, minute, second, microsecond
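  • datetime.datetime.utcnow() gives the current UTC time (deprecated since Python 3.12; datetime.datetime.now(datetime.timezone.utc) is the recommended alternative)
  • datetime.datetime.now() gives the current local time, e.g. datetime.datetime(2018, 9, 9, 7, 45, 36, 986100)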
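  • datetime.timedelta represents a duration; use it to perform arithmetic on datetime objects.
    • datetime.timedelta(weeks = 1, days = 23)
    • constructor arguments: weeks, days, hours, minutes, seconds, milliseconds, microseconds (the instance itself only stores days, seconds and microseconds)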
    import datetime
    kirks_birthday = datetime.datetime(year = 2233, month = 3, day = 22)
    diff = datetime.timedelta(weeks = 15)
    before_kirk = kirks_birthday - diff
    
  • datetime.datetime.strftime() converts a datetime to a human-readable string

    march3 = datetime.datetime(year = 2010, month = 3, day = 3)
    pretty_march3 = march3.strftime("%b %d, %Y")
    print(pretty_march3)
    
  • datetime.datetime.strptime() does the opposite of strftime: it parses a string into a datetime

    march3 = datetime.datetime.strptime("Mar 03, 2010", "%b %d, %Y")
    
  • datetime.datetime.fromtimestamp() converts from unix timestamp to datetime
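
The datetime pieces above fit together as in this sketch. The timestamp is chosen for illustration, and tz=datetime.timezone.utc is passed so that the result does not depend on the local time zone (without it, fromtimestamp returns local time).

```python
import datetime

# 1536451200 seconds after the epoch is 2018-09-09 00:00:00 UTC.
stamp = datetime.datetime.fromtimestamp(1536451200, tz=datetime.timezone.utc)

# Adding a timedelta gives a new datetime.
next_week = stamp + datetime.timedelta(weeks=1)

# Subtracting two datetimes gives a timedelta.
gap = next_week - stamp

# strftime formats a datetime; strptime parses the string back.
pretty = next_week.strftime("%b %d, %Y")
parsed = datetime.datetime.strptime(pretty, "%b %d, %Y")

print(stamp, next_week, gap.days, pretty, parsed)
```

Note that %b uses the month abbreviation of the current locale, so the formatted string is "Sep 16, 2018" only under the default (English) locale.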

Mission 218 - Guided Project: Exploring Gun Deaths in the US

Download Reference solution.

  • Initial setup

    import csv
    with open("guns.csv", "r") as f:
      reader = csv.reader(f)
      data = list(reader)
    print(data[0:5])
    
  • Count the number of rows for each sex (sample code)

    sexes = [row[5] for row in data]
    sex_counts = {}
    for item in sexes:
      if item in sex_counts:
          sex_counts[item] += 1
      else:
          sex_counts[item] = 1
    print(sex_counts)
    
  • Remember: in for key, val in ..., use <dict>.items() for a dictionary and enumerate(<list>) for a list (which gives idx, val).
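
As an alternative to the manual counting loop, the standard library offers collections.Counter. This is a swapped-in technique, not what the guided project uses, and the rows below are hypothetical stand-ins for the contents of guns.csv (only the position of the sex column, index 5, follows the sample code).

```python
from collections import Counter

# Hypothetical rows; only index 5 (sex) matters for this sketch.
data = [
    ["1", "2012", "01", "Suicide", "0", "M"],
    ["2", "2012", "01", "Suicide", "0", "F"],
    ["3", "2012", "02", "Homicide", "0", "M"],
]

sexes = [row[5] for row in data]
sex_counts = Counter(sexes)

# A Counter is a dict subclass, so items() works as usual.
for sex, count in sex_counts.items():
    print(sex, count)

# enumerate() gives the row index alongside each row.
first_female_row = None
for idx, row in enumerate(data):
    if row[5] == "F" and first_female_row is None:
        first_female_row = idx
```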
