IBM Data Course 5: Databases and SQL for DS (w1-w2)

Posted on 06/04/2019, in Data Science, Python.

This note was first taken when I learnt the IBM Data Professional Certificate course on Coursera.

Go back to Course 4.

Go to Course 5 Week 3 & 4.

In this post

Week 1: Introduction to Databases and Basic SQL
Week 2: Advanced SQL

Week 1: Introduction to Databases and Basic SQL

Basics

SQL is 1 of the top 3 skills for data scientist.
SQL is used for communicating with databases (query data)
SQL = Structured English Query Language
Data is a collection of facts in the form of words, numbers or even pictures.
Database
- is a respository of data.
- It’s a program that stores data.
- provides the functionality for adding, modifying and querying that data
- different kind of databases store data in diff forms.
- Data stored in a tabular form is a relational database.
- A set of software tools for the data in the database is called a database management system or DBMS for short.
- RDBMS = relational database management system
  - Example: MySQL, Oracle Database
Basic commands: CREATE, INSERT, SELECT, UPDATE, DELETE.

How to create a Database instance on Cloud?

Advantages of using Cloud Database
- ease of use
- scalability
- disaster recovery
An instance of the Cloud database operates as a service that handles all application requests to work with the data and any of the databases managed by that instance.
IBM Db2 Warehouse on Cloud
IBM Db2 service
Page of apps (need to log in)

Basic SQL

DDL = Data Definition Language (define, change or drop data)
DML = Data Manipulation Language (read and modify data)
The primary key of a relational table uniquely identifies each row in a table.
CREATE statement
- Syntax
```
  create table TABLENAME (
      COLUMN1 datatype,
      COLUMN2 datatype,
      COLUMN3, datatype,
          ...
      ) ;
```
- Example
```
  create table TEST (
      ID integer,
      NAME varchar(30)
      );
```
- ID int NOT NULL, : it cannot contain a NULL or an empty value.
- PRIMARY KEY (ID) : If you look at the last row in the create table statement above you will note that we are using ID as a Primary Key and the database does not allow Primary Keys to have NULL values.
- A Primary Key is a unique identifier in a table, and using Primary Keys can help speed up your queries significantly.
  - It must contain a unique value for each row of data.
  - It cannot contain null values.
- DROP table before creating it to ignore the duplication error
```
  drop table COUNTRY;
  create table COUNTRY (
      ID integer PRIMARY KEY NOT NULL,
      CCODE char(2),
      NAME varchar(60)
      );
```
- Use the DEFAULT clause in the CREATE TABLE statement to specify the default value for the database server to insert into a column when no explicit value for the column is specified.
- Use the CHECK clause to designate conditions that must be met before data can be assigned to a column during an INSERT or UPDATE statement.
SELECT statement
- is DML which is used to read and modify data.
- The output of a select statement is called the Result Set or a Result Table.
- The order of columns displayed always matches the order in the SELECT statement.
- used with WHERE with operators: =, >, <, >=, <=, <>
- Examples
```
  select * from book
		
  select <col1>, <col2> from book
		
  select book_id, title from book WHERE book_id='B1'
```
COUNT
- To get the total number of rows in a given table, simply issue: select COUNT(*) from tablename
- retrieve the number of rows where the medal recipient is from CANADA: select COUNT(COUNTRY) from MEDALS where COUNTRY='CANADA'
DISTINCT is used to remove duplicate values from a result set.
- To retrieve unique values in a column issue: select DISTINCT columnname from tablename
- retrieve the list of unique countries that received GOLD medals, that is, removing all duplicate values of the same country: select DISTINCT COUNTRY from MEDALS where MEDALTYPE = 'GOLD'
LIMIT is used for restricting the number of rows retrieved from the database.
- To retrieve just the first 10 rows in a table: select * from tablename LIMIT 10
- This can be very useful to examine the result set by looking at just a few rows instead of retrieving the entire result set which may be very large.
- retrieve just a few rows in the MEDALS table for a
- particular year, you can issue: select * from MEDALS where YEAR = 2018 LIMIT 5

INSERT

add new rows to the table
DML
You can insert 1 row, 2 rows, or more than two rows with a single INSERT statement.

Examples,

  INSERT INTO <tablename> 
      (<col1>, <col2>,...) 
      VALUES 
          (<val1a>, <val2a>, ...), 
          (<val1b>, <val2b>, ...)

UPDATE & DELETE
- DML
- Update
```
  UPDATE table_name
  SET column1 = value1, column2 = value2, ...
  WHERE condition;
```
  - Be careful and specify a where clause if you intend to update just specific rows, otherwise all rows will be updated.
- DELETE
```
  DELETE FROM table_name WHERE condition;
```
  - If no WHERE clause is used, all rows will be removed.

Relational database concepts (optional)

Information model & Data model
- Different types of models
- An information model is an abstract formal representation of entities that includes their properties, relationships, and the operations that can be performed on them.
- An information model is at the conceptual level and defines relationships between objects.
- Data models are defined in a more concrete level, are specific and include details.
  - The hierarchical model organizes its data using a tree structure
  - The relational model is the most used data model for databases
    - Allows for data independence
    - Data is stored in tables with entity relationships
  - An entity-relationship data model or ER data model, is an alternative to a relational data model.
- The building blocks of an ER diagram are entities and attributes.
  - Entities have attributes, which are the data elements that characterize the entity.
  - Attributes tell us more about the entity.
  - Ex: book -> entity, attributes -> book title, year,…
  - Entity -> table in database
  - Atributes -> columns in the table
- Foreign keys (FK): is defined in a second table, but it refers to the primary key or a unique key in the first table
  - For example, a table called Employees has a primary key called employee_id. Another table called Employee Details has a foreign key which references employee_id in order to uniquely identify the relationship between the two tables.
- A table containing one or more foreign keys is called a Dependent table.
Types of relationship between entities
- Building blocks of relationship are
  - Entities sets: rectangle
  - Relationship sets: diamond with lines conneting associated entities.
  - Crows foot notations: > | < symbols
  - Attributes: ovals
- One-to-One relationship: one book must be written by one author
- One-to-Many relationship or Many-to-One relationship: A book written by many authors.
- Many-to-Many relationship: many authors write many books.
Mapping entities to tables
- ERD = Entity relational diagram
Relational model concepts
- Relational model: based on the concept of relation.
- relational dabase
  - a set of relations
  - relation = mathematical term for table
  - 2 parts
    - relational schema
    - relational instance
  - Relational Schema: name of a relation and attributes
  - Relational Instance: a table made up of attributes or columns.
    - column = attributes = field
    - Row = tuple
  - Relation: degree & cardinality
    - degree = the number of columns/attributes in a relation
    - cardinality = the number of tuples/rows

Week 2: Advanced SQL

How to create databses + upload csv files to cooperate with them on IBM Clound Db2? Check the instruction on week 2 (Week2Lab1v4-RA.pdf)

String Patterns, Ranges, Sorting, and Grouping

Using String Patterns, Ranges

Retrieving rows from a table

String Pattern: in the case we don’t know/remember exactly the key to apply to WHERE

  SELECT firstname 
      FROM author
      WHERE firstname like 'R%'

Range

  SELECT title, pages
      FROM book
      WHERE pages >= 290 AND pages <= 300
		
  SELECT title, pages
      FROM book
      WHERE pages BETWEEN 290 AND 300

Set of Values

  SELECT firstname, lastname, country
      FROM author
      WHERE country='AU' OR country='BR'
		
  SELECT firstname, lastname, country
      FROM author
      WHERE country IN ('AU', 'BR')

Sorting result sets

Use ORDER BY (DESC or ASC - default)

  SELECT title 
      FROM book
      ORDER BY title DESC
		
  SELECT title, pages
      FROM book
      ORDER BY 2 -- column 2 (pages)

Grouping Result Sets

DISTINCT: eliminating duplicates

  SELECT DISTINCT(country)
      FROM author

GROUP BY: count the number of row has the same values

  SELECT country, COUNT(country) 
      FROM author 
      GROUP BY country
		
  SELECT country, COUNT(country)
      AS count -- create a column named "count" containing the numbers
      FROM author
      GROUP BY country

HAVING: check more than 4 authors from the same country, only works with the GROUP BY clause.

  SELECT country, COUNT(country)
      AS count
      FROM author
      GROUP BY country
      HAVING COUNT(country) > 4

Functions, Sub-Queries, Multiple Tables

built-in functions to perform operations on data right within the database.
reduces network traffic and use of bandwidth
It is also possible to create your own functions in the database
One of the limitations of built-in aggregate functions like AVG() is that they cannot always be evaluated in the WHERE clause -> need to use sub-queries!!

Petsale table

Add up all the values in the SALEPRICE column and explicitly name the output column SUM_OF_SALEPRICE
```
  SELECT SUM(SALEPRICE) AS SUM_OF_SALEPRICE FROM PETSALE
```
Maximum QUANTITY of any ANIMAL sold:
```
  select MAX(QUANTITY) from PETSALE
```
Average value of SALEPRICE :
```
  select AVG(SALEPRICE) from PETSALE
```

Average SALEPRICE per ‘Dog’ :

  select AVG( SALEPRICE / QUANTITY ) from PETSALE where ANIMAL = 'Dog'

Round UP/DOWN every value in SALEPRICE column to nearest integer:
```
  select ROUND(SALEPRICE) from PETSALE
```
Retrieve the length of each value in ANIMAL column
```
  select LENGTH(ANIMAL) from PETSALE
```
Retrieve the ANIMAL column values in UPPERCASE format
```
  select UCASE(ANIMAL) from PETSALE
```

Use the function in a WHERE clause

  select * from PETSALE where LCASE(ANIMAL) = 'cat'

Use DISTINCT() function to get unique values

  select DISTINCT(UCASE(ANIMAL)) from PETSALE

Date, Time functions: Functions exist to extract the DAY, MONTH, DAYOFMONTH, DAYOFWEEK, DAYOFYEAR, WEEK, HOUR, MINUTE, SECOND.

Extract the DAY portion from a date

select DAY(SALEDATE) from PETSALE where ANIMAL = 'Cat'

Get the number of sales during the month of may (i.e. month 5):
```
select COUNT(*) from PETSALE where MONTH(SALEDATE)='05'
```

You can also perform DATE or TIME arithmetic

select (SALEDATE + 3 DAYS) from PETSALE

Special registers CURRENT TIME and CURRENT DATE are also available:
```
select (CURRENT DATE - SALEDATE) from PETSALE
```

Sub-Queries and Nested Selects

One of the limitations of built-in aggregate functions like AVG() is that they cannot always be evaluated in the WHERE clause.

we want to retrieve the list of employees who earn more than the average salary.

  select EMP_ID, F_NAME, L_NAME, SALARY from employees 
      where SALARY < 
   		(select AVG(SALARY) from employees);

The IN operator can also be used and there can be multiple leves of sub-queries, such as:

  select EMP_ID, F_NAME, L_NAME, DEP_ID from employees 
   where DEP_ID IN  
   ( select DEP_ID from employees where DEP_ID > 
           ( select MIN(DEP_ID) from employees ) 
   );

The sub-select doesn’t just have to go in the where clause, it can also go in other parts of the query such as in the list of columns to be selected:

  select EMP_ID, SALARY,
      ( select AVG(SALARY) from employees ) 
      AS AVG_SALARY
    	from employees ;
	
  select * 
      from 
      ( select EMP_ID, F_NAME, L_NAME, DEP_ID 
          from employees) ;

Working with multiple tables

There are several ways to access multiple tables in the same query:

Sub-queries
Implicit JOIN
JOIN operators (INNER JOIN, OUTER JOIN, etc.)

Petsale table

Sub-queries

retrieve only the employee records that correspond to departments in the DEPARTMENTS table

  select * from employees 
    where DEP_ID IN
    ( select DEPT_ID_DEP from departments );

we want to retrieve only the list of employees from a specific location

  select * from employees 
    where DEP_ID IN
     ( select DEPT_ID_DEP from departments 
       where LOC_ID = 'L0002' );

retrieve the department ID and name for Empolyees who earn more than 70000:

 select DEPT_ID_DEP, DEP_NAME from departments
  where DEPT_ID_DEP IN
( select DEP_ID from employees 
  where SALARY > 70000 ) ;

Implicit JOIN

full join

  select * from employees, departments;

we limit the result set to only rows with matching department IDs:

 select * from employees, departments 
   where employees.DEP_ID = departments.DEPT_ID_DEP;

Since the table names can be sometimes long, we can use shorther aliases for table names as follows:
```
  select * from employees E, departments D 
    where E.DEP_ID = D.DEPT_ID_DEP;
```

the column names in the select clause can be pre-fixed by aliases:

  select E.EMP_ID, D.DEPT_ID_DEP  
    from employees E, departments D 
    where E.DEP_ID = D.DEPT_ID_DEP;

we want to see the department name for each employee:

  select E.EMP_ID, D.DEP_NAME from 
    employees E, departments D
    where E.DEP_ID = D.DEPT_ID_DEP

JOIN OPERATOR

Check at Go to Course 5 Week 3 & 4

Relational model constraints (optional)

At least one author writes one book. This is a one to one relationship. To look up the author information, the book entity refers to the author entity. In a relational data model, this is called referencing.
Primary Key which uniquely identifies a row in a table but not for the foreign key (FK)
Parent table: a table containing a Primary key that is related to at least one Foreign key.
Dependent table: a table containing one or more Foreign keys.
constraint: điều kiện ràng buộc. 6 different types of relational model constraint
- Entity Integrity Constraint: example : primary key. Terms: primary and key constraint or unique constraint are also used. This constraint prevents duplicate value in the table.
- Referential Integrity Constraint ensures the validity of the data using a combination of Primary Keys and Foreign Keys.
- Semantic Integrity Constraint: The semantic integrity constraint refers to the correctness of the meaning of the data. For example, in the relation author, if the attribute or column city contains a garbage value instead of Toronto, the garbage value does not have any meaning. The semantic integrity constraint is related to the correctness of the data.
- Domain constraint:A domain constraint specifies the permissible values for a given attribute. For example, in the relation author, the attribute country must contain a two letter country code such as CA for Canada or IN for India. If a number value of 34 is entered for the country attribute instead of a two let her country code, the value 34 does not have any meaning.
- Null constraint: The null constraint specifies that attribute values cannot be null
- Check constraint: The check constraint enforces domain integrity by limiting the values that are accepted by an attribute.