Codecademy - DS 1 | Notes of Thi

Codecademy - DS 1

Posted on 24/07/2019, in Data Science.

I wrote this note when I started learning Data Science on Codecademy.

A day in the life - Data Analyst

Data extraction with SQL
Programming basics with Python
Data analysis using pandas, a Python library
Data visualization using Matplotlib, a Python library
Machine Learning using scikit-learn, a Python library

Relational Database Management System (RDBMS)

An RDBMS uses the SQL language to access the database.
Popular RDBMS:
- SQLite:
  - all of the data can be stored locally
  - popular choice for databases in cellphones, PDAs, MP3 players, set-top boxes, and other electronic gadgets. The SQL courses on Codecademy use SQLite.
- MySQL:
  - the most popular open source SQL database
  - easy to use, inexpensive, reliable, large community of developers
  - poor performance when scaling, open source development has lagged
  - does not include some advanced features that developers may be used to
- PostgreSQL:
  - open source SQL database
  - shares many of the same advantages as MySQL
  - foreign key support without requiring complex configuration.
  - slower in performance than other databases
- Oracle DB:
  - not open source (owned by Oracle Corporation)
  - for large applications, particularly in the banking industry
- SQL Server:
  - owned by Microsoft
  - Large enterprise applications mostly use SQL Server.
  - offers a free entry-level version called Express

SQL

Just look it up on this site!

The ALTER TABLE statement adds a new column to a table.

```
ALTER TABLE celebs
ADD COLUMN twitter_handle TEXT;
```
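
To try the statement without installing anything, here is a minimal sketch using Python's built-in sqlite3 module. The trimmed-down celebs table and its single row are assumptions made up for illustration; only the ALTER TABLE statement comes from the note.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical, simplified table with one made-up row.
conn.execute("CREATE TABLE celebs (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO celebs (name) VALUES ('Example Person')")

# The statement from the note.
conn.execute("ALTER TABLE celebs ADD COLUMN twitter_handle TEXT")

# PRAGMA table_info returns one row per column; index 1 holds the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(celebs)")]
existing = conn.execute("SELECT twitter_handle FROM celebs").fetchall()

print(columns)   # ['id', 'name', 'twitter_handle']
print(existing)  # [(None,)]
```

Rows that existed before the change get NULL (None in Python) in the new column.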

Constraints add information about how a column can be used. They are written after the data type of the column.

```
CREATE TABLE celebs (
   id INTEGER PRIMARY KEY,
   name TEXT UNIQUE,
   date_of_birth TEXT NOT NULL,
   date_of_death TEXT DEFAULT 'Not Applicable'
);
```
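
A small sketch of how these constraints behave, again through Python's sqlite3 module. The inserted rows are made up, and the error handling is just one way to surface the violations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE celebs (
        id INTEGER PRIMARY KEY,
        name TEXT UNIQUE,
        date_of_birth TEXT NOT NULL,
        date_of_death TEXT DEFAULT 'Not Applicable'
    )
""")

# date_of_death is omitted, so the DEFAULT value is stored.
conn.execute("INSERT INTO celebs (name, date_of_birth) VALUES ('A', '1990-01-01')")
default_value = conn.execute("SELECT date_of_death FROM celebs").fetchone()[0]

bad_inserts = [
    # Duplicate name: violates UNIQUE.
    "INSERT INTO celebs (name, date_of_birth) VALUES ('A', '1991-02-02')",
    # Missing date_of_birth: violates NOT NULL.
    "INSERT INTO celebs (name) VALUES ('B')",
]
violations = []
for statement in bad_inserts:
    try:
        conn.execute(statement)
    except sqlite3.IntegrityError as error:
        violations.append(str(error))

print(default_value)  # Not Applicable
print(violations)
```

Both bad inserts are rejected with an IntegrityError, so the table still holds only the first row.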

AS renames a column or table in the output using an alias.
```
SELECT name AS 'ten'
FROM movies;
```
DISTINCT is used to return unique values in the output. It filters out all duplicate values in the specified column(s).
LIKE is a useful operator when you want to match similar values against a pattern. Check this for other usages.
A CASE statement allows us to create different outputs (usually in the SELECT statement). It is SQL’s way of handling if-then logic.
Cross join
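
DISTINCT, LIKE and CASE from the notes above can be tried together in one small sketch. The movies table layout, the titles and the ratings are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, genre TEXT, imdb_rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [  # made-up rows
        ("Seven", "drama", 8.6),
        ("Se7en", "drama", 8.6),
        ("Up", "animation", 8.2),
        ("Cars", "animation", 7.1),
    ],
)

# DISTINCT: each genre is returned once, even though it appears twice.
genres = [r[0] for r in conn.execute(
    "SELECT DISTINCT genre FROM movies ORDER BY genre")]

# LIKE: '_' matches exactly one character, so 'Se_en' matches both spellings.
similar = [r[0] for r in conn.execute(
    "SELECT name FROM movies WHERE name LIKE 'Se_en' ORDER BY name")]

# CASE: if-then logic that produces a new output column.
labels = conn.execute("""
    SELECT name,
           CASE WHEN imdb_rating > 8 THEN 'Fantastic' ELSE 'Okay' END AS verdict
    FROM movies
    ORDER BY name
""").fetchall()

print(genres)   # ['animation', 'drama']
print(similar)  # ['Se7en', 'Seven']
print(labels)
```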

WITH statements

```
WITH previous_results AS (
   SELECT ...
   ...
   ...
   ...
)
SELECT *
FROM previous_results
JOIN customers
  ON _____ = _____;
```

Numpy with Statistics

np.percentile(d, 40) returns the value that splits array d so that 40% of the values fall below it and 60% fall above it.
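
A quick check of that statement on a made-up array (the integers 1 to 100, an assumption for illustration), using NumPy's default linear interpolation:

```python
import numpy as np

d = np.arange(1, 101)            # made-up data: 1, 2, ..., 100

p40 = np.percentile(d, 40)       # value at the 40th percentile
share_below = np.mean(d < p40)   # fraction of values below it

print(p40)          # about 40.6
print(share_below)  # 0.4

# The median is the 50th percentile.
print(np.percentile(d, 50) == np.median(d))
```

The result does not have to be a value from the array: by default NumPy interpolates between the two nearest values.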

histogram:

plt.hist(commutes, range=(20,50), bins=6)
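
To see what range and bins do without drawing a plot, np.histogram takes the same two arguments and returns the bin counts and bin edges. The commute times below are made up for illustration.

```python
import numpy as np

# Made-up commute times in minutes.
commutes = np.array([18, 22, 24, 27, 31, 33, 34, 38, 41, 47, 50, 55])

# range=(20, 50) with bins=6 gives six bins, each 5 minutes wide.
counts, edges = np.histogram(commutes, range=(20, 50), bins=6)

print(edges)         # bin edges: 20, 25, 30, 35, 40, 45, 50
print(counts)        # [2 1 3 1 1 2]
print(counts.sum())  # 10: the values 18 and 55 fall outside the range
```

Values outside the range are dropped, and the last bin includes its right edge, which is why 50 is still counted.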

A unimodal dataset has only one distinct peak.
A bimodal dataset has two distinct peaks. This often happens when the data contains two different populations.
A multimodal dataset has more than two peaks.
A uniform dataset doesn’t have any distinct peaks.
A symmetric dataset has equal amounts of data on both sides of the peak. Both sides should look about the same.
A skew-right dataset has a long tail on the right of the peak, but most of the data is on the left.
A skew-left dataset has a long tail on the left of the peak, but most of the data is on the right.
The type of distribution affects the position of the mean and median. In heavily skewed distributions, the mean becomes a less useful measurement.
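
A tiny sketch of the effect of skew on the mean and median, using Python's statistics module. The values are made up, with one large value forming the long right tail.

```python
import statistics

# Made-up skew-right data: most values are small, one sits far to the right.
skew_right = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]

mean = statistics.mean(skew_right)
median = statistics.median(skew_right)

print(mean)    # 6.7
print(median)  # 3.0
```

The single value in the tail pulls the mean above every other data point, while the median stays in the middle of the bulk of the data.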
The normal distribution is a symmetric, unimodal distribution.
Random number generator (draws samples from a normal distribution):
- a = np.random.normal(loc=0, scale=1, size=100000)
- loc (mean of the normal distribution), scale (standard deviation of the normal distribution), size (number of random values to draw)
We expect about 68% of our dataset to lie in [mean-std, mean+std]:
- 68% of our samples will fall between +/- 1 standard deviation of the mean
- 95% of our samples will fall between +/- 2 standard deviations of the mean
- 99.7% of our samples will fall between +/- 3 standard deviations of the mean
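
These three percentages can be checked on simulated data. A minimal sketch, assuming a fixed seed so the run is repeatable (the seed value is arbitrary):

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatability
a = np.random.normal(loc=0, scale=1, size=100000)

mean, std = a.mean(), a.std()

# Fraction of samples within k standard deviations of the mean.
shares = {k: float(np.mean(np.abs(a - mean) <= k * std)) for k in (1, 2, 3)}

for k, share in shares.items():
    print(k, round(share, 4))
```

The printed fractions should land close to 0.68, 0.95 and 0.997, with small differences from run to run if the seed is changed.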
The binomial distribution tells us how likely it is for a certain number of “successes” to happen, given a probability of success and a number of trials.
- The binomial distribution is important because it allows us to know how likely a certain outcome is, even when it’s not the expected one.
- Example: 70% of people buy the chicken flavor (70 out of 100 people will choose chicken), but the probability that exactly 7 out of 10 people choose chicken is quite low (only about 27%).
- np.random.binomial(10, 0.30, size=10000) simulates 10000 experiments of 10 trials each, with a 30% chance of success per trial, and returns the number of successes in each experiment.
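```
import numpy as np

# Our basketball player has a 30% chance of making any individual basket.
# He took 10 shots and made 4 of them, even though we only expected him to make 3.
# What percent chance did he have of making exactly 4 shots?

# Simulate 10000 rounds of 10 shots; each entry is the number of baskets made.
a = np.random.binomial(10, 0.30, size=10000)

# Fraction of rounds with exactly 4 baskets.
np.mean(a == 4)
# e.g. 0.1973 (varies from run to run)

# 2nd way
len(a[a == 4]) / len(a)
```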
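
The simulation can be cross-checked against the exact binomial formula, $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$. The helper name binomial_pmf below is my own, not something from the course.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Exactly 7 of 10 people choosing chicken when 70% prefer it.
chicken = binomial_pmf(7, 10, 0.70)

# Exactly 4 of 10 baskets for a 30% shooter.
baskets = binomial_pmf(4, 10, 0.30)

print(round(chicken, 4))  # 0.2668
print(round(baskets, 4))  # 0.2001
```

The simulated 0.1973 is an estimate of the exact 0.2001. With more simulated rounds the two values get closer.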

Hypothesis Testing (SciPy)

Link course.
engagement -> time people spend on your website.
Performing an A/B test — are the different observations really the results of different conditions (i.e., Condition A vs. Condition B)? Or just the result of random chance?
Conducting a survey — is the fact that men gave slightly different responses than women a real difference between men and women? Or just the result of chance?
The individual measurements on Monday, Tuesday, and Wednesday are called samples. A sample is a subset of the entire population. The mean of each sample is the sample mean and it is an estimate of the population mean.
Central Limit Theorem:
- Sometimes one sample over-represents part of the population, which skews that sample in one direction relative to the total population.
- If the sample size is large enough, the sample means cluster closely around the population mean, and their distribution is approximately normal.
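
A small simulation of this idea. The population here is made up (a skewed exponential distribution), and the sample sizes 5 and 500 are arbitrary choices for the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for repeatability

# Made-up, heavily skewed population.
population = rng.exponential(scale=2.0, size=100_000)


def sample_means(sample_size, n_samples=2000):
    """Draw n_samples random samples and return the mean of each one."""
    samples = rng.choice(population, size=(n_samples, sample_size))
    return samples.mean(axis=1)


small = sample_means(5)    # means of small samples
large = sample_means(500)  # means of large samples

print(population.mean())
print(small.mean(), small.std())
print(large.mean(), large.std())
```

Both sets of sample means center on the population mean, but the means of the large samples are spread far less widely around it.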
Hypothesis Tests:
- Hypothesis testing is a mathematical way of determining whether we can be confident that the null hypothesis is false.
- null hypothesis ($H_0$): the null hypothesis is the proposition that there is no effect or no relationship between phenomena or populations. (ThoughtCo)
- We can test these null hypotheses to see whether they may be false, and from that learn about the relationship between the quantities involved.
- The alternate hypothesis ($H_A$ or $H_1$): the proposition that there is an effect or a relationship.
- Example (How to State a Null Hypothesis?): the relationship between the number of workouts per week and the number of kg lost. Suppose that working out 5 times a week leads to a loss of 6 kg. If we now cut the number of workouts to 3 per week, will the number of kg lost be less than 6?
  - $H_A = H_1: \mu < 6$ (alternate hypothesis)
  - $H_0: \mu \ge 6$ (the weight lost does not drop, and may even go up)
  - Another way to state it: $H_0: \mu = 6$ (working out less often has no effect on the number of kg lost)
- Another example:
  - “Hyperactivity is unrelated to eating sugar” is an example of a null hypothesis.
- Type I = False Positive, Type II = False Negative. Check my article about Confusion matrix.
  - Type I = FP = the null hypothesis is rejected even though it is true.
  - Type II = FN = the null hypothesis is not rejected even though it is false.
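
To put numbers on the two error types, here is a sketch with a made-up decision rule. The null hypothesis is "the player makes 30% of his shots", and we reject it if he makes 6 or more out of 10. The cutoff of 6 and the alternative true rate of 50% are assumptions chosen for illustration.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n, cutoff = 10, 6  # hypothetical rule: reject H0 when baskets >= 6

# Type I: H0 is true (rate really is 0.30), yet the rule rejects it.
type_1 = prob_at_least(cutoff, n, 0.30)

# Type II: H0 is false (rate is really 0.50), yet the rule does not reject it.
type_2 = 1 - prob_at_least(cutoff, n, 0.50)

print(round(type_1, 4))  # 0.0473
print(round(type_2, 4))  # 0.623
```

Raising the cutoff lowers the Type I rate but raises the Type II rate, so the two errors trade off against each other.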
P-values: A hypothesis test provides a numerical answer, called a p-value, that helps us decide how confident we can be in the result.
- A p-value is the probability of obtaining statistics at least as extreme as the ones we observed, under the assumption that the null hypothesis is true.
- Example: A p-value of 0.05 means that, if there were really no difference between the two population means, a difference at least as large as the observed one would still show up about 5% of the time.
- A higher significance threshold (the cutoff the p-value is compared against) is more likely to give a FP, so if we want to be very sure that the result is not due to just chance, we select a very small threshold.
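
A worked p-value, as a sketch. This uses an exact one-sided binomial test written by hand, not the SciPy functions from the course. The scenario (a 30% shooter who makes 7 of 10) and the 0.05 threshold are assumptions for illustration.

```python
from math import comb


def p_value_at_least(observed, n, p_null):
    """P(X >= observed) for X ~ Binomial(n, p_null): a one-sided p-value."""
    return sum(
        comb(n, k) * p_null**k * (1 - p_null) ** (n - k)
        for k in range(observed, n + 1)
    )


# H0: the player's success rate is 0.30. Observed: 7 baskets out of 10.
p_value = p_value_at_least(7, 10, 0.30)

alpha = 0.05  # assumed significance threshold
reject_null = p_value < alpha

print(round(p_value, 4))  # 0.0106
print(reject_null)        # True
```

If the player really shot at 30%, a result of 7 or more baskets would happen only about 1% of the time, so at this threshold we reject the null hypothesis.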