Codecademy - DS 1
Posted on 24/07/2019, in Data Science.This note is created when I started to learn the Data Science on Codecademy.
A day in life - Data Analyst
- Data extraction with SQL
- Programming basics with Python
- Data analysis using pandas, a Python library
- Data visualization using Matplotlib, a Python library
- Machine Learning using scikit-learn, a Python library
Relational Database Management System (RDMS)
- RDBMS use SQL language to access the database.
- Popular RDBMS:
    - SQLite:
        - all of the data can be stored locally
- popular choice for databases in cellphones, PDAs, MP3 players, set-top boxes, and other electronic gadgets. The SQL courses on Codecademy use SQLite.
 
- MySQL:
        - the most popular open source SQL database
- easy to use, inexpensive, reliable, large community of developers
- poor performance when scaling, open source development has lagged
- does not include some advanced features that developers may be used to
 
- PostgreSQL:
        - open source SQL database
- shares many of the same advantages of MySQL
- foreign key support without requiring complex configuration.
- slower in performance than other databases
 
- Oracle DB:
        - not open sourced (Oracle Corporation owns)
- for large applications, particularly in the banking industry
 
- SQL Server:
        - Microsoft owns
- Large enterprise applications mostly use SQL Server.
- offers a free entry-level version called Express
 
 
- SQLite:
        
SQL
- Just look up at this site!
- 
    ALTER TABLEstatement adds a new column to a table.ALTER TABLE celebs ADD COLUMN twitter_handle TEXT;
- 
    Constraints that add information about how a column can be used are invoked after specifying the data type for a column. CREATE TABLE celebs ( id INTEGER PRIMARY KEY, name TEXT UNIQUE, date_of_birth TEXT NOT NULL, date_of_death TEXT DEFAULT 'Not Applicable' );
- 
    ASSELECT name AS 'ten' FROM movies;
- DISTINCTis used to return unique values in the output. It filters out all duplicate values in the specified column(s).
- LIKEcan be a useful operator when you want to compare similar values. Check this for other usesages.
- A CASEstatement allows us to create different outputs (usually in the SELECT statement). It is SQL’s way of handling if-then logic.
- Cross join
- 
    withstatementsWITH previous_results AS ( SELECT ... ... ... ... ) SELECT * FROM previous_results JOIN customers ON _____ = _____;
Numpy with Statistics
- np.percentile(d, 40)gives the number which divides array- dinto 40% and 60%.
- 
    histogram: plt.hist(commutes, range=(20,50), bins=6)
- A unimodal dataset has only one distinct peak. (1 đỉnh)
- A bimodal dataset has two distinct peaks. This often happens when the data contains two different populations. (2 đỉnh)
- A multimodal dataset has more than two peaks.
- A uniform dataset doesn’t have any distinct peaks.
- A symmetric dataset has equal amounts of data on both sides of the peak. Both sides should look about the same.
- A skew-right dataset has a long tail on the right of the peak, but most of the data is on the left.
- A skew-left dataset has a long tail on the left of the peak, but most of the data is on the right.
- The type of distribution affects the position of the mean and median. In heavily skewed distributions, the mean becomes a less useful measurement.
- the normal distribution, which is a symmetric, unimodal distribution.
- random number generator (fit a normal distribution):
    - a = np.random.normal(loc=0, scale=1, size=100000)
- loc(mean of normal dist),- scale(SD of ND),- size(# of random numbers)
 
- We expect that 68% of our dataset to be between [mean-std, mean+std]
    - 68% of our samples will fall between +/- 1 standard deviation of the mean
- 95% of our samples will fall between +/- 2 standard deviations of the mean
- 99.7% of our samples will fall between +/- 3 standard deviations of the mean
 
- The binomial distribution can help us. It tells us how likely it is for a certain number of “successes” to happen, given a probability of success and a number of trials.
    - The binomial distribution is important because it allows us to know how likely a certain outcome is, even when it’s not the expected one.
- Exp: 70% số người mua vị gà (70 trong 100 người sẽ chọn gà) nhưng khả năng “7 trong 10 người chọn gà” thì rất thấp (27% mà thôi).
- np.random.binomial(10, 0.30, size=10000)
 # Our basketball player has a 30% chance of making any individual basket. He took 10 shots and made 4 of them, even though we only expected him to make 3. What percent chance did he have of making those 4 shots? a = np.random.binomial(10, 0.30, size=10000) np.mean(a == 4) # 0.1973 # 2nd way len(a[a==4]) / len(a)
Hypothesis Testing (SciPy)
- Link course.
- engagement -> time people spend on your website.
- Performing an A/B test — are the different observations really the results of different conditions (i.e., Condition A vs. Condition B)? Or just the result of random chance?
- Conducting a survey — is the fact that men gave slightly different responses than women a real difference between men and women? Or just the result of chance?
- The individual measurements on Monday, Tuesday, and Wednesday are called samples. A sample is a subset of the entire population. The mean of each sample is the sample mean and it is an estimate of the population mean.
- Central Limit Theorem:
    - Sometime, you measured more one sample than the others. That makes your sample selection skewed to one direction of the total population.
- if we have a large enough sample size, all of our sample means will be sufficiently close to the population mean.
 
- Hypothesis Tests:
    - Hypothesis testing is a mathematical way of determining whether we can be confident that the null hypothesis is false.
- null hypothesis ($H_0$): the null hypothesis is the proposition that there is no effect or no relationship between phenomena or populations. (ThoughtCo)
- Chúng ta có thể test các null hypothesis này để thấy rằng chúng có thể sai mà từ đó thấy được mối quan hệ của các thành phần.
- The alternate hypothesis ($H_A$ or $H_1$)
- Example (How to State a Null Hypothesis?): Mối liên quan giữa số lần tập thể dục mỗi tuần và số kg giảm được. Giả sử mỗi tuần tập 5 lần sẽ giảm 6kg. Bây giờ ta giảm số lần tập mỗi tuần xuống còn 3 thì liệu số kg giảm được sẽ ít hơn 6 ko?
        - $H_A=H_1={ \mu<6 }$ (Alternate hypothesis)
- $H_0 = { \mu\ge 6 }$ (chẳng những không giảm mà còn tăng)
- Cách biểu diễn khác: $H_0 = { \mu = 6 }$ (giảm số lần tập cũng không ảnh hưởng đến số kg giảm)
 
- Other example:
        - “Hyperactivity is unrelated to eating sugar” (Tăng động không liên quan đến ăn đường) is an example of a null hypothesis.
 
- Type I = False Positive, Type II = False Negative. Check my article about Confusion matrix.
        - Type I = FP = the null hypothesis is rejected even though it is true.
- Type II = FN = the null hypothesis is accepted even though it is false.
 
 
- P-Values: A hypothesis test provides a numerical answer, called a p-value, that helps us decide how confident we can be in the result.
    - a p-value is the probability that we yield the observed statistics under the assumption that the null hypothesis is true.
- Example: A p-value of 0.05 would mean that there is a 5% chance that there is no difference between the two population means.
- A higher p-value is more likely to give a FP so if we want to be very sure that the result is not due to just chance, we will select a very small p-value.
 
