ML Coursera 8 - w8: Unsupervised Learning
Posted on 27/10/2018, in Machine Learning. This note was first taken while I was learning the Machine Learning course on Coursera.
Lectures in this week: Lecture 13, Lecture 14.
This note by Alex Holehouse is also very useful (it should be read alongside my note).
Clustering
Unsupervised Learning
- This is the first unsupervised learning algorithm we study.
- There is no label associated with the data.
- There is $X$ but no $y$.
- Unsupervised learning tries to determine structure in the data.
- A clustering algorithm groups data together based on data features.
K-means Algorithm
- The most popular and widely used algorithm.
- See here for a figure illustrating the idea.
- Algorithm: repeat a cluster assignment step and a move centroid step (see the summary below).
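In symbols (with $C_k$ denoting the set of examples currently assigned to centroid $k$), the two steps repeated until convergence are:

$$c^{(i)} := \arg\min_{k}\ \left\|x^{(i)} - \mu_{k}\right\|^{2} \quad\text{(cluster assignment)}$$

$$\mu_{k} := \frac{1}{|C_{k}|}\sum_{i \in C_{k}} x^{(i)} \quad\text{(move centroid)}$$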
Optimization objective
- K-means also has an optimization objective (a cost function, like in supervised learning).
- Knowing this is helpful because
- it helps with debugging,
- it helps find a better clustering.
- The distortion cost function:

$$J(c^{(1)},\ldots,c^{(m)},\mu_1,\ldots,\mu_K) = \frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)} - \mu_{c^{(i)}}\right\|^{2}$$

- Minimize $J$ to find the $c^{(i)}$ and $\mu_{c^{(i)}}$.
Random Initialization
- This section also discusses how to avoid local optima.
- $K$: number of clusters, $m$: the number of training examples (usually $K<m$)
- Randomly choose $K$ training examples as the initial cluster centroids.
- If we happen to choose bad examples, K-means may converge to a local optimum.
- To avoid that, run K-means with random initialization multiple times (100 times, for example), as sketched below.
- This gives 100 different clusterings.
- Choose the one that gives the minimum $J$.
- This trick mainly helps when $K$ is small ($K=2,\ldots,10$).
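A minimal Octave sketch of this multi-restart strategy (runKMeans is an assumed helper that runs the K-means loop from the programming assignment below and returns the final centroids and assignments):

% Run K-means 100 times with different random initializations and
% keep the clustering with the lowest distortion J.
best_J = Inf;
for t = 1:100
    init = X(randperm(size(X,1), K), :);          % K random examples as initial centroids
    [centroids, idx] = runKMeans(X, init, 10);    % assumed helper, 10 iterations
    J = mean(sum((X - centroids(idx,:)).^2, 2));  % distortion cost
    if J < best_J
        best_J = J;
        best_centroids = centroids;
        best_idx = idx;
    end
end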
Choosing the number of clusters
- Elbow method: plot the cost function $J$ against the number of clusters.
- Look for the “elbow”, the point where the curve changes direction sharply.
- The elbow method isn't used very often in practice!
- If $K=5$ gives a bigger $J$ than $K=3$, then K-means got stuck in a bad local minimum; you should try re-running K-means with multiple random initializations.
- Often the number of clusters is chosen based on the later/downstream purpose (for example, 3 or 5 T-shirt sizes).
Dimensionality reduction
- Motivation: data compression + data visualization
- Holehouse’s note.
Data compression
- Speeds up algorithms and reduces the space needed to store the data.
- If all points in 2D lie on one straight line, we can reduce them to 1D. The same idea applies from 3D to 2D.
Data Visualization
- We want to reduce the dimension to 2 or 3 so that we can visualize the data.
Principal component analysis (PCA)
PCA formulation
- Holehouse’s note.
- For the problem of dimensionality reduction, the most commonly used algorithm is PCA.
- For example, to go from 2D to 1D we want to find a line such that the projection error of every data point onto this line is small. The direction vector of this line is what we want to find.
- You should normally do mean normalization and feature scaling (see the earlier weeks) on your data before PCA.
- PCA is not linear regression!! PCA takes the error to be the orthogonal projection distances, while linear regression takes the error to be the deviation along $y$.
- The photo below compares PCA (right) and linear regression (left).
- Linear regression has $y$ to compare against, while PCA has no $y$; every feature $x$ plays an equal role.
PCA algorithm
- How to use PCA yourself, and how to reduce the dimension of your data.
- Data preprocessing: feature scaling/mean normalization
- Compute the covariance matrix $\Sigma$.
- Compute the eigenvectors of $\Sigma$: use svd (“singular value decomposition”) in MATLAB/Octave, or eig, a function with the same purpose (svd is more numerically stable than eig).
- Covariance matrix: in probability theory and statistics, a covariance matrix is a matrix whose element in the $i, j$ position is the covariance between the $i$-th and $j$-th elements of a random vector. Here $\Sigma = \frac{1}{m} X^{T} X$ (with one example per row of $X$).
- After obtaining $U$, just take its first $k$ columns if we want to reduce from $n$ dimensions to $k$.
- Algorithm: a summary sketch is given below.
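A minimal Octave sketch of the whole algorithm (assuming $X$ is an $m\times n$ data matrix with one example per row and that $K$ is already chosen); it mirrors the vectorized formulas used in the programming exercise below:

mu = mean(X);                                 % mean of each feature
X_norm = (X - mu) ./ std(X);                  % mean normalization + feature scaling
Sigma = (1/size(X,1)) * (X_norm' * X_norm);   % n-by-n covariance matrix
[U, S, ~] = svd(Sigma);                       % columns of U are the principal components
Ureduce = U(:, 1:K);                          % keep the first K eigenvectors
Z = X_norm * Ureduce;                         % m-by-K compressed representation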
Applying PCA
Reconstruction from compressed representation
From $z$ back to $x$: for example, from the 1D representation back to 2D, using $x_{\text{approx}} = U_{\text{reduce}}\, z$.
Choosing the number of principal components
Try $k = 1, 2, \ldots$ and choose the smallest $k$ for which the following fraction is less than 0.001 (i.e. 99.9% of the variance is retained):

$$\frac{\frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)} - x_{\text{approx}}^{(i)}\right\|^{2}}{\frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)}\right\|^{2}} \le 0.001$$

The numerator is small when we lose very little information in the dimensionality reduction, so when we decompress we regenerate data very close to the original.
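A hedged Octave sketch of this selection, using the equivalent check on the singular values returned by svd (the diagonal of $S$); the 0.001 threshold follows the note above:

s = diag(S);                         % singular values of the covariance matrix
for K = 1:length(s)
    if 1 - sum(s(1:K)) / sum(s) <= 0.001
        break;                       % K is now the smallest acceptable dimension
    end
end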
Advice for applying PCA
- Andrew uses PCA very often
- The PCA mapping should be defined using only the training set.
- Then apply the obtained parameters to the cross-validation data and test set (see the sketch after this list).
- A bad use of PCA: using it to prevent over-fitting. It is better to use regularization.
- Always use the normal approach (without PCA) to solve a problem first; only if you CANNOT DO MORE should you think about adding PCA.
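A minimal sketch of this train-then-apply pattern (variable names like Xtrain, Xcv, Xtest are illustrative; feature scaling omitted for brevity):

% Fit the PCA mapping on the training set only
mu = mean(Xtrain);
Sigma = (1/size(Xtrain,1)) * ((Xtrain - mu)' * (Xtrain - mu));
[U, ~, ~] = svd(Sigma);
Ureduce = U(:, 1:K);
Ztrain = (Xtrain - mu) * Ureduce;
% Re-use the SAME mu and Ureduce on the cross-validation and test sets
Zcv   = (Xcv   - mu) * Ureduce;
Ztest = (Xtest - mu) * Ureduce;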
Programming assignment
K-means Clustering
- The K-means algorithm is a method to automatically cluster similar data examples together.
- The K-means algorithm is as follows:
% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: Assign each data point to the
    % closest centroid. idx(i) corresponds to c^(i), the index
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);
    % Move centroid step: Compute means based on centroid
    % assignments
    centroids = computeMeans(X, idx, K);
end
- Size: $X\in \mathbb{R}^{m\times n}$, $K\in \mathbb{R}$, centroids $\in \mathbb{R}^{K\times n}$, $c\in \mathbb{R}^{m\times 1}$ (idx is $c$)
- $m$ examples, $n$ features
- File findClosestCentroids.m
for i = 1:size(X,1)
    best_dist = Inf;   % start from Inf instead of an arbitrary constant like 100
    for k = 1:K
        dist = norm(X(i,:) - centroids(k,:));
        if dist < best_dist
            best_dist = dist;
            min_idx = k;
        end
    end
    idx(i,1) = min_idx;   % index of the closest centroid for example i
end
- Computing centroid means: file computeCentroids.m
for k = 1:K
    idxCk = find(idx == k);   % indices of all examples assigned to centroid k
    centroids(k,:) = sum(X(idxCk,:), 1) / numel(idxCk);   % mean of those examples
end
- File kMeansInitCentroids.m (Random initialization)
% Initialize the centroids to be random examples
% Randomly reorder the indices of examples
randidx = randperm(size(X, 1));
% Take the first K examples as centroids
centroids = X(randidx(1:K), :);
Image compression with K-means
Check the guide in ex7.pdf, page 7.
In this exercise, you will use the K-means algorithm to select the 16 colors that will be used to represent the compressed image. Concretely, you will treat every pixel in the original image as a data example and use the K-means algorithm to find the 16 colors that best group (cluster) the pixels in the 3-dimensional RGB space. Once you have computed the cluster centroids on the image, you will then use the 16 colors to replace the pixels in the original image.
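A hedged sketch of the whole pipeline in Octave, using the helper functions written above (the image file name is illustrative):

A = double(imread('bird_small.png')) / 255;   % load image, scale pixel values to [0, 1]
X = reshape(A, size(A,1) * size(A,2), 3);     % one row per pixel: (R, G, B)
K = 16;                                       % compress to 16 colors
centroids = kMeansInitCentroids(X, K);
for iter = 1:10
    idx = findClosestCentroids(X, centroids);
    centroids = computeCentroids(X, idx, K);
end
idx = findClosestCentroids(X, centroids);     % final color assignment of each pixel
X_rec = centroids(idx, :);                    % replace each pixel by its centroid color
A_rec = reshape(X_rec, size(A,1), size(A,2), 3);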
PCA
PCA consists of two computational steps:
- First, you compute the covariance matrix of the data.
- Then, you use Octave/MATLAB’s svd function to compute the eigenvectors $U_1, U_2,\ldots,U_n$.
Before using PCA, it is important to first normalize the data by subtracting the mean value of each feature from the dataset, and scaling each dimension so that they are in the same range.
File pca.m
Sigma = 1/m * (X'*X);    % covariance matrix of the (normalized) data
[U, S, ~] = svd(Sigma);  % columns of U are the principal components
Dimensionality Reduction with PCA
File projectData.m:
Ureduce = U(:,1:K);   % first K principal components
Z = X*Ureduce;        % project the data onto K dimensions
File recoverData.m
Ureduce = U(:,1:K);
X_rec = Z*Ureduce';   % approximate reconstruction back to n dimensions
Face Image Dataset
In this part of the exercise, you will run PCA on face images to see how it can be used in practice for dimension reduction.
For example, if you were training a neural network to perform person recognition (given a face image, predict the identity of the person), you could use the dimension-reduced input of only 100 dimensions instead of the original pixels.