ML Coursera 9 - w9: Anomaly Detection & Recommender Systems

Posted on 03/11/2018, in Machine Learning.

This note was first taken when I learnt the machine learning course on Coursera.
Lectures in this week: Lecture 15, Lecture 16.

This note of Alex Holehouse is also very useful (need to be used alongside with my note).

Go back to Week 8.

In this post

Density Estimation
Building an Anomaly Detection System
Multivariate Gaussian distribution
- Multivariate Gaussian distribution
- Anomaly Detection using the Multivariate Gaussian distribution
Predicting movie ratings
- Problem formulation
- Content based recommendations
Collaborative filtering
- Collaborative filtering
- Collaborative filtering algorithm
Low rank matrix factorization
- Vectorization: low rank matrix factorization
- Implementational detail: mean normalization

Density Estimation

Problem motivation

Check if there is an anomalous data-point in the data?
How we do this?
- Using our training dataset we build a model with $p(x)$ “What is the probability that example x is normal?”
- If $p(x_{test}) < \epsilon$: flag this as an anomaly
- If $p(x_{test}) \ge \epsilon$: flag this as OK.
- $\epsilon$ is some threshold probability value which we define, depending on how sure we need/want to be
Application: Fraud detection, Manufacturing (aircraft engine), Monitoring computers in data center,…

Gaussian distribution (normal distribution)

$\mu$: mean, average of examples.
$\sigma^2$: standard deviation squared.
We can take $m-1$ instead of $m$ for $\sigma^2$. Slightly different mathematical problems, but in practice it makes little difference.

$$ \begin{align} P(x\mid \mu ,\sigma^2) &= {\frac{1}{\sqrt {2\pi \sigma ^{2}}}}e^{-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}}, \\ \mu &= \frac{1}{m}\sum_{i=1}^m x^{(i)}, \\ \sigma^2 &= \frac{1}{m}\sum_{i=1}^m (x^{(i)}-\mu)^2 \end{align} $$

Gaussian distribution

Algorithm

Note of Househole
Given training set $x^{(i)}\in \mathbb{R}^n$ for $i=1,…,m$.
$x_j$ feature! for $j=1,…,n$.
At each feature, there is m example $x_j^{(i)}$ for $i=1,…,m$. From these m examples, we can find (for each feature $x_j$) a couple $\mu_j, \sigma_j^2$.
Given a test $x^{(test)}_{j}$, we can compute
$p(x^{(test)}) = \Pi_{j=1}^n p(x^{(test)}_j;\mu_j,\sigma_j^2)$
The importance is the choice of $\epsilon$.

Building an Anomaly Detection System

Developing and Evaluating an Anomaly Dection System

Note of Househole
To develop an anomaly detection system quickly, would be helpful to have a way to evaluate your algorithm
Sometimes, people use the same data set for Cross Validation set and Test but this is not a good practice and less recommended.

See again “Handling skewed data”.

So far we’ve been treating anomalous detection with unlabeled data. If you have labeled data allows evaluation: if you think something iss anomalous you can be sure if it is or not.
If you have CV set you can see how varying $\epsilon$ effects various evaluation metrics. Then pick the value of epsilon which maximizes the score on your CV set
Evaluate algorithm using cross validation
Do final algorithm evaluation on the test set

Anomaly Detection vs Supervised Learning

If we have labeled data, why we not use a supervised learning algorithm?
When to use anomaly, when to use supervised?

Anomaly Detection vs Supervised Learning

Anomaly Detection vs Supervised Learning 2

Choosing what features to use

If our data after plot (hist) doesn’t look like Gaussian shape, we can “make” it again by change the feature a little bit.
Suppose your anomaly detection algorithm is performing poorly and outputs a large value of p(x) for many normal examples and for many anomalous examples in your cross validation dataset. Which of the following changes to your algorithm is most likely to help?
- Try coming up with more features to distinguish between the normal and the anomalous examples.

Multivariate Gaussian distribution

Problem: in each single gaussian, it’s good with high probability but actually not!

Multivariate Gaussian

Multivariate Gaussian 2

Multivariate Gaussian 3

Multivariate Gaussian 4

Multivariate Gaussian 5

Multivariate Gaussian 6

Anomaly Detection using the Multivariate Gaussian distribution

Anomaly Detection using Multivariate Gaussian

Anomaly Detection using Multivariate Gaussian 2

Original vs Multivariate

(check Househole’s note for more details)

Predicting movie ratings

Check Househole’s note.
Recommender Systems: important application of ML.
- In academy, it has less attension but in industry, it costs!
There is not so much a technique, it’s a big idea.
Recommender systems do this - try and identify the crucial and relevant features

Problem formulation

There are some movies that some users haven’t seen yet. Based on what they have already seen or others did, we can predict their rate for movies that they haven’t watched. From these prediction, we can recommend the movies they like (if the predicted rating is high).

Content based recommendations

This approach is called content-based approach because we assume we have features regarding the content which will help us identify things that make them appealing to a user.
- However, often such features are not available - next we discuss a non-contents based approach!

Collaborative filtering

It can learn for itself what features it needs to learn.
The values of $\Theta$ show the “idea” of the user. For example, Alice says that she really like the romantic movie and she doesn’t like action movies, i.e. $\Theta = [0 5 0]$.

Collaborative filtering