ML Coursera 3 - w3: Logistic Regression

Posted on 11/09/2018, in Machine Learning.

I first took these notes while following the Machine Learning course on Coursera.
Lectures in this week: Lecture 6, Lecture 7.

Go back to Week 2.

In this post

Classification & Representation
Logistic regression model
Multiclass classification: one-vs-all
Solving the problem of overfitting
Programming exercise: Logistic Regression
- Logistic Regression
- Regularized logistic regression

Classification & Representation

Download Lecture 6.

Classification

Variable $y$ has discrete values.
Other names for the values of $y$: 1 (positive class), 0 (negative class) $\Rightarrow$ binary classification problem.
If $y$ has more than 2 values, it’s called multiclass classification.
Linear regression works poorly here because a few outlying examples can shift the fitted line (the blue line) and change the predictions far more than the other examples do.
With linear regression, $h_{\theta}(x)$ can be $>1$ or $<0$, but we want $0\le h_{\theta}(x) \le 1$. Logistic regression guarantees this: its $h_{\theta}(x)$ always lies between 0 and 1.
Don’t be confused by the name: logistic regression is a classification algorithm, used when $y$ takes discrete values.

Hypothesis representation

Which function are we going to use to represent the hypothesis?
Logistic regression

$$ \begin{align} h_{\theta}(x) &= g(\theta^Tx), \\ g(z) &= \dfrac{1}{1+e^{-z}}, \\ h_{\theta}(x) &= \dfrac{1}{1+e^{-\theta^Tx}} \end{align} $$

They are the same: sigmoid function = logistic function = $g(z)$
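As a quick check of the definition, here is a minimal sketch of $g(z)$ and $h_{\theta}(x)$. It is written in plain Python rather than Octave so that it runs without any toolbox; the function names `sigmoid` and `hypothesis` are my own choices, and the two-branch form is just one common way to avoid overflow in `exp` for very negative $z$.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form e^z / (1 + e^z),
    # so that exp() is never called on a large positive number.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x must already contain x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

print(sigmoid(0))  # 0.5
```

For large positive $z$ the value approaches 1 and for large negative $z$ it approaches 0, which is exactly the range we wanted for $h_{\theta}$.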

Logistic function g(z)

Some probabilities

$$ \begin{align*}& h_\theta(x) = P(y=1 | x ; \theta) = 1 - P(y=0 | x ; \theta) \newline& P(y = 0 | x;\theta) + P(y = 1 | x ; \theta) = 1\end{align*} $$

Decision Boundary

From the above figure, we see that

$\begin{align*}& h_\theta(x) \geq 0.5 \rightarrow y = 1 \newline& h_\theta(x) < 0.5 \rightarrow y = 0 \newline\end{align*}$

Since $g(z) \geq 0.5$ exactly when $z \geq 0$, and $z = \theta^T x$, we have

$$ \begin{align*}& \theta^T x \geq 0 \Rightarrow y = 1 \newline& \theta^T x < 0 \Rightarrow y = 0 \newline\end{align*} $$

The decision boundary is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function.

An example,

Example of decision boundary

The decision boundary is a property of the hypothesis and its parameter $\theta$, not of the training set. The training set is used only to fit the parameter $\theta$.
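The rule $\theta^T x \geq 0 \Rightarrow y = 1$ can be sketched as follows. The values of `theta` are illustrative only (they are not fitted to any data); with them the boundary is the line $x_1 + x_2 = 3$.

```python
def predict(theta, x):
    """Return 1 if theta^T x >= 0 (equivalently h_theta(x) >= 0.5), else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

# Illustrative parameters: boundary is -3 + x1 + x2 = 0
theta = [-3.0, 1.0, 1.0]

# Each x starts with the bias term x_0 = 1
print(predict(theta, [1.0, 1.0, 1.0]))  # 0, because -3 + 1 + 1 < 0
print(predict(theta, [1.0, 2.0, 2.0]))  # 1, because -3 + 2 + 2 >= 0
```

Note that no training example appears in `predict`: once $\theta$ is known, the boundary is fixed.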

Example of decision boundary

Logistic regression model

Cost function

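Look back at the cost function in linear regression. With the sigmoid hypothesis, that squared-error cost is non-convex, so logistic regression uses a different cost: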

$\begin{align} J(\theta) &= \dfrac{1}{m}\sum_{i=1}^m \text{Cost}(h_{\theta}(x^{(i)}),y^{(i)}), \\ \text{Cost}(h_{\theta}(x),y) &= \begin{cases} -\log(h_{\theta}(x)) \quad \text{ if } y=1, \\ -\log(1-h_{\theta}(x)) \quad \text{ if } y=0. \end{cases} \end{align}$

or we can write,

$$ \text{Cost}(h_{\theta}(x),y) = -y \log(h_{\theta}(x)) - (1-y)\log(1-h_{\theta}(x)) $$

The cost function is rewritten as

$$ J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))] $$

Vectorization

$$ \begin{align*} h &= g(X\theta) \\ J(\theta) &= \frac{1}{m} \cdot \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right) \end{align*} $$

Cost function of logistic regression 1

$\begin{align*} \mathrm{Cost}(h_\theta(x),y) &= 0 \text{ if } h_\theta(x) = y \\ \mathrm{Cost}(h_\theta(x),y) &\rightarrow \infty \text{ if } y = 0 \; \mathrm{and} \; h_\theta(x) \rightarrow 1 \\ \mathrm{Cost}(h_\theta(x),y) &\rightarrow \infty \text{ if } y = 1 \; \mathrm{and} \; h_\theta(x) \rightarrow 0 \end{align*}$

If the hypothesis is confidently “wrong” ($h_{\theta}(x) \to 1$ while $y = 0$, or $h_{\theta}(x) \to 0$ while $y = 1$) then $\text{Cost}\to \infty$.
Written in this form, $J(\theta)$ is always convex.
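To see these properties numerically, here is a small sketch of $J(\theta)$ in plain Python. The toy data `X`, `y` and the helper names are assumptions made for the illustration; the sigmoid is written in its simplest form, which is fine for the small values used here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)).

    Raises ValueError if h is exactly 0 or 1, since log(0) is undefined.
    """
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += -y_i * math.log(h) - (1 - y_i) * math.log(1 - h)
    return total / m

# Toy data: first column is x_0 = 1
X = [[1.0, 0.5], [1.0, -1.5]]
y = [1, 0]

print(cost([0.0, 0.0], X, y))   # log(2) ~ 0.693: h = 0.5 for every example
print(cost([0.0, 5.0], X, y))   # small: confident and right
print(cost([0.0, -5.0], X, y))  # large: confident and wrong
```

The third call is much larger than the first two, which is the "confidently wrong" penalty described above.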

Simplified Cost Function and Gradient Descent

Review gradient descent in linear regression.

In this logistic regression,

Repeat{ $$ \begin{align*} \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \end{align*} $$ (Simultaneously update all $\theta_j$) }

Notice that the equation above looks the same as the one in linear regression; the difference is the definition of $h_{\theta}$, which is now $g(\theta^Tx)$ instead of $\theta^Tx$!

Vectorization

$$ \theta := \theta -\frac{\alpha}{m} X^T(g(X\theta)-y) $$
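The update rule can be sketched in plain Python as below. This is a minimal illustration, not the course's code: the toy data set, the learning rate `alpha=0.1` and the iteration count are all assumptions chosen so that the example runs quickly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(theta, X, y):
    """grad_j = (1/m) * sum_i (h_theta(x_i) - y_i) * x_ij."""
    m = len(y)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        err = sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
        for j, x_ij in enumerate(x_i):
            grad[j] += err * x_ij / m
    return grad

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        grad = gradient(theta, X, y)
        # Build the new vector from the old one: simultaneous update
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Toy, non-separable data: first column is x_0 = 1
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 0, 1, 1]

theta = gradient_descent(X, y)
print(theta)
```

Computing the whole gradient before touching `theta` is what "simultaneously update all $\theta_j$" means in practice.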

Advanced Optimization

“Conjugate gradient”, “BFGS”, and “L-BFGS” are more sophisticated, faster ways to optimize $\theta$ that can be used instead of gradient descent. We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they’re already tested and highly optimized. Octave/Matlab provides them.

A single function that returns both $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$

function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end

Then we can use Octave’s fminunc() optimization algorithm together with the optimset() function, which creates an object containing the options we want to pass to fminunc().

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

We pass fminunc() our cost function, our initial vector of theta values, and the options object that we created beforehand.

fmincg works similarly to fminunc, but is more efficient when dealing with a large number of parameters.

Multiclass classification: one-vs-all

$y$ can take more than the two values 0 and 1. We keep using binary classification: for each class in turn, treat that class as positive and lump all the other classes together as negative.

If $y$ takes $n+1$ values $\Rightarrow n+1$ binary classification problems.

$$ \begin{align*} y &\in \lbrace 0, 1 ... n\rbrace \\ h_\theta^{(0)}(x) &= P(y = 0 | x ; \theta) \\ h_\theta^{(1)}(x) &= P(y = 1 | x ; \theta) \\ \cdots & \\ h_\theta^{(n)}(x) &= P(y = n | x ; \theta) \\ \mathrm{prediction} &= \max_i( h_\theta ^{(i)}(x) ) \end{align*} $$

One vs All 1

After finding optTheta with fmincg for every class, we evaluate $h_{\theta}^{(i)}(x)$ for each class and pick the class with the highest probability (highest $h$). That is the prediction line in the equations above.

Why max?

Each $h_{\theta}^{(j)}(x) = P(y=j\vert x;\theta)$, $j\in \{ 0,\ldots,n \}$, is a binary classifier, and in the binary case we consider $h_{\theta} \ge 0.5$ as true.
With $n+1$ classifiers, several of them (or none) may reach 0.5 for the same $x$, so the threshold alone does not single out one class.
Taking the $j$ with the largest $h_{\theta}^{(j)}(x)$ gives a single answer: the class the classifiers are most confident about.
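A small sketch of the one-vs-all prediction step, in plain Python. The three parameter vectors in `all_theta` are hypothetical, hand-picked numbers (in practice each would come from training one classifier per class), and `predict_one_vs_all` is a name I made up for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(all_theta, x):
    """all_theta[c] holds the parameters of the classifier for class c.

    Returns the class with the highest h and the list of all h values.
    """
    scores = [sigmoid(sum(t * v for t, v in zip(theta_c, x)))
              for theta_c in all_theta]
    best_class = max(range(len(scores)), key=lambda c: scores[c])
    return best_class, scores

# Hypothetical parameters for classes 0, 1, 2 (x has x_0 = 1 and one feature)
all_theta = [[2.0, -1.0], [-1.0, 0.2], [-6.0, 1.5]]

label, scores = predict_one_vs_all(all_theta, [1.0, 3.0])
print(label, scores)
```

For this input none of the three classifiers reaches 0.5, yet taking the maximum still returns one class, which is why the prediction uses max rather than a threshold.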

See ex 4 for an example in practice.

Solving the problem of overfitting

Download Lecture 7.

The problem of overfitting

When we have many features, $h_{\theta}$ may fit the training set very well ($J(\theta) \simeq 0$) but fail to generalize to new examples.

Overfitting

Overfitting 2

Cost function

Options for addressing overfitting:

Reduce the number of features
- Manually select which features to keep
- By algorithm
Regularization (add penalty terms)
- Keep all features but reduce magnitude/values of parameters $\theta_j$
- Works well when we have a lot of features, each of which contributes a bit to predicting $y$

Overfitting 2

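Because we minimize the cost, adding penalty terms such as $1000\cdot\theta_3^2 + 1000\cdot\theta_4^2$ makes any large value of $\theta_3, \theta_4$ very expensive, so the minimum is reached only when they are close to 0. Penalizing all the parameters in the same way gives the regularized cost function: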

$$ J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^n \theta_j^2 \right] $$

If $\lambda$ is too large, the problem of underfitting occurs!

Regularized linear regression

Gradient Descent

See again gradient descent for linear regression, for multiple variables, and for logistic regression.

Repeat{

$\begin{align} \theta_0 &:= \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x_0^{(i)} \\ \theta_j &:= \theta_j(1-\alpha\frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}, j\in \{1,\ldots, n\} \end{align}$

}

Intuitively, the factor $1-\alpha\frac{\lambda}{m}$ is slightly less than 1, so every update first shrinks $\theta_j$ by some amount; the second term is exactly the same as it was before.
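To see where the shrinking factor comes from, differentiate the regularized cost with respect to $\theta_j$ ($j \ge 1$) and group the terms in $\theta_j$:

$$ \begin{align*} \theta_j &:= \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)} + \frac{\lambda}{m}\theta_j \right] \\ &= \theta_j\left(1-\alpha\frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)} \end{align*} $$

With illustrative values $\alpha = 0.01$, $\lambda = 1$, $m = 100$, the factor is $1 - 0.01\cdot\frac{1}{100} = 0.9999$, so each step shrinks $\theta_j$ only very slightly before the usual gradient step is applied.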

Normal equation

See again the normal equation for linear regression.

$$ \begin{align} \theta &= (X^TX + \lambda \cdot L)^{-1} X^T y \\ L &= \begin{bmatrix} 0 & & & & \newline & 1 & & & \newline & & 1 & & \newline & & & \ddots & \newline & & & & 1 \newline\end{bmatrix}_{(n+1)\times (n+1)} \end{align} $$

$X$ : $m\times (n+1)$ matrix
$m$ training examples, $n$ features.
The top-left entry of $L$ is 0 because we don’t regularize $\theta_0$ (the parameter of $x_0$).
If $m \le n$ then $X^TX$ is non-invertible, but for $\lambda > 0$ the matrix $X^TX + \lambda\cdot L$ is invertible!
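The invertibility claim can be checked by hand on the smallest possible case: one training example and one feature, so that $X^TX$ is a $2\times 2$ matrix. The sketch below is plain Python with made-up numbers; the helper names `gram_plus_penalty`, `det2` and `solve2` are mine.

```python
def gram_plus_penalty(X, lam):
    """Return X^T X + lam * L for a design matrix with columns x_0, x_1.

    L = diag(0, 1), so theta_0 is not penalized.
    """
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    return [[a, b], [b, d + lam]]

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def solve2(A, b):
    """Solve the 2x2 system A * theta = b with the explicit inverse."""
    det = det2(A)
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

X = [[1.0, 2.0]]   # m = 1 example, n = 1 feature, so m <= n
y = [3.0]

print(det2(gram_plus_penalty(X, 0.0)))  # 0.0: X^T X is singular
print(det2(gram_plus_penalty(X, 1.0)))  # 1.0: invertible once lambda > 0

# Right-hand side X^T y, then the regularized normal equation
Xty = [sum(r[0] * t for r, t in zip(X, y)), sum(r[1] * t for r, t in zip(X, y))]
print(solve2(gram_plus_penalty(X, 1.0), Xty))  # [3.0, 0.0]
```

With $\lambda = 1$ the solution puts everything into the unpenalized $\theta_0$ and sets $\theta_1 = 0$, which is the behaviour one would expect from penalizing only $\theta_1$.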

Regularized logistic regression

See again cost function for logistic regression.

$J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))]$

We can regularize this equation by adding a term to the end:

$$ J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))] + \dfrac{\lambda}{2m}\sum_{j=1}^n \theta_j^2. $$

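And the gradient descent

Repeat{

$\begin{align} \theta_0 &:= \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x_0^{(i)} \\ \theta_j &:= \theta_j(1-\alpha\frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}, j\in \{1,\ldots, n\} \end{align}$

}

This has the same form as gradient descent for regularized linear regression; the only difference is the definition of $h_{\theta}(x)$.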
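One regularized update for logistic regression can be sketched in plain Python as follows. The function name `regularized_step` and the toy numbers are assumptions for the illustration; the only point is that $\theta_0$ is updated without shrinkage while every other $\theta_j$ is first multiplied by $1-\alpha\frac{\lambda}{m}$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_step(theta, X, y, alpha, lam):
    """One simultaneous gradient-descent update with L2 regularization."""
    m = len(y)
    errors = [sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
              for x_i, y_i in zip(X, y)]
    new_theta = []
    for j, theta_j in enumerate(theta):
        grad_j = sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
        # theta_0 (j == 0) is not regularized, so it is not shrunk
        shrink = 1.0 if j == 0 else 1.0 - alpha * lam / m
        new_theta.append(theta_j * shrink - alpha * grad_j)
    return new_theta

# Toy data and parameters (first column is x_0 = 1)
X = [[1.0, 2.0], [1.0, -1.0]]
y = [1, 0]
theta = [0.5, 1.0]

plain = regularized_step(theta, X, y, alpha=0.1, lam=0.0)
shrunk = regularized_step(theta, X, y, alpha=0.1, lam=4.0)
print(plain, shrunk)
```

Here $\alpha\frac{\lambda}{m} = 0.1\cdot\frac{4}{2} = 0.2$, so the two results agree on $\theta_0$ and differ by $0.2\,\theta_1$ on $\theta_1$.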

Programming exercise: Logistic Regression

Check the instructions and explanation for ex2.

See again How to submit?.

Logistic Regression

plotData: plot the points of X, using y to separate the two kinds of examples (positive and negative)

XPos = X(y==1, :);
XNeg = X(y==0, :);
plot(XPos(:,1), XPos(:,2), 'k+', 'LineWidth', 2, 'MarkerSize', 7);
plot(XNeg(:,1), XNeg(:,2), 'ko', 'MarkerFaceColor', 'y', 'MarkerSize', 7);

sigmoid.m: recall that $g(z) = \dfrac{1}{1+e^{-z}}$
```
g = 1 ./ (1 + exp(-z));
```
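costFunction.m: recall that the cost function in logistic regression is

$$ J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))] $$

or in vectorization,

$$ J(\theta) = \frac{1}{m} \cdot \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right), \quad h = g(X\theta), $$

and its gradient is

$$ \dfrac{\partial J(\theta)}{\partial \theta_j} = \dfrac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}, $$

or in vectorization,

$$ \nabla J(\theta) = \dfrac{1}{m} X^T(g(X\theta) - y). $$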
```
h = sigmoid(X*theta); % hypothesis
J = 1/m * ( -y' * log(h) - (1-y)' * log(1-h) );
grad = 1/m * X' * ( h - y);
```
fminunc:
- Set the GradObj option to on, which tells fminunc that our function returns both the cost and the gradient
```
  options = optimset('GradObj', 'on', 'MaxIter', 400);
```
- Notice that by using fminunc, you did not have to write any loops yourself, or set a learning rate like you did for gradient descent.
predict.m: remember that we predict $y = 1$ when $h_{\theta}(x) \ge 0.5$,
```
h = sigmoid(X*theta); % m x 1
p = (h >= 0.5);
```

Regularized logistic regression

costFunctionReg.m: recall that,

$J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))] + \dfrac{\lambda}{2m}\sum_{j=1}^n \theta_j^2.$

its gradient,

$$ \begin{align} \dfrac{\partial J(\theta)}{\partial \theta_0} &= \dfrac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x_0^{(i)}, \text{ for } j=0 \\ \dfrac{\partial J(\theta)}{\partial \theta_j} &= \left( \dfrac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \dfrac{\lambda}{m}\theta_j, \text{ for } j\ge 1 \end{align} $$

h = sigmoid(X*theta); % hypothesis
J = 1/m * ( -y' * log(h) - (1-y)' * log(1-h) ) + lambda/(2*m) * sum(theta(2:end).^2);

grad(1,1) = 1/m * X(:,1)' * (h-y);
grad(2:end,1) = 1/m * X(:,2:end)' * (h-y) + lambda/m * theta(2:end,1);

Next to Week 4.