Logistic Regression

We can generalize linear regression to the classification scenario by defining a different family of probability distributions. If we have two classes, class 0 and class 1, then we need only specify the probability of one of these classes. The probability of class 1 determines the probability of class 0, because these two values must add up to 1.

The normal distribution over real-valued numbers that we used for linear regression is parametrized in terms of a mean. Any value we supply for this mean is valid. A distribution over a binary variable is slightly more complicated, because its mean must always be between 0 and 1. One way to solve this problem is to use the logistic sigmoid function to squash the output of the linear function into the interval (0, 1) and interpret that value as a probability:

$p(y=1|x; \theta)=\sigma(\theta^Tx)$

This approach is known as logistic regression (a somewhat strange name, since we use the model for classification rather than regression).
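As a minimal sketch of the formula above, the following Python (NumPy) snippet computes $\sigma(\theta^Tx)$ for a batch of inputs. The names `sigmoid` and `predict_proba`, and the numbers in `theta` and `X`, are illustrative assumptions rather than anything fixed by the text.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), applied elementwise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    """Return p(y=1 | x; theta) for each row x of X."""
    return sigmoid(X @ theta)

# Hypothetical parameters and inputs, chosen only for illustration.
theta = np.array([0.5, -1.0])
X = np.array([[1.0, 0.0],
              [1.0, 2.0]])
probs = predict_proba(theta, X)  # one probability per row of X
```

Here the linear outputs are 0.5 and -1.5, so `probs` is roughly `[0.62, 0.18]`: both values lie strictly between 0 and 1, whatever the sign or size of the linear output.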

Cost Function

We cannot use the same cost function that we use for linear regression, because passing the linear output through the logistic function makes the squared-error cost wavy as a function of $\theta$, with many local optima. In other words, it will not be a convex function.

Instead, writing the hypothesis as $h_\theta(x)=\sigma(\theta^Tx)$, our cost function for logistic regression looks like:

$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)})$

$Cost(h_\theta(x),y)=-\log(h_\theta(x))$ if $y=1$

$Cost(h_\theta(x),y)=-\log(1-h_\theta(x))$ if $y=0$
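To see how the two cases behave, here is a small Python sketch of the per-example cost. The function name `cost_single` and the probabilities passed to it are hypothetical, picked only to show the trend.

```python
import numpy as np

def cost_single(h, y):
    """Cost for one example: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

# Hypothetical predicted probabilities for an example whose true label is 1.
confident_right = cost_single(0.99, 1)  # prediction close to the label
confident_wrong = cost_single(0.01, 1)  # prediction far from the label
```

The cost stays near zero when the predicted probability agrees with the label and grows without bound as a confident prediction turns out to be wrong.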

Simplified Cost Function and Gradient Descent

We can compress our cost function's two conditional cases into one case:

$Cost(h_\theta(x),y)=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$

Notice that when $y$ is equal to 1, the second term $-(1-y)\log(1-h_\theta(x))$ is zero and does not affect the result. When $y$ is equal to 0, the first term $-y\log(h_\theta(x))$ is zero and does not affect the result.

We can fully write out our entire cost function as follows:

$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]$

A vectorized implementation, where $g$ denotes the logistic sigmoid applied elementwise, is:

$h=g(X\theta)$

$J(\theta)=\frac{1}{m}(-y^T\log(h)-(1-y)^T\log(1-h))$
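The vectorized cost translates almost directly into NumPy. The sketch below assumes `X` is an m-by-n design matrix (with a leading column of ones if an intercept is wanted) and `y` a length-m vector of 0/1 labels; the function names and toy data are illustrative only.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid g(z), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = (1/m) * (-y^T log(h) - (1 - y)^T log(1 - h)), with h = g(X theta)."""
    m = y.shape[0]
    h = sigmoid(X @ theta)
    return (-(y @ np.log(h)) - ((1 - y) @ np.log(1 - h))) / m

# Hypothetical toy data: three examples, an intercept column plus one feature.
X = np.array([[1.0, 0.5],
              [1.0, -1.5],
              [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])

# With theta = 0 every prediction is 0.5, so the cost equals log(2).
J_zero = cost(np.zeros(2), X, y)
```

This direct form can produce `log(0)` when a prediction saturates at exactly 0 or 1; a production implementation would typically guard against that, which is left out here to keep the correspondence with the formula visible.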

Gradient Descent

Remember that the general form of gradient descent is:

Repeat:

$\{$

$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$

$\}$

We can work out the derivative part using calculus to get:

Repeat:

$\{$

$\theta_j:=\theta_j-\frac{\alpha}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$

$\}$
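The calculus step behind this update can be sketched as follows, writing $g$ for the logistic sigmoid so that $h_\theta(x)=g(\theta^Tx)$. The sigmoid satisfies:

$g'(z)=g(z)(1-g(z))$

so by the chain rule:

$\frac{\partial}{\partial\theta_j}h_\theta(x)=h_\theta(x)(1-h_\theta(x))x_j$

Differentiating the single-example cost and substituting this expression gives:

$\frac{\partial}{\partial\theta_j}Cost(h_\theta(x),y)=-y(1-h_\theta(x))x_j+(1-y)h_\theta(x)x_j=(h_\theta(x)-y)x_j$

Averaging over the $m$ training examples yields:

$\frac{\partial}{\partial\theta_j}J(\theta)=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$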

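Notice that this update rule has the same form as the one we used in linear regression; the difference is that $h_\theta(x)$ is now the sigmoid of $\theta^Tx$ rather than $\theta^Tx$ itself. We still have to simultaneously update all values in $\theta$.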

A vectorized implementation is:

$\theta:=\theta-\frac{\alpha}{m}X^T(g(X\theta)-y)$
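A minimal NumPy sketch of this vectorized update is shown below. The learning rate `alpha`, the iteration count `num_iters`, the zero initialization, and the toy data are all assumptions made for illustration, not values given in the text.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid g(z), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Repeat theta := theta - (alpha / m) * X^T (g(X theta) - y)."""
    m, n = X.shape
    theta = np.zeros(n)  # assumed starting point
    for _ in range(num_iters):
        gradient = X.T @ (sigmoid(X @ theta) - y) / m
        # Every component of theta is updated at once from the same gradient.
        theta = theta - alpha * gradient
    return theta

# Hypothetical toy data: intercept column plus one feature, two points per class.
X = np.array([[1.0, -2.0],
              [1.0, -1.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = gradient_descent(X, y)
predictions = (sigmoid(X @ theta) >= 0.5).astype(int)
```

Because the whole vector `theta` is replaced in a single assignment, the simultaneous-update requirement is satisfied automatically. Thresholding the fitted probabilities at 0.5 recovers the labels on this small example.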

Multiclass Classification: One-vs-all

Now we will approach the classification of data when we have more than two categories. Instead of $y\in\{0,1\}$ we will expand our definition so that $y\in\{0,1,\dots,n\}$.

Since $y\in\{0,1,\dots,n\}$, we divide our problem into $n+1$ (+1 because the index starts at 0) binary classification problems; in each one, we predict the probability that $y$ is a member of one of our classes.

$y\in\{0,1,\dots,n\}$

$h_\theta^{(0)}(x)=P(y=0|x;\theta)$

$h_\theta^{(1)}(x)=P(y=1|x;\theta)$

$\dots$

$h_\theta^{(n)}(x)=P(y=n|x;\theta)$

$prediction=\arg\max_{i}(h_\theta^{(i)}(x))$

We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.

Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y=i$. To make a prediction on a new $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.
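The one-vs-all procedure can be sketched in NumPy as follows. The helper names (`train_binary`, `train_one_vs_all`, `predict_one_vs_all`), the hyperparameters, and the three-cluster toy data are hypothetical, and plain gradient descent is used for each binary problem only because it is the method described above.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, alpha=0.1, num_iters=2000):
    """Fit one binary logistic regression classifier by gradient descent."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        theta = theta - (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))
    return theta

def train_one_vs_all(X, y, num_classes):
    """Train one classifier per class: class c versus all other classes."""
    return np.array([train_binary(X, (y == c).astype(float))
                     for c in range(num_classes)])

def predict_one_vs_all(all_theta, X):
    """Pick, for each row of X, the class whose classifier gives the highest probability."""
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)

# Hypothetical toy data: intercept column plus two features, three separated clusters.
X = np.array([[1.0, 0.0, 3.0], [1.0, 0.5, 2.5],
              [1.0, -3.0, -2.0], [1.0, -2.5, -2.5],
              [1.0, 3.0, -2.0], [1.0, 2.5, -2.5]])
y = np.array([0, 0, 1, 1, 2, 2])

all_theta = train_one_vs_all(X, y, num_classes=3)  # one row of parameters per class
predictions = predict_one_vs_all(all_theta, X)
```

Each row of `all_theta` holds the parameters of one binary classifier, so prediction reduces to a single matrix product followed by an argmax over classes. On this well-separated toy set the predictions match the training labels.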
