Logistic Regression
Implementation from scratch
Data set
We will use the well known Iris data set. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant. To simplify things, we take just the first two feature columns. Also, the two non-linearly separable classes are labeled with the same category, ending up with a binary classification problem.
iris = sklearn.datasets.load_iris()
X = iris.data[:, :2]
y = (iris.target != 0) * 1

Algorithm
Given a set of inputs X, we want to assign them to one of two possible categories (0 or 1). Logistic regression models the probability that each input belongs to a particular category.
Hypothesis
A function takes inputs and returns outputs. To generate probabilities, logistic regression uses a function that gives outputs between 0 and 1 for all values of X. There are many functions that meet this description, but the used in this case is the logistic function. From here we will refer to it as sigmoid.

def sigmoid(z):
return 1 / (1 + np.exp(-z))
z = np.dot(X, theta)
h = sigmoid(z)
Loss function
Functions have parameters/weights (represented by theta in our notation) and we want to find the best values for them. To start we pick random values and we need a way to measure how well the algorithm performs using those random weights. That measure is computed using the loss function, defined as:
def loss(h, y):
return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
Gradient descent
Our goal is to minimize the loss function and the way we have to achive it is by increasing/decreasing the weights, i.e. fitting them. The question is, how do we know what parameters should be biggers and what parameters should be smallers? The answer is given by the derivative of the loss function with respect to each weight. It tells us how loss would change if we modified the parameters.
gradient = np.dot(X.T, (h - y)) / y.shape[0]
Then we update the weights by substracting to them the derivative times the learning rate.
lr = 0.01
theta -= lr * gradient
We should repeat this steps several times until we reach the optimal solution.
Predictions
By calling the sigmoid function we get the probability that some input x belongs to class 1. Let’s take all probabilities ≥ 0.5 = class 1 and all probabilities < 0 = class 0. This threshold should be defined depending on the business problem we were working.
def predict_probs(X, theta):
return sigmoid(np.dot(X, theta))
def predict(X, theta, threshold=0.5):
return predict_probs(X, theta) >= threshold
Putting it all together
class LogisticRegression:
def __init__(self, lr=0.01, num_iter=100000, fit_intercept=True, verbose=False):
self.lr = lr
self.num_iter = num_iter
self.fit_intercept = fit_intercept
def __add_intercept(self, X):
intercept = np.ones((X.shape[0], 1))
return np.concatenate((intercept, X), axis=1)
def __sigmoid(self, z):
return 1 / (1 + np.exp(-z))
def __loss(self, h, y):
return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
def fit(self, X, y):
if self.fit_intercept:
X = self.__add_intercept(X)
# weights initialization
self.theta = np.zeros(X.shape[1])
for i in range(self.num_iter):
z = np.dot(X, self.theta)
h = self.__sigmoid(z)
gradient = np.dot(X.T, (h - y)) / y.size
self.theta -= self.lr * gradient
if(self.verbose == True and i % 10000 == 0):
z = np.dot(X, self.theta)
h = self.__sigmoid(z)
print(f'loss: {self.__loss(h, y)} \t')
def predict_prob(self, X):
if self.fit_intercept:
X = self.__add_intercept(X)
return self.__sigmoid(np.dot(X, self.theta))
def predict(self, X, threshold):
return self.predict_prob(X) >= threshold
Evaluation
model = LogisticRegression(lr=0.1, num_iter=300000)
%time model.fit(X, y)
CPU times: user 13.8 s, sys: 84 ms, total: 13.9 s
Wall time: 13.8 s
preds = model.predict(X)
# accuracy
(preds == y).mean()
#> 1.0

Picking a learning rate = 0.1 and number of iterations = 300000 the algorithm classified all instances successfully. 13.8 seconds were needed. These are the resulting weights:
model.theta
array([-25.96818124, 12.56179068, -13.44549335])
LogisticRegression from sklearn:
model = LogisticRegression(C=1e20)
%time model.fit(X, y)
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 854 µs
preds = model.predict(X)
# accuracy
(preds == y).mean()
#> 1.0
model.intercept_, model.coef_
(array([-80.62725491]), array())
If we trained our implementation with smaller learning rate and more iterations we would find approximately equal weights. But the more remarkably difference is about training time, sklearn is order of magnitude faster. Anyway, is not the intention to put this code on production, this is just a toy exercice with teaching objectives.
Further steps could be the addition of l2 regularization and multiclass classification.