Optimizers - Gradient descent algorithms ( Part 1)
I am a self taught NLP research engineer who believes in life long learning. I love exploring and experimenting new approaches and algorithms to solve challenging problems.
I am a fitness enthusiast and love travelling.
Hey everyone ! Welcome to my blog ! We are going to see the implementation of some of the basic optimiser algorithms in this blog. In machine learning, weights and biases are the learnable parameters (Theta θ) of the machine learning/deep learning models.Optimisers are algorithms that are used to modify the model parameters - weights and biases to minimise the loss function. Usually at the start of the training of a model, the weights and biases are initialised randomly. During training, the loss function is calculated and these parameters are updated by the optimisers so that the value of the loss function is minimum.
In this part 1, we will be learn and implement the following algorithms from scratch in python.
- Gradient Descent
- Stochastic Gradient Descent (SGD)
- Batch Gradient Descent
- Mini Batch Stochastic Gradient Descent
We are going to discuss these algorithms for a linear regression model with a single feature. Let us consider the equation
y = 2.5X + 1 (1)
where m = 2.5 and b = 1
Our X has single feature with 100 data points. We are going to populate X with random values between -1 and 1. We can then calculate y using the equation (1) shown above.
import numpy as np
import random
# set seed for reproducibilty
np.random.seed(1)
num_samples = 100
X = np.random.uniform(-1.,1.,num_samples)
m = 2.5
b = 1
y = m*X +b
Let us see the visualisation of this function expressed by equation (1)

We are going to see the implementation of the various optimiser algorithms to calculate m and b. We will be using the Mean Squared Error (MSE) described by the equation in the image given below as the objective function (L). Basically we will be using these optimiser algorithms to minimise the loss function.

Gradient Descent :
In gradient descent, the gradient of the objective function (L) with respect to the parameters theta (θ ) is calculated and the the parameters are updated in the opposite direction of the gradient of the objective function. The learning rate α determines the size of the steps to be taken to reach the local minimum.

Based on how much data is used for calculating the gradient for weight updates, we have different variants which we will be discussing in detail
Batch Gradient Descent :
In Batch Gradient Descent, for each epoch, we calculate the gradient for the objective function for the entire dataset w.r.to the parameters. Therefore the update for the parameters happens once per epoch. Batch Gradient Descent is also called as the vanilla gradient descent.
For our MSE objective function, the gradients with respect to m and b are as shown below. If you are interested in understanding how to calculate the gradients , please click here


Python code for batch gradient descent
import numpy as np
from sklearn.metrics import mean_squared_error
def batch_gradient_descent(X, y, lr, epochs):
m, b = 0.33, 0.48 # initialising the parameters
# log stores m and b values for differnt updates
# mse stores the error
log, mse = [], [] # lists to store learning process
N = len(X) # number of samples
for _ in range(epochs):
f = y - (m*X + b)
# Updating m and b
m -= lr * (-2 * X.dot(f).sum() / N)
b -= lr * (-2 * f.sum() / N)
log.append((m, b))
mse.append(mean_squared_error(y, (m*X + b)))
return m, b, log, mse
`
The plot below shows the MSE Batch Gradient descent for lr = 0.1 and epochs = 75.

NOTE : The parameters are changed after calculating the gradient on the whole dataset. In case of a huge dataset with. many features, this is very time consuming. It also requires significantly more memory to calculate the gradient on the whole dataset.
Stochastic Gradient Descent :
In Stochastic Gradient Descent, one sample is chosen at random from the whole set for an epoch. Gradients are calculated for that particular and sample and weights are updated.
Python code for Stochastic Gradient Descent
import numpy as np
from sklearn.metrics import mean_squared_error
def SGD(X, y, lr, epochs):
m, b = 0.5, 0.5 # initial parameters
log, mse = [], [] # lists to store learning process
for _ in range(epochs):
indexes = np.random.randint(0, len(X)) # random sample
Xs = np.take(X, indexes)
ys = np.take(y, indexes)
N = len(X)
f = ys - (m*Xs + b)
# Updating parameters m and b
m -= lr * (-2 * Xs*(f).sum() / N)
b -= lr * (-2 * f.sum() / N)
log.append((m, b))
mse.append(mean_squared_error(y, m*X+b))
return m, b, log, mse
The plot below shows MSE vs epochs for lr = 0.85 and epochs = 500. We can observe that we were able to arrive at the local minima with a higher learning rate and more number of epochs for our example.

Mini-Batch gradient Descent :
In mini-batch gradient descent, updates are done for a mini-batch of samples. In our example we have 100 samples. So with a batch-size of 10, we have 100 updates in 10 epochs. Mini-Batch gradient descent is the algorithm of choice when training a neural network.
Python code for mini-batch gradient descent
def minibatchgd(X, y, lr, epochs, batch_size):
m, b = 0.5, 0.5 # initial parameters
log, mse = [], [] # lists to store learning process
for _ in range(epochs):
total_len = len(X)
for i in range(0, total_len, batch_size):
Xs = X[i:i+batch_size]
ys = y[i:i+batch_size]
N = len(Xs)
f = ys - (m*Xs + b)
# Updating parameters m and b
m -= lr * (-2 * Xs.dot(f).sum() / N)
b -= lr * (-2 * f.sum() / N)
log.append((m, b))
mse.append(mean_squared_error(y, m*X+b))
return m, b, log, mse
The plot for mini-batch gradient descent for epochs = 10 with a lr = 0.5 and batch_size = 10 is given below.
Mini batch gradient descent where the data is shuffled randomly and then a mini-batch of data is selected is also called as mini-batch SGD.

To sum up, for a training dataset of 'm' samples, if the
- mini batch size = 'm'. - Batch Gradient Descent ( all data is taken for one epoch)
- mini batch size = '1' - Stochastic Gradient Descent ( one sample from all the data for one epoch)
- mini batch size = 'm'/100 - mini batch Gradient Descent ( m/100 samples taken for one epoch )
Important things to note:
- The hyper parameters learning rate, epochs, batchsize have to be changed so that the MSE reaches to a minimum (ideally 0).
- We can observe that the number of epochs in mini batch gradient descent are significantly lower than the batch gradient descent and stochastic gradient descent. This implies that the minima is reached with lesser training time.
- Learning rate has to be chosen carefully and it is quite a challenging task.
In the next part , we will be discussing some sophisticated optimizer algorithms. The code is available at https://github.com/bhuvanakundumani/optimizers.git
References: https://ruder.io/optimizing-gradient-descent/ https://www.kdnuggets.com/2020/12/optimization-algorithms-neural-networks.html https://towardsdatascience.com/gradient-descent-from-scratch-e8b75fa986cc