Optimizers - Gradient descent algorithms ( Part 1)

Hey everyone ! Welcome to my blog ! We are going to see the implementation of some of the basic optimiser algorithms in this blog. In machine learning, weights and biases are the learnable parameters (Theta θ) of the machine learning/deep learning models.Optimisers are algorithms that are used to modify the model parameters - weights and biases to minimise the loss function. Usually at the start of the training of a model, the weights and biases are initialised randomly. During training, the loss function is calculated and these parameters are updated by the optimisers so that the value of the loss function is minimum.

In this part 1, we will be learn and implement the following algorithms from scratch in python.

Gradient Descent
Stochastic Gradient Descent (SGD)
Batch Gradient Descent
Mini Batch Stochastic Gradient Descent

We are going to discuss these algorithms for a linear regression model with a single feature. Let us consider the equation

                    y = 2.5X + 1        (1)

where m = 2.5 and b = 1

Our X has single feature with 100 data points. We are going to populate X with random values between -1 and 1. We can then calculate y using the equation (1) shown above.

import numpy as np
import random
# set seed for reproducibilty
np.random.seed(1) 
num_samples = 100
X = np.random.uniform(-1.,1.,num_samples)
m = 2.5
b = 1
y = m*X +b

Let us see the visualisation of this function expressed by equation (1)

Screenshot 2021-11-10 at 11.37.24 AM.png

We are going to see the implementation of the various optimiser algorithms to calculate m and b. We will be using the Mean Squared Error (MSE) described by the equation in the image given below as the objective function (L). Basically we will be using these optimiser algorithms to minimise the loss function.

Gradient Descent :

In gradient descent, the gradient of the objective function (L) with respect to the parameters theta (θ ) is calculated and the the parameters are updated in the opposite direction of the gradient of the objective function. The learning rate α determines the size of the steps to be taken to reach the local minimum.

Screenshot 2021-11-10 at 11.52.28 AM.png

Based on how much data is used for calculating the gradient for weight updates, we have different variants which we will be discussing in detail

Batch Gradient Descent :

In Batch Gradient Descent, for each epoch, we calculate the gradient for the objective function for the entire dataset w.r.to the parameters. Therefore the update for the parameters happens once per epoch. Batch Gradient Descent is also called as the vanilla gradient descent.

For our MSE objective function, the gradients with respect to m and b are as shown below. If you are interested in understanding how to calculate the gradients , please click here

Python code for batch gradient descent

import numpy as np
from sklearn.metrics import mean_squared_error

def batch_gradient_descent(X, y, lr, epochs): 
    m, b = 0.33, 0.48 # initialising the parameters
    # log stores m and b values for differnt updates 
    # mse stores the error
    log, mse = [], [] # lists to store learning process 
    N = len(X) # number of samples

    for _ in range(epochs):               
        f = y - (m*X + b)   
        # Updating m and b
        m -= lr * (-2 * X.dot(f).sum() / N)
        b -= lr * (-2 * f.sum() / N)

        log.append((m, b))
        mse.append(mean_squared_error(y, (m*X + b)))        
    return m, b, log, mse
`

The plot below shows the MSE Batch Gradient descent for lr = 0.1 and epochs = 75.

Screenshot 2021-11-10 at 1.21.40 PM.png

NOTE : The parameters are changed after calculating the gradient on the whole dataset. In case of a huge dataset with. many features, this is very time consuming. It also requires significantly more memory to calculate the gradient on the whole dataset.

Stochastic Gradient Descent :

In Stochastic Gradient Descent, one sample is chosen at random from the whole set for an epoch. Gradients are calculated for that particular and sample and weights are updated.

Python code for Stochastic Gradient Descent

import numpy as np
from sklearn.metrics import mean_squared_error
def SGD(X, y, lr, epochs):
    m, b = 0.5, 0.5  # initial parameters
    log, mse = [], [] # lists to store learning process

    for _ in range(epochs):
        indexes = np.random.randint(0, len(X)) # random sample
        Xs = np.take(X, indexes)
        ys = np.take(y, indexes)
        N = len(X)
        f = ys - (m*Xs + b)

        # Updating parameters m and b
        m -= lr * (-2 * Xs*(f).sum() / N)
        b -= lr * (-2 * f.sum() / N)

        log.append((m, b))
        mse.append(mean_squared_error(y, m*X+b))        

    return m, b, log, mse

The plot below shows MSE vs epochs for lr = 0.85 and epochs = 500. We can observe that we were able to arrive at the local minima with a higher learning rate and more number of epochs for our example.

Screenshot 2021-11-10 at 1.21.33 PM.png

Mini-Batch gradient Descent :

In mini-batch gradient descent, updates are done for a mini-batch of samples. In our example we have 100 samples. So with a batch-size of 10, we have 100 updates in 10 epochs. Mini-Batch gradient descent is the algorithm of choice when training a neural network.

Python code for mini-batch gradient descent

def minibatchgd(X, y, lr, epochs, batch_size):
    m, b = 0.5, 0.5 # initial parameters
    log, mse = [], [] # lists to store learning process
    for _ in range(epochs):
        total_len = len(X)
        for i in range(0, total_len, batch_size):
            Xs = X[i:i+batch_size]
            ys = y[i:i+batch_size]
            N = len(Xs)
            f = ys - (m*Xs + b)
            # Updating parameters m and b
            m -= lr * (-2 * Xs.dot(f).sum() / N)
            b -= lr * (-2 * f.sum() / N)
            log.append((m, b))
            mse.append(mean_squared_error(y, m*X+b))        

    return m, b, log, mse

The plot for mini-batch gradient descent for epochs = 10 with a lr = 0.5 and batch_size = 10 is given below.

Mini batch gradient descent where the data is shuffled randomly and then a mini-batch of data is selected is also called as mini-batch SGD.

To sum up, for a training dataset of 'm' samples, if the

mini batch size = 'm'. - Batch Gradient Descent ( all data is taken for one epoch)
mini batch size = '1' - Stochastic Gradient Descent ( one sample from all the data for one epoch)
mini batch size = 'm'/100 - mini batch Gradient Descent ( m/100 samples taken for one epoch )

Important things to note:

The hyper parameters learning rate, epochs, batchsize have to be changed so that the MSE reaches to a minimum (ideally 0).
We can observe that the number of epochs in mini batch gradient descent are significantly lower than the batch gradient descent and stochastic gradient descent. This implies that the minima is reached with lesser training time.
Learning rate has to be chosen carefully and it is quite a challenging task.

In the next part , we will be discussing some sophisticated optimizer algorithms. The code is available at https://github.com/bhuvanakundumani/optimizers.git

References: https://ruder.io/optimizing-gradient-descent/ https://www.kdnuggets.com/2020/12/optimization-algorithms-neural-networks.html https://towardsdatascience.com/gradient-descent-from-scratch-e8b75fa986cc

Optimizers - Gradient descent algorithms ( Part 1)

Comments

More from this blog

Optimizers - RMSprop and Adam algorithms (Part 3)

Optimizers - Momentum and Nesterov momentum algorithms (Part 2)

Command Palette

Comments

More from this blog