Optimizers - RMSprop and Adam algorithms (Part 3)

Welcome to the third part on optimizers where we will be discussing RMSprop and Adam algorithms in detail. If you want a quick review of vanilla gradient descent algorithms and its variants, please read about it in part1 . Momentum and Nesterov momentum are discussed in part2.

RMSprop :

In RMS prob we calculate the exponentially weighted averages of the square of the gradients. The RMSprop updates are calculated as given in the equations below.

We want the movements in the vertical direction to be minimal and movements in the horizontal direction to be much larger in order to dampen the oscillations. Refer the figure given below.

( image source : https://www.coursera.org/learn/deep-neural-network/lecture/BhJlm/rmsprop)

Referring to the RMSprop equations, due to oscillations more in the vertical directions, db is larger than dw, assuming b is in the vertical direction and w is in the horizontal direction as shown in the figure above. Hence, V_db, is relatively larger than V_dw. Since denominator V_db is big, it results in a small update for b. Thus it dampens the oscillations in the vertical direction. At the same time, the updates in horizontal direction are divided by small number and hence the updates are much bigger for w than b. To ensure numerical stability, we add epsilon in the denominator.

RMSprop allows us to have a larger learning rate which helps in faster learning as well.

Python code for RMSprop

# we are not using nesterov momentum 
def minibatchsgd_rmsprop(X, y, lr, epochs, batch_size, momentum, epsilon):
    m, b = 0.5, 0.5 # initial parameters
    log, mse = [], [] # lists to store learning process
    v_m = 0
    v_b = 0
    for epoch in range(epochs):
        total_len = len(X)
        for i in range(0, total_len, batch_size):
            Xs = X[i:i+batch_size]
            ys = y[i:i+batch_size]            
            N = len(Xs)
            f = ys - (m*Xs + b)
            gradient_m = (-2 * Xs.dot(f).sum() / N)
            gradient_b =  (-2 * f.sum() / N)

            v_m = (momentum*v_m) + (1 - momentum) * (gradient_m **2)
            v_b =  (momentum*v_b) + (1 - momentum) * (gradient_b **2)

            m = m - ( (lr * gradient_m)/( v_m ** 0.5 + epsilon))
            b = b - ( (lr * gradient_b)/( v_m ** 0.5 + epsilon))
            log.append((m, b))
            mse.append(mean_squared_error(y, (m*X + b)))        
    return m, b, log, mse

The plot MSE vs total updates for momentum = 0.9, lr = 0.05 (learning rate is not scaled by (1-momentum), epochs = 2, batch_size = 10 and epsilon = 1e-08 is as given below.

Adam :

RMSprop and Adam have been a choice of algorithms for most usecases. In Adam, decaying averages of past and past squared gradients are calculated using β₁ and β₂ as shown below.

g_t is the gradient of the objective function. Adam combines momentum and RMSprop. Equation 1 is from momentum and Equation 2 is from RMSprop.

Initially, m_t-1 and v_t-1 are set to 0. Since m_t-1 and v_t-1 are initialised to zero, they are biased towards zero in the initial steps. Hence they add a bias correction for m_t and v_t as shown below. To know more about bias correction please check out this video .

Then the weights are updates using the update rule

Python code for Adam

def minibatchsgd_adam(X, y, lr, epochs, batch_size, beta1, beta2, epsilon):
    m, b = 0.5, 0.5 # initial parameters
    log, mse = [], [] # lists to store learning process
    m_m = 0
    m_b = 0
    v_m = 0
    v_b = 0
    number_of_iterations = 0
    total_len = len(X)

    for epoch in range(epochs):
        for i in range(0, total_len, batch_size):
            number_of_iterations += 1
            Xs = X[i:i+batch_size]
            ys = y[i:i+batch_size]            
            N = len(Xs)
            f = ys - (m*Xs + b)
            gradient_m = (-2 * Xs.dot(f).sum() / N)
            gradient_b = (-2 * f.sum() / N)

            m_m = (beta1*m_m) + (1-beta1)*gradient_m 
            m_b = (beta1*m_b) + (1-beta1)*gradient_b
            v_m = (beta2*v_m) + (1 - beta2) * (gradient_m **2)
            v_b =  (beta2*v_b) + (1 - beta2) * (gradient_b **2)

            m_m_hat = m_m / (1- (beta1**number_of_iterations))
            m_b_hat = m_b / (1- (beta1**number_of_iterations)) 
            v_m_hat = v_m / (1-(beta2**number_of_iterations))
            v_b_hat = v_b / (1-(beta2**number_of_iterations))


            m = m - ( (lr * m_m_hat)/ ( (v_m_hat** 0.5) + epsilon))
            b = b - ( (lr * m_b_hat)/( (v_b_hat** 0.5) + epsilon))

            log.append((m, b))
            mse.append(mean_squared_error(y, (m*X + b)))     

    return number_of_iterations, m, b, log, mse

The plot for total updates vs MSE for beta1 = 0.9, beta2 = 0.999, lr = 0.05 , epochs = 5, batch_size = 5 epsilon = 1e-08 is given below.

The main aim of this blog is to implement the algorithms in Python. Since our dataset is a toy dataset, we are not seeing a significant increase in the training time. Thank you for reading my blog on optimizers. The code is available at https://github.com/bhuvanakundumani/optimizers.git.

References :

https://ruder.io/optimizing-gradient-descent/

https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c

https://www.coursera.org/lecture/deep-neural-network/rmsprop-BhJlm

https://deepnotes.io/sgd-momentum-adaptive

Optimizers - RMSprop and Adam algorithms (Part 3)

Comments

More from this blog

Optimizers - Momentum and Nesterov momentum algorithms (Part 2)

Optimizers - Gradient descent algorithms ( Part 1)

Command Palette

Comments

More from this blog