Optimizers - RMSprop and Adam algorithms (Part 3)
I am a self taught NLP research engineer who believes in life long learning. I love exploring and experimenting new approaches and algorithms to solve challenging problems.
I am a fitness enthusiast and love travelling.
Welcome to the third part on optimizers where we will be discussing RMSprop and Adam algorithms in detail. If you want a quick review of vanilla gradient descent algorithms and its variants, please read about it in part1 . Momentum and Nesterov momentum are discussed in part2.
RMSprop :
In RMS prob we calculate the exponentially weighted averages of the square of the gradients. The RMSprop updates are calculated as given in the equations below.

We want the movements in the vertical direction to be minimal and movements in the horizontal direction to be much larger in order to dampen the oscillations. Refer the figure given below.
( image source : https://www.coursera.org/learn/deep-neural-network/lecture/BhJlm/rmsprop)
Referring to the RMSprop equations, due to oscillations more in the vertical directions, db is larger than dw, assuming b is in the vertical direction and w is in the horizontal direction as shown in the figure above. Hence, Vdb , is relatively larger than Vdw . Since denominator Vdb is big, it results in a small update for b. Thus it dampens the oscillations in the vertical direction. At the same time, the updates in horizontal direction are divided by small number and hence the updates are much bigger for w than b. To ensure numerical stability, we add epsilon in the denominator.
RMSprop allows us to have a larger learning rate which helps in faster learning as well.
Python code for RMSprop
# we are not using nesterov momentum
def minibatchsgd_rmsprop(X, y, lr, epochs, batch_size, momentum, epsilon):
m, b = 0.5, 0.5 # initial parameters
log, mse = [], [] # lists to store learning process
v_m = 0
v_b = 0
for epoch in range(epochs):
total_len = len(X)
for i in range(0, total_len, batch_size):
Xs = X[i:i+batch_size]
ys = y[i:i+batch_size]
N = len(Xs)
f = ys - (m*Xs + b)
gradient_m = (-2 * Xs.dot(f).sum() / N)
gradient_b = (-2 * f.sum() / N)
v_m = (momentum*v_m) + (1 - momentum) * (gradient_m **2)
v_b = (momentum*v_b) + (1 - momentum) * (gradient_b **2)
m = m - ( (lr * gradient_m)/( v_m ** 0.5 + epsilon))
b = b - ( (lr * gradient_b)/( v_m ** 0.5 + epsilon))
log.append((m, b))
mse.append(mean_squared_error(y, (m*X + b)))
return m, b, log, mse
The plot MSE vs total updates for momentum = 0.9, lr = 0.05 (learning rate is not scaled by (1-momentum), epochs = 2, batch_size = 10 and epsilon = 1e-08 is as given below.

Adam :
RMSprop and Adam have been a choice of algorithms for most usecases. In Adam, decaying averages of past and past squared gradients are calculated using β1 and β2 as shown below.

gt is the gradient of the objective function. Adam combines momentum and RMSprop. Equation 1 is from momentum and Equation 2 is from RMSprop.
Initially, mt-1 and vt-1 are set to 0. Since mt-1 and vt-1 are initialised to zero, they are biased towards zero in the initial steps. Hence they add a bias correction for mt and vt as shown below. To know more about bias correction please check out this video .

Then the weights are updates using the update rule

Python code for Adam
def minibatchsgd_adam(X, y, lr, epochs, batch_size, beta1, beta2, epsilon):
m, b = 0.5, 0.5 # initial parameters
log, mse = [], [] # lists to store learning process
m_m = 0
m_b = 0
v_m = 0
v_b = 0
number_of_iterations = 0
total_len = len(X)
for epoch in range(epochs):
for i in range(0, total_len, batch_size):
number_of_iterations += 1
Xs = X[i:i+batch_size]
ys = y[i:i+batch_size]
N = len(Xs)
f = ys - (m*Xs + b)
gradient_m = (-2 * Xs.dot(f).sum() / N)
gradient_b = (-2 * f.sum() / N)
m_m = (beta1*m_m) + (1-beta1)*gradient_m
m_b = (beta1*m_b) + (1-beta1)*gradient_b
v_m = (beta2*v_m) + (1 - beta2) * (gradient_m **2)
v_b = (beta2*v_b) + (1 - beta2) * (gradient_b **2)
m_m_hat = m_m / (1- (beta1**number_of_iterations))
m_b_hat = m_b / (1- (beta1**number_of_iterations))
v_m_hat = v_m / (1-(beta2**number_of_iterations))
v_b_hat = v_b / (1-(beta2**number_of_iterations))
m = m - ( (lr * m_m_hat)/ ( (v_m_hat** 0.5) + epsilon))
b = b - ( (lr * m_b_hat)/( (v_b_hat** 0.5) + epsilon))
log.append((m, b))
mse.append(mean_squared_error(y, (m*X + b)))
return number_of_iterations, m, b, log, mse
The plot for total updates vs MSE for beta1 = 0.9, beta2 = 0.999, lr = 0.05 , epochs = 5, batch_size = 5 epsilon = 1e-08 is given below.

The main aim of this blog is to implement the algorithms in Python. Since our dataset is a toy dataset, we are not seeing a significant increase in the training time. Thank you for reading my blog on optimizers. The code is available at https://github.com/bhuvanakundumani/optimizers.git.
References :
https://ruder.io/optimizing-gradient-descent/
https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c
https://www.coursera.org/lecture/deep-neural-network/rmsprop-BhJlm
https://deepnotes.io/sgd-momentum-adaptive