# Optimization Algorithms - A Brief Overview:-

In this blog post, I would like to give a brief overview of the existing gradient descent optimization algorithms that are available. There are lots of good resources available online. You can check them at the References section at the end of this post.

The existing TMVA submodule has always used gradient descent to update the parameters and minimize the cost of the neural networks. More advanced optimization methods can speed up learning and perhaps even get you to a better final value for the cost function. Having a good optimization algorithm can be the difference between waiting days vs. just a few hours to get a good result.

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. Gradient descent goes “downhill” on a cost function J. Think of it as trying to do this: There are three variants of gradient descent, which depends on how much data you use to cacluate the gradients and perform an update. They are as follows,

Vanilla gradient descent, also known as batch gradient descent, computes the gradient of the cost function w.r.t. to the parameters $\theta$ for the entire training dataset. It supports maximum vectorization, but if the data is large, it cannot fit into the memory.

Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example of the training dataset. It cannot exploit vectorization, since it has to iterate through all the training examples and make an update for each training example. It also shows a lot of fluctuations before converging to the solution.

Mini-batch gradient descent finally takes the best of both approaches and performs an update for every mini-batch of n training examples. The size of the mini-batch is usually in the power of 2 like 64 or 256, but can vary depending on the applications. It exploits vectorization to some extent and its update is also fast. It is the most preferred way of update among these variants.

Here, I’ll discuss about the various gradient descent optimization algorithms that are proven to work best in most of the applications.

### 1) Momentum based update:

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations as can be seen in the below image. It does this by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector.

SGD without Momentum: SGD with Momentum: The momentum update is done as follows,

The usual value of $\gamma$ is 0.9. ie ( $\gamma$ < 1 )

### 2) Nesterov accelerated Momentum:

Ilya Sutskever suggested a new form of momentum that often works better. It is inspired by the nesterov method for optimizing convex functions. First, make a big jump in the direction of the previous accumulated gradient. Then, measure the gradient where you end up and make correction. Its better to correct a mistake after you have made it. :P

Nesterov Update: Here, brown vector = jump, red vector = correction, green vector = accumulated gradient, blue vector = standard momentum.

The Nesterov update is done as follows,

The usual value of $\gamma$ is 0.9. ie ( $\gamma$ < 1 ) and it depends on the application.

AdaGrad is an optimization method that allows different step sizes for different features. It increases the influence of rare but informative features i.e. It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data.

where,
$G_{t}$ - Sum of the squares of the past gradients w.r.t all parameters $\theta$ along its diagonal.
$\odot$ - Matrix-vector dot product.
$g_{t}$ - Gradient at time step $t$.
$\eta$ - Learning rate.
$\epsilon$ - Smoothing term that avoids division by zero and is usually of the order of $1e-8$.

Adadelta is an extension of Adagrad that tries to reduce the monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta stores only the window of accumulated past gradients to some fixed window size $w$.

And Instead of storing all the past gradients of window size w, it stores the decaying average of the past squared gradients. The running average $E[g^2]_t$ at time step $t$ is calculated as,

The root mean squared (RMS) error of the gradient is therefore,

And also the decaying average of the past squared updates is computed as,

The root mean squared (RMS) error of the updates is therefore,

Since $RMS[\Delta \theta]_{t}$ is unknown, we approximate it with the RMS of parameter updates until the previous time step i.e. $RMS[\Delta \theta]_{t-1}$

Here, the parameters usually take the default value,
$\gamma$ - Usually around 0.9.
$\epsilon$ - Smoothing term that avoids division by zero and is usually of the order of $1e-8$.

### 5) RMSprop:

RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton. The main idea is “Divide the gradient by a running average of its recent magnitude”. It is similar to Adadelta but it is developed independently to overcome the disadvantages of the Adagrad algorithm.

The RMSprop update is done as follows,

where,
$\gamma$ - Usually around 0.9.
$\eta$ - Learning rate, usually around 0.001.
$E[g^2]_t$ - Decaying average of the past squared gradients at time step $t$.

Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. It stores both the decaying average of the past gradients $m_t$, similar to momentum and also the decaying average of the past squared gradients $v_t$, similar to RMSprop and Adadelta. Thus, it combines the advantages of both the methods. Adam is the default choice of the optimizer for any application in general.

The decaying average of the past gradients $m_t$ and the past squared gradients $v_t$ is computed as follows,

And since these $m_t$ and $v_t$ are initialized with zeros, they are biased towards zero, especially during the initial time steps. Thus, to avoid these biases, the bias corrected versions of them are computed as follows,

Thus, the Adam update is as follows,

where,
$\beta_1$ - usually 0.9.
$\beta_2$ - usually 0.999.
$\eta$ - Learning rate.
$\epsilon$ - usually of the order of $1e-8$.

Adamax is the generalization of the Adam algorithm to the $\ell_{\infty}$ norm. Kingma and Ba show that $v_t$ with $\ell_{\infty}$ converges to the more stable value.

The infinity norm-constrained $v_t$ is denoted as $u_t$ and is computed as follows,

Here, since $u_t$ relies on the max operation, it is not biased towards zero unlike the ones in the Adam algorithm.

Thus, the Adamax update is as follows,

where,
$\beta_1$ - usually 0.9.
$\beta_2$ - usually 0.999.
$\eta$ - Learning rate, usually 0.002.

Nadam is similar to Adam which is a combination of the momentum and the RMSprop. Nadam can be viewed as a combination of the nesterov accelerated momentum and the RMSprop. Here, we do not need to modify the $\hat{v}_t$. The momentum vector equations are as follows,

Expanding the last equation gives,

But since,

we get,

We can now add Nesterov momentum just as we did previously by simply replacing this bias-corrected estimate of the momentum vector of the previous time step $\hat{m}_{t−1}$ with the bias-corrected estimate of the current momentum vector $\hat{m}_t$.

Thus, the Nadam update is as follows,

### Summary of the various update equations : 3) RMSProp