AdaGrad is an optimization method that allows different step sizes for different features. It increases the influence of rare but informative features i.e. It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data.

Thus, the update is implemented as follows, ( similar to the tensorflow implementation )

Vt = Vt-1 + currentSquaredGradients
theta = theta - learningRate * currentGradients / (sqrt(Vt + epsilon))


So, one step of update is performed as,

I used the same unit tests approach as for SGD optimizer. Have a look at Testing the SGD optimizer post.

The above figure shows the convergence of the training and testing errors for the Adagrad Optimizer during the unit tests.

Adadelta is an extension of Adagrad that tries to reduce the monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta stores only the window of accumulated past gradients to some fixed window size $w$.

Thus, the update is implemented as follows, ( similar to the tensorflow implementation )

Vt = rho * Vt-1 + (1-rho) * currentSquaredGradients
theta = theta - learningRate * currentUpdates
Wt = rho * Wt-1 + (1-rho) * currentSquaredUpdates



So, one step of update is performed as,

I used the same unit tests approach as for SGD optimizer. Have a look at Testing the SGD optimizer post.

The above figure shows the convergence of the training and testing errors for the Adadelta Optimizer during the unit tests.

# SGD Optimizer - Implementation and Testing:-

In this blog post, I will be explaning the implementation of the SGD Optimizer with and without momentum approach. I will also be explaning the methodology I used for testing the correctness of the optimizer.

## SGD Optimizer:

Stochastic Gradient Descent is the one of the basic optimization algorithms that is used for training the Deep Neural Networks. I have implemented the SGD Optimization algorithm with and without momentum method. The finalized class diagram for the VOptimizer class and the TSGD class, which is derived from the VOptimizer class is shown below,

Here, Step() function is implemented in the base class VOptimizer. And the UpdateWeights() and UpdateBiases() functions are pure virtual functions. So other optimizer classes extending from the base class VOptimizer, for example: TSGD class must implement the UpdateWeights() and UpdateBiases() functions. The fPastWeightGradients and fPastBiasGradients store the accumulation of the past weight and past bias gradients respectively.

## Momentum Update:

With Stochastic Gradient Descent we don’t compute the exact derivative of our loss function. Instead, we’re estimating it on a small batch. This means we’re not always going in the optimal direction, because our derivatives are ‘noisy’. So, exponentially weighed averages can provide us a better estimate which is closer to the actual derivative than our noisy calculations. This is one reason why momentum might work better than classic SGD.

The other reason lies in ravines. Ravine is an area, where the surface curves much more steeply in one dimension than in another. Ravines are common near local minimas in deep learning and SGD has troubles getting out of them. SGD will tend to oscillate across the narrow ravine since the negative gradient will point down one of the steep sides rather than along the ravine towards the optimum. Momentum helps accelerate gradients in the right direction.

Thus, the momentum update is implemented as follows, ( similar to the tensorflow implementation )

accumulation = momentum * accumulation + gradient
variable -= learning_rate * accumulation


So, one step of update is performed as,

## Testing the Optimizer:

And now, for testing the optimizer, there is no exact methodology that can be used. One possible approach would be to test the convergence of the training and testing error during the training procedure. So the unit tests is created as follows,

Let, X = Random Matrix ( nSamples x nFeatures ),
K = Random Matrix ( nFeatures x nOutput ),
Y = X * K ( nSamples x nOutput ) ( Generated one ).

I created a simple 3 layer DeepNet with the following architecture,

I created the trainingData and testingData in a similar manner as described above. And trained my DeepNet to learn this linear function mapping i.e. Y = X * K. Now for testing, one method is to observe the convergence of the training and testing error. They converged very well in a quite number of iterations as below,

The above figures shows the convergence of the training and testing errors for the SGD optimizer without and with momentum during the unit tests.

Another method is to create a identity matrix I = Identity Matrix ( batchSize x nFeatures ) and give this as Input to the DeepNet and forward it and get the output at the last layer. Let this output be Y’ ( batchSize x nOutput ).

Now, Since the DeepNet is trying to mimic the function Y = X * K, this output Y’ should be equal to K. For this to be true, there is one constrain to be satisfied that the batchSize and nFeatures should be equal so that we can construct the Identity Matrix I as a square matrix with diagonal elements being equal to 1.0. And I got the following results by comparing the Y’ with K.

# Optimization Modules - Class Structure:-

In this blog post, I will be describing the class structure that can be potentially used to include the Optimizers in the TMVA submodule. The already existing API needs to be modified a bit to account for the Optimizers. I will be describing the class structure and also the changes that needs to be done to the existing API.

Atlast, after discussing with the mentors, I finally found the correct design solution for including the Optimizers in the existing framework and the design seems to be feasible.

## Current workflow of one step of Optimizer’s update:

In the current workflow, the class MethodDL creates an instance of the class TDLGradientDescent as minimizer and it is initialized. For performing one step of the update, the minimizer’s Step() function is called. The minimizer’s Step() function in turn calls the class TDeepNet’s Forward(), Backward() and Update() functions. The class TDeepNet’s Update() function iterates through its layers and calls the class VGeneralLayer’s Update() function for each layer. And the class VGeneralLayer’s Update() function is the one which actually does the update of the weights and biases with the corresponding weightGradients and biasGradients through its UpdateWeights() and UpdateBiases() functions.

## Design Issues and its solutions:

1) How to identify the type of update in the class VGeneralLayer like SGD or Adam etc ?

The current implementation by default performs only the schocastic gradient descent update. And there is no way of identifying the type of update depending on the type of Optimizer. So the solution would be to pass the reference to the Optimizer object from the class MethodDL down to the class VGeneralLayer.

2) Where to store the additional variables like sum of past gradients and other parameters specific to each Optimizer ?

Since these additional variables are needed for performing the updates, they need to be stored in the same class which actually performs the update. The current implementation actually performs the update in the class VGeneralLayer. So as per the current implementation, if we store all the additional variables in the class VGeneralLayer, that would look clumsy and strange since that is not the responsibility of the class VGeneralLayer. So I have decided to perform the actual update in the Optimizer class and store additional variables in it. This would look a bit cleaner.

3) Should we need to move the update function from the class GeneralLayer to Optimizer class ?

Yes, we should move the update function to the Optimizer class since its the responsibility of the Optimizer to perform such an update.

4) Do we really need to test the convergence in the Optimizer class itself or move it to the train() method of class MethodDL ?

The current implementation actually has the test for convergence in the Optimizer class itself. But testing for convergence is not the responsibility of the Optimizer class but is the responsibility of the training procedure. So the better option is to move the test for convergence to the train() method of class MethodDL.

## Modified workflow of one step of Optimizer’s update:

In the modified workflow, the class MethodDL creates an instance of the class TSGD ( i.e. specific type of optimizer as mentioned in the option ) as minimizer and it is initialized. Now for performing one step of the update, we actually pass the reference to the minimizer object to the class DeepNet’s Update() function. The class DeepNet’s Update() function iterates through each of its layer objects and pass the same reference of the minimizer object to the class VGeneralLayer’s Update() function. And the class VGeneralLayer’s Update() function now performs the update with the use of minimizer’s Step() function. So depending on the type of minimizer object used to perform the update, the corresponding update is performed. Here, the class TSGD’s Step() function is the one which actually does the update of the weights and biases with the corresponding weightGradients and biasGradients through its UpdateWeights() and UpdateBiases() functions.

## API Changes:

### In class MethodDL :

1) Modify the Struct TTrainingSettings to include the option for optimizer.

2) Create an instance of the particular type of optimizer in the Train() method based on the option specified. If no option is specified, default choice would be to use Adam optimizer.

### In file Functions.h :

1) Create a enum class EOptimizer with various optimizers.

### Other files:

1) Create a base class TOptimizer with the basic functions. ( Refer to the class diagram below. )

2) Create various classes like Class TSGD, Class TAdam etc for each optimizer extending from the base Class TOptimizer.

3) Re-implement the existing Class TDLGradientDescent with the new design as Class TSGD.

Final Design:

I hope that this post helps to get a clear understanding of the new design and its workflow. And atlast I can start coding :D . So my next goal is to re-implement the basic stochastic gradient descent with my new design along with unit tests and make sure everything works good. So, from my next post, I’ll describe more related to coding :P

# Optimization Algorithms - A Brief Overview:-

In this blog post, I would like to give a brief overview of the existing gradient descent optimization algorithms that are available. There are lots of good resources available online. You can check them at the References section at the end of this post.

The existing TMVA submodule has always used gradient descent to update the parameters and minimize the cost of the neural networks. More advanced optimization methods can speed up learning and perhaps even get you to a better final value for the cost function. Having a good optimization algorithm can be the difference between waiting days vs. just a few hours to get a good result.

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. Gradient descent goes “downhill” on a cost function J. Think of it as trying to do this:

There are three variants of gradient descent, which depends on how much data you use to cacluate the gradients and perform an update. They are as follows,

Vanilla gradient descent, also known as batch gradient descent, computes the gradient of the cost function w.r.t. to the parameters $\theta$ for the entire training dataset. It supports maximum vectorization, but if the data is large, it cannot fit into the memory.

Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example of the training dataset. It cannot exploit vectorization, since it has to iterate through all the training examples and make an update for each training example. It also shows a lot of fluctuations before converging to the solution.

Mini-batch gradient descent finally takes the best of both approaches and performs an update for every mini-batch of n training examples. The size of the mini-batch is usually in the power of 2 like 64 or 256, but can vary depending on the applications. It exploits vectorization to some extent and its update is also fast. It is the most preferred way of update among these variants.

Here, I’ll discuss about the various gradient descent optimization algorithms that are proven to work best in most of the applications.

### 1) Momentum based update:

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations as can be seen in the below image. It does this by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector.

SGD without Momentum: SGD with Momentum:

The momentum update is done as follows,

The usual value of $\gamma$ is 0.9. ie ( $\gamma$ < 1 )

### 2) Nesterov accelerated Momentum:

Ilya Sutskever suggested a new form of momentum that often works better. It is inspired by the nesterov method for optimizing convex functions. First, make a big jump in the direction of the previous accumulated gradient. Then, measure the gradient where you end up and make correction. Its better to correct a mistake after you have made it. :P

Nesterov Update:

Here, brown vector = jump, red vector = correction, green vector = accumulated gradient, blue vector = standard momentum.

The Nesterov update is done as follows,

The usual value of $\gamma$ is 0.9. ie ( $\gamma$ < 1 ) and it depends on the application.

AdaGrad is an optimization method that allows different step sizes for different features. It increases the influence of rare but informative features i.e. It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data.

where,
$G_{t}$ - Sum of the squares of the past gradients w.r.t all parameters $\theta$ along its diagonal.
$\odot$ - Matrix-vector dot product.
$g_{t}$ - Gradient at time step $t$.
$\eta$ - Learning rate.
$\epsilon$ - Smoothing term that avoids division by zero and is usually of the order of $1e-8$.

Adadelta is an extension of Adagrad that tries to reduce the monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta stores only the window of accumulated past gradients to some fixed window size $w$.

And Instead of storing all the past gradients of window size w, it stores the decaying average of the past squared gradients. The running average $E[g^2]_t$ at time step $t$ is calculated as,

The root mean squared (RMS) error of the gradient is therefore,

And also the decaying average of the past squared updates is computed as,

The root mean squared (RMS) error of the updates is therefore,

Since $RMS[\Delta \theta]_{t}$ is unknown, we approximate it with the RMS of parameter updates until the previous time step i.e. $RMS[\Delta \theta]_{t-1}$

Here, the parameters usually take the default value,
$\gamma$ - Usually around 0.9.
$\epsilon$ - Smoothing term that avoids division by zero and is usually of the order of $1e-8$.

### 5) RMSprop:

RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton. The main idea is “Divide the gradient by a running average of its recent magnitude”. It is similar to Adadelta but it is developed independently to overcome the disadvantages of the Adagrad algorithm.

The RMSprop update is done as follows,

where,
$\gamma$ - Usually around 0.9.
$\eta$ - Learning rate, usually around 0.001.
$E[g^2]_t$ - Decaying average of the past squared gradients at time step $t$.

Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. It stores both the decaying average of the past gradients $m_t$, similar to momentum and also the decaying average of the past squared gradients $v_t$, similar to RMSprop and Adadelta. Thus, it combines the advantages of both the methods. Adam is the default choice of the optimizer for any application in general.

The decaying average of the past gradients $m_t$ and the past squared gradients $v_t$ is computed as follows,

And since these $m_t$ and $v_t$ are initialized with zeros, they are biased towards zero, especially during the initial time steps. Thus, to avoid these biases, the bias corrected versions of them are computed as follows,

Thus, the Adam update is as follows,

where,
$\beta_1$ - usually 0.9.
$\beta_2$ - usually 0.999.
$\eta$ - Learning rate.
$\epsilon$ - usually of the order of $1e-8$.

Adamax is the generalization of the Adam algorithm to the $\ell_{\infty}$ norm. Kingma and Ba show that $v_t$ with $\ell_{\infty}$ converges to the more stable value.

The infinity norm-constrained $v_t$ is denoted as $u_t$ and is computed as follows,

Here, since $u_t$ relies on the max operation, it is not biased towards zero unlike the ones in the Adam algorithm.

Thus, the Adamax update is as follows,

where,
$\beta_1$ - usually 0.9.
$\beta_2$ - usually 0.999.
$\eta$ - Learning rate, usually 0.002.

Nadam is similar to Adam which is a combination of the momentum and the RMSprop. Nadam can be viewed as a combination of the nesterov accelerated momentum and the RMSprop. Here, we do not need to modify the $\hat{v}_t$. The momentum vector equations are as follows,

Expanding the last equation gives,

But since,

we get,

We can now add Nesterov momentum just as we did previously by simply replacing this bias-corrected estimate of the momentum vector of the previous time step $\hat{m}_{t−1}$ with the bias-corrected estimate of the current momentum vector $\hat{m}_t$.

Thus, the Nadam update is as follows,

3) RMSProp