Adagrad and Adadelta Optimizers - Implementation and Testing
20 Jul 2018
In this blog post, I'll be explaining the implementation of the Adagrad and Adadelta optimizers.
Adagrad:
AdaGrad is an optimization method that allows different step sizes for different features. It adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent ones, which increases the influence of rare but informative features. For this reason, it is well suited for dealing with sparse data.
Thus, the update is implemented as follows (similar to the TensorFlow implementation):
V_t = V_{t-1} + currentSquaredGradients
theta = theta - learningRate * currentGradients / sqrt(V_t + epsilon)
So, one step of the update is performed as shown below.
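Here is a minimal NumPy sketch of a single Adagrad step following the pseudocode above; the function name, signature, and default values are illustrative and not taken from any particular library.

import numpy as np

def adagrad_step(theta, gradients, v, learning_rate=0.01, epsilon=1e-8):
    # Accumulate the sum of squared gradients: V_t = V_{t-1} + g_t^2
    v = v + gradients ** 2
    # Per-parameter step, scaled by 1 / sqrt(V_t + epsilon)
    theta = theta - learning_rate * gradients / np.sqrt(v + epsilon)
    return theta, v

The accumulator v starts at zeros of the same shape as theta and is carried across iterations, which is what makes the effective step size shrink for frequently updated parameters.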
Testing Adagrad:
I used the same unit-testing approach as for the SGD optimizer. Have a look at the Testing the SGD optimizer post.
The above figure shows the convergence of the training and testing errors for the Adagrad Optimizer during the unit tests.
Adadelta:
Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the accumulation to a window of past gradients of some fixed size, implemented as an exponentially decaying average.
Thus, the update is implemented as follows (similar to the TensorFlow implementation):
V_t = rho * V_{t-1} + (1 - rho) * currentSquaredGradients
currentUpdates = sqrt(W_{t-1} + epsilon) * currentGradients / sqrt(V_t + epsilon)
theta = theta - learningRate * currentUpdates
W_t = rho * W_{t-1} + (1 - rho) * currentSquaredUpdates
So, one step of the update is performed as shown below.
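And here is an analogous NumPy sketch of one Adadelta step, again with illustrative names only: v is the decaying average of squared gradients and w is the decaying average of squared updates.

import numpy as np

def adadelta_step(theta, gradients, v, w, learning_rate=1.0, rho=0.95, epsilon=1e-6):
    # V_t = rho * V_{t-1} + (1 - rho) * g_t^2
    v = rho * v + (1.0 - rho) * gradients ** 2
    # currentUpdates = sqrt(W_{t-1} + epsilon) * g_t / sqrt(V_t + epsilon)
    updates = np.sqrt(w + epsilon) * gradients / np.sqrt(v + epsilon)
    # theta = theta - learningRate * currentUpdates
    theta = theta - learning_rate * updates
    # W_t = rho * W_{t-1} + (1 - rho) * currentUpdates^2
    w = rho * w + (1.0 - rho) * updates ** 2
    return theta, v, w

Both accumulators start at zeros and are carried from step to step; note that w is updated only after it has been used to scale the current step, which is why the pseudocode reads W_{t-1} in the second line.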
Testing Adadelta:
I used the same unit-testing approach as for the SGD optimizer. Have a look at the Testing the SGD optimizer post.
The above figure shows the convergence of the training and testing errors for the Adadelta Optimizer during the unit tests.