Adagrad and Adadelta Optimizers - Implementation and Testing

In this blog post, I’ll be explaining the implementation of Adagrad and Adadelta optimizers.

Adagrad:

AdaGrad is an optimization method that allows different step sizes for different features. It increases the influence of rare but informative features, i.e., it adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent ones. For this reason, it is well-suited for dealing with sparse data.

Thus, the update is implemented as follows (similar to the TensorFlow implementation):

Vt = Vt-1 + currentSquaredGradients
theta = theta - learningRate * currentGradients / (sqrt(Vt + epsilon))

So, one step of the update is performed as in the sketch below.
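To make the two lines above concrete, here is a minimal NumPy sketch of a single Adagrad step. It is only an illustration, not the actual implementation: the names adagrad_step, accumulator, learning_rate, and epsilon are my own, and gradient is assumed to be a callable returning the gradient of the objective at theta.

import numpy as np

def adagrad_step(theta, accumulator, gradient, learning_rate=0.01, epsilon=1e-8):
    grad = gradient(theta)
    # Vt = Vt-1 + currentSquaredGradients
    accumulator = accumulator + grad ** 2
    # theta = theta - learningRate * currentGradients / sqrt(Vt + epsilon)
    theta = theta - learning_rate * grad / np.sqrt(accumulator + epsilon)
    return theta, accumulator

# Example: minimizing f(theta) = theta^2, whose gradient is 2 * theta.
theta, accumulator = np.array([5.0]), np.zeros(1)
for _ in range(100):
    theta, accumulator = adagrad_step(theta, accumulator, lambda t: 2.0 * t)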

Testing Adagrad:

I used the same unit-test approach as for the SGD optimizer. Have a look at the Testing the SGD optimizer post.

[Figure: Adagrad unit test plot]

The above figure shows the convergence of the training and testing errors for the Adagrad Optimizer during the unit tests.

Adadelta:

Adadelta is an extension of Adagrad that seeks to reduce its monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size.

Thus, the update is implemented as follows (similar to the TensorFlow implementation):

Vt = rho * Vt-1 + (1 - rho) * currentSquaredGradients
currentUpdates = sqrt(Wt-1 + epsilon) * currentGradients / sqrt(Vt + epsilon)
theta = theta - learningRate * currentUpdates
Wt = rho * Wt-1 + (1 - rho) * currentSquaredUpdates

So, one step of the update is performed as in the sketch below.
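As with Adagrad, here is a minimal NumPy sketch of one Adadelta step, under the same assumptions (an illustrative adadelta_step with a gradient callable; the names are mine, not the library's). Note that accum_updates plays the role of Wt and is refreshed only after the parameter step, so the step itself uses Wt-1.

import numpy as np

def adadelta_step(theta, accum_grad, accum_updates, gradient,
                  learning_rate=1.0, rho=0.95, epsilon=1e-6):
    grad = gradient(theta)
    # Vt = rho * Vt-1 + (1 - rho) * currentSquaredGradients
    accum_grad = rho * accum_grad + (1.0 - rho) * grad ** 2
    # currentUpdates = sqrt(Wt-1 + epsilon) * currentGradients / sqrt(Vt + epsilon)
    update = np.sqrt(accum_updates + epsilon) * grad / np.sqrt(accum_grad + epsilon)
    # theta = theta - learningRate * currentUpdates
    theta = theta - learning_rate * update
    # Wt = rho * Wt-1 + (1 - rho) * currentSquaredUpdates
    accum_updates = rho * accum_updates + (1.0 - rho) * update ** 2
    return theta, accum_grad, accum_updates

# Example: the same quadratic objective as before.
theta = np.array([5.0])
accum_grad, accum_updates = np.zeros(1), np.zeros(1)
for _ in range(1000):
    theta, accum_grad, accum_updates = adadelta_step(
        theta, accum_grad, accum_updates, lambda t: 2.0 * t)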

Testing Adadelta:

I used the same unit-test approach as for the SGD optimizer. Have a look at the Testing the SGD optimizer post.

[Figure: Adadelta unit test plot]

The above figure shows the convergence of the training and testing errors for the Adadelta Optimizer during the unit tests.

References:

1) Adagrad Optimizer - TensorFlow Implementation

2) Adadelta Optimizer - TensorFlow Implementation