AdaGrad is an optimization method that allows a different step size for each parameter. It adapts the learning rate to the parameters, performing larger updates for infrequently updated parameters and smaller updates for frequent ones, which increases the influence of rare but informative features. For this reason, it is well suited to sparse data.

Thus, the update is implemented as follows (similar to the TensorFlow implementation):

Vt = Vt-1 + currentSquaredGradients
theta = theta - learningRate * currentGradients / (sqrt(Vt + epsilon))


So, one step of the update is performed as shown above.
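The update rule above can be sketched in a few lines of NumPy. The values below (learning rate, epsilon, the toy parameter vector, and gradient) are illustrative assumptions, not taken from the actual optimizer tests:

```python
import numpy as np

learning_rate = 0.1   # assumed value for illustration
epsilon = 1e-7        # small constant for numerical stability

theta = np.array([1.0, 2.0])   # toy parameter vector
V = np.zeros_like(theta)       # accumulated squared gradients

grad = np.array([0.5, -0.1])   # current gradients (toy values)

# Vt = Vt-1 + currentSquaredGradients
V = V + grad ** 2
# theta = theta - learningRate * currentGradients / sqrt(Vt + epsilon)
theta = theta - learning_rate * grad / np.sqrt(V + epsilon)
```

Note that on the very first step the accumulator equals the squared gradient, so each parameter moves by roughly `learning_rate` in the direction opposite its gradient, regardless of the gradient's magnitude.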

I used the same unit-testing approach as for the SGD optimizer; have a look at the Testing the SGD optimizer post.

The figure above shows the convergence of the training and testing errors for the Adagrad optimizer during the unit tests.

Adadelta is an extension of Adagrad that addresses its monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the accumulation to a window of some fixed size $w$ (implemented in practice as an exponentially decaying running average).

Thus, the update is implemented as follows (similar to the TensorFlow implementation):

Vt = rho * Vt-1 + (1-rho) * currentSquaredGradients
currentUpdates = sqrt(Wt-1 + epsilon) / sqrt(Vt + epsilon) * currentGradients
theta = theta - learningRate * currentUpdates
Wt = rho * Wt-1 + (1-rho) * currentSquaredUpdates



So, one step of the update is performed as shown above.
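The Adadelta step above can likewise be sketched in NumPy. As before, the constants and the toy gradient are illustrative assumptions; `V` is the running average of squared gradients and `W` the running average of squared updates:

```python
import numpy as np

rho = 0.95            # assumed decay rate for illustration
epsilon = 1e-7        # small constant for numerical stability
learning_rate = 1.0   # TF's Adadelta also exposes a learning rate

theta = np.array([1.0, 2.0])   # toy parameter vector
V = np.zeros_like(theta)       # running average of squared gradients
W = np.zeros_like(theta)       # running average of squared updates

grad = np.array([0.5, -0.1])   # current gradients (toy values)

# Vt = rho * Vt-1 + (1-rho) * currentSquaredGradients
V = rho * V + (1 - rho) * grad ** 2
# currentUpdates = sqrt(Wt-1 + epsilon) / sqrt(Vt + epsilon) * currentGradients
update = np.sqrt(W + epsilon) / np.sqrt(V + epsilon) * grad
# theta = theta - learningRate * currentUpdates
theta = theta - learning_rate * update
# Wt = rho * Wt-1 + (1-rho) * currentSquaredUpdates
W = rho * W + (1 - rho) * update ** 2
```

Because `W` starts at zero, the first update is tiny (on the order of `sqrt(epsilon)`); the effective step size grows as the running average of updates `W` accumulates.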