A gentle Hessian for efficient gradient descent

R. Collobert and S. Bengio. A gentle Hessian for efficient gradient descent. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2004.

Abstract

Several second-order optimization methods for gradient descent algorithms have been proposed over the years, but they usually need to compute the inverse of the Hessian of the cost function (or an approximation of this inverse) during training. In most cases, this leads to an O(n^2) cost in time and space per iteration, where n is the number of parameters, which is prohibitive for large n. We propose instead a study of the Hessian before training. Based on a second order analysis, we show that a block-diagonal Hessian yields an easier optimization problem than a full Hessian. We also show that the condition of block-diagonality in common machine learning models can be achieved by simply selecting an appropriate training criterion. Finally, we propose a version of the SVM criterion applied to MLPs, which verifies the aspects highlighted in this second order analysis, but also yields very good generalization performance in practice, taking advantage of the margin effect. Several empirical comparisons on two benchmark datasets are given to illustrate this approach.

BibTeX

@inproceedings{collobert:2004,
  author = {R. Collobert and S. Bengio},
  title = {A Gentle Hessian for Efficient Gradient Descent},
  booktitle = {{IEEE} International Conference on Acoustics, Speech, and Signal Processing, {ICASSP}},
  year = 2004
}

Notes

Probably because neural networks were studied in the past on very small databases, many people believe that neural networks overfit easily. I would correct this as follows: neural networks do overfit if they are not well tuned (like an SVM having a Gaussian kernel with a small variance!). But in fact, in many cases, they are hard to train.

We show here that the choice of the architecture itself has an impact on the optimization.

In particular, we show that the margin criterion used in SVMs is well suited to neural network optimization: with the hinge loss, the Hessian is better conditioned than with classical losses such as the Mean Squared Error.
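
As a rough illustration of one ingredient behind this (a minimal sketch, not the analysis from the paper), the snippet below compares the curvature of the two losses with respect to the model output. The function names, the margin of 1, the 1/2 factor in the squared error, and the sample outputs are all assumptions made for this example. The hinge loss is piecewise linear, so away from its kink its second derivative with respect to the output is zero, while the squared error has constant curvature. How this carries over to the full Hessian with respect to the parameters is what the paper studies; the sketch does not attempt to show it.

```python
def hinge_loss(y, f, margin=1.0):
    """Hinge (margin) loss for a label y in {-1, +1} and a real output f.
    The margin value of 1.0 is an assumption for this illustration."""
    return max(0.0, margin - y * f)


def mse_loss(y, f):
    """Squared error with a 1/2 factor (assumed convention)."""
    return 0.5 * (y - f) ** 2


def second_derivative(loss, y, f, eps=1e-3):
    """Central finite-difference estimate of d^2 loss / d f^2."""
    return (loss(y, f + eps) - 2.0 * loss(y, f) + loss(y, f - eps)) / eps ** 2


# Hypothetical model outputs for a positive example (y = +1),
# chosen away from the hinge kink at f = 1.
outputs = [-2.0, -0.5, 0.5, 2.0]
hinge_curvature = [second_derivative(hinge_loss, 1.0, f) for f in outputs]
mse_curvature = [second_derivative(mse_loss, 1.0, f) for f in outputs]

for f, h, m in zip(outputs, hinge_curvature, mse_curvature):
    print(f"f = {f:+.1f}   hinge curvature = {h:.4f}   MSE curvature = {m:.4f}")
```

Note also that for the output f = 2.0 the example is classified beyond the margin, so the hinge loss and its gradient are both zero there, whereas the squared error still penalizes that output.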


Last modified on Tue Apr 15 16:17:13 2008