Visual Computing, University of Konstanz

Advances in Neural Information Processing Systems (NeurIPS) 2024

Newton Losses: Using Curvature Information for Learning with Differentiable Algorithms

F. Petersen, C. Borgelt, T. Sutter, H. Kuehne, O. Deussen, S. Ermon
Teaser figure: Ablation study w.r.t. the Tikhonov regularization strength hyperparameter λ. Shown is the element-wise ranking accuracy (fraction of individual element ranks correctly identified), averaged over 10 seeds, with each individual seed additionally shown at low opacity in the background. Left: NeuralSort. Right: SoftSort. Both for n = 5. Newton Losses, in both the Hessian and the Fisher variant, improve significantly over the baseline across up to (and beyond) 6 orders of magnitude of variation in the hyperparameter λ. Note the logarithmic horizontal axis.

Material

Paper (.pdf, 619.3KB)

Abstract

When training neural networks with custom objectives, such as ranking losses and shortest-path losses, a common problem is that they are, per se, non-differentiable. A popular approach is to continuously relax the objectives to provide gradients, enabling learning. However, such differentiable relaxations are often non-convex and can exhibit vanishing and exploding gradients, making them (already in isolation) hard to optimize. In this setting, the loss function poses the bottleneck when training a deep neural network. We present Newton Losses, a method for improving the performance of existing hard-to-optimize losses by exploiting their second-order information via their empirical Fisher and Hessian matrices. Instead of training the neural network with second-order techniques, we only utilize the loss function's second-order information to replace it with a Newton Loss, while training the network with gradient descent. This makes our method computationally efficient. We apply Newton Losses to eight differentiable algorithms for sorting and shortest-paths, achieving significant improvements for less-optimized differentiable algorithms, and consistent improvements even for well-optimized differentiable algorithms.
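To illustrate the idea, the following is a minimal sketch (not the authors' reference implementation; the helper name, the per-sample loop, and the PyTorch API choices are assumptions): the Hessian-based variant can be read as taking a Tikhonov-regularized Newton step on the relaxed loss with respect to the network's output, and then training the network by gradient descent on a simple quadratic loss toward the resulting per-sample target.

```python
import torch

def newton_loss(y, loss_fn, lam=1.0):
    """Minimal sketch of a Hessian-based Newton Loss (illustrative only).

    y       : network outputs, shape (batch, n)
    loss_fn : differentiable relaxed loss for a single sample, (n,) -> scalar
    lam     : Tikhonov regularization strength lambda
    """
    targets = []
    for yi in y.detach():  # curvature is computed for the loss only, not the network
        yi = yi.clone().requires_grad_(True)
        (gi,) = torch.autograd.grad(loss_fn(yi), yi)            # gradient of the relaxed loss
        hi = torch.autograd.functional.hessian(loss_fn, yi)     # Hessian w.r.t. the output
        hi_reg = hi + lam * torch.eye(yi.numel(), dtype=yi.dtype, device=yi.device)
        zi = yi - torch.linalg.solve(hi_reg, gi)                # regularized Newton target
        targets.append(zi.detach())
    z = torch.stack(targets)
    # The network itself is trained with plain gradient descent on a quadratic
    # loss pulling its outputs toward the per-sample Newton targets.
    return 0.5 * ((y - z) ** 2).sum(dim=1).mean()
```

In the Fisher variant, an empirical Fisher estimate (built from gradient outer products) would take the place of the Hessian `hi`; in either case, second-order information is only ever computed for the loss function, not backpropagated through the network, which is what keeps the approach computationally cheap.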

BibTeX

@inproceedings{Petersen2024NewtonLossesUsing,
  author    = {F. Petersen and C. Borgelt and T. Sutter and H. Kuehne and O. Deussen and S. Ermon},
  booktitle = {Advances in Neural Information Processing Systems},
  doi       = {10.52202/079017-0821},
  editor    = {A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
  pages     = {26113--26135},
  publisher = {Curran Associates, Inc.},
  title     = {Newton Losses: Using Curvature Information for Learning with Differentiable Algorithms},
  url       = {https://proceedings.neurips.cc/paper_files/paper/2024/file/2e102d937d094b7211c4d32ce1f1126c-Paper-Conference.pdf},
  volume    = {37},
  year      = {2024}
}