Improving Generalization Performance by Switching from Adam to SGD

This paper investigates a method called SWATS, which switches from the adaptive optimizer Adam to SGD (Stochastic Gradient Descent) partway through training. The goal is to combine the strengths of both optimizers: Adam's fast initial progress and SGD's superior generalization at the end of training. The paper proposes a criterion for deciding when to switch, and it estimates the SGD learning rate to use after the switch by projecting the Adam update step onto the gradient direction. The effectiveness of SWATS is demonstrated on several image classification and language modeling tasks, where it performs comparably to the better of the two individual optimizers.

The paper can be found here: https://www.semanticscholar.org/reader/8f253d759d99e92888bf9eb595c59cf962fd9069
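To make the switching idea concrete, here is a minimal sketch in plain NumPy on a toy quadratic objective. It follows the mechanism described above: during the Adam phase, each step is projected onto the gradient to get a scalar SGD learning-rate estimate, an exponential moving average of that estimate is maintained, and training switches to plain SGD once the average stabilizes. The hyperparameter values (`lr`, `switch_tol`, the toy objective) are illustrative assumptions, not the paper's exact settings, and this is not the authors' reference implementation.

```python
import numpy as np

def grad(w):
    # Toy objective: f(w) = 0.5 * ||w||^2, so grad f(w) = w (assumed for illustration)
    return w

w = np.array([5.0, -3.0])
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
switch_tol = 1e-5              # tolerance in the switching test (assumed value)
m = np.zeros_like(w)
v = np.zeros_like(w)
lam = 0.0                      # moving average of the SGD learning-rate estimate
phase, sgd_lr = "adam", None

for k in range(1, 10001):
    g = grad(w)
    if phase == "adam":
        # Standard Adam update with bias correction
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** k)
        v_hat = v / (1 - beta2 ** k)
        p = -lr * m_hat / (np.sqrt(v_hat) + eps)   # Adam step
        w = w + p

        if np.dot(p, g) != 0.0:
            # Project the Adam step onto the gradient to get a scalar
            # learning-rate estimate gamma for an equivalent SGD step
            gamma = np.dot(p, p) / (-np.dot(p, g))
            lam = beta2 * lam + (1 - beta2) * gamma
            lam_hat = lam / (1 - beta2 ** k)
            # Switch once the averaged estimate has stabilized
            if k > 1 and abs(lam_hat - gamma) < switch_tol:
                phase, sgd_lr = "sgd", lam_hat
                print(f"switched to SGD at step {k} with lr ~ {sgd_lr:.6f}")
    else:
        # Plain SGD phase using the estimated learning rate
        w = w - sgd_lr * g

print("final w:", w)
```

In practice the same bookkeeping would be done per parameter group inside an optimizer class (e.g., a PyTorch `Optimizer` subclass), but the scalar version above is enough to show how the projection-based estimate drives the switch.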

All Images by NosazenaNFT: https://linktr.ee/nosazenanfts

Nosacapital Square (On Binance): https://www.binance.com/en/square/profile/NosaCapital

Open a Free Morpher Account (and get free tokens): https://nosax.me/morpher

#nosacapital #machinelearning #machinelearningmodels #SGD #ADAM #SWATS