Optimizers
Usage of optimizers
An optimizer is one of the two arguments required for compiling a model.
Scala:
model.compile(loss = "mean_squared_error", optimizer = "sgd")
Python:
model.compile(loss='mean_squared_error', optimizer='sgd')
Scala:
model.compile(loss = "mean_squared_error", optimizer = Adam())
Python:
model.compile(loss='mean_squared_error', optimizer=Adam())
Available optimizers
SGD
A plain implementation of stochastic gradient descent (SGD) that provides an optimize method. Once an optimization method is set when creating an Optimizer, the Optimizer calls it at the end of each iteration.
Scala:
val optimMethod = SGD(learningRate = 1e-3, learningRateDecay = 0.0,
weightDecay = 0.0, momentum = 0.0, dampening = Double.MaxValue,
nesterov = false, learningRateSchedule = Default(),
learningRates = null, weightDecays = null)
Parameters:
learningRate: learning rate
learningRateDecay: learning rate decay
weightDecay: weight decay
momentum: momentum
dampening: dampening for momentum
nesterov: enables Nesterov momentum
learningRateSchedule: learning rate scheduler
learningRates: 1D tensor of individual learning rates
weightDecays: 1D tensor of individual weight decays
Python:
optim_method = SGD(learningrate=1e-3, learningrate_decay=0.0, weightdecay=0.0,
momentum=0.0, dampening=DOUBLEMAX, nesterov=False,
leaningrate_schedule=None, learningrates=None,
weightdecays=None)
Parameters:
learningrate: learning rate
learningrate_decay: learning rate decay
weightdecay: weight decay
momentum: momentum
dampening: dampening for momentum
nesterov: enables Nesterov momentum
leaningrate_schedule: learning rate scheduler
learningrates: 1D tensor of individual learning rates
weightdecays: 1D tensor of individual weight decays
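To make the roles of the SGD parameters above concrete, here is a minimal pure-Python sketch of one update step. It follows the common Torch-style formulation and is not taken from the library source; the function name `sgd_step`, the `lr / (1 + step * decay)` schedule, and the `dampening=0.0` default are assumptions made for illustration.

```python
def sgd_step(x, grad, velocity, step, learningrate=1e-3, learningrate_decay=0.0,
             weightdecay=0.0, momentum=0.0, dampening=0.0, nesterov=False):
    """Apply one SGD update to a list of floats; return (new_x, new_velocity)."""
    # Decayed learning rate for this step (assumed schedule, see lead-in).
    clr = learningrate / (1.0 + step * learningrate_decay)
    new_x, new_velocity = [], []
    for i, (xi, gi) in enumerate(zip(x, grad)):
        g = gi + weightdecay * xi  # L2 weight decay folded into the gradient
        v = 0.0
        if momentum != 0.0:
            if velocity is None:
                v = g  # first step: the velocity starts at the gradient
            else:
                v = momentum * velocity[i] + (1.0 - dampening) * g
            # Nesterov looks ahead along the velocity; plain momentum uses it directly.
            g = g + momentum * v if nesterov else v
        new_x.append(xi - clr * g)
        new_velocity.append(v)
    return new_x, new_velocity


# Minimize f(x) = x^2 (gradient 2x) starting from x = 1.0.
x, v = [1.0], None
for step in range(300):
    x, v = sgd_step(x, [2.0 * x[0]], v, step, learningrate=0.1, momentum=0.9)
```

With `momentum=0.0` the step reduces to `x - clr * grad`, which is a quick way to sanity-check the other options one at a time.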
Adam
An implementation of Adam, a first-order gradient-based method for optimizing stochastic objective functions. See http://arxiv.org/pdf/1412.6980.pdf
Scala:
val optimMethod = new Adam(learningRate = 1e-3, learningRateDecay = 0.0, beta1 = 0.9, beta2 = 0.999, Epsilon = 1e-8)
Parameters:
learningRate: learning rate. Default value is 1e-3.
learningRateDecay: learning rate decay. Default value is 0.0.
beta1: first moment coefficient. Default value is 0.9.
beta2: second moment coefficient. Default value is 0.999.
Epsilon: for numerical stability. Default value is 1e-8.
Python:
optim_method = Adam(learningrate=1e-3, learningrate_decay=0.0, beta1=0.9, beta2=0.999, epsilon=1e-8)
Parameters:
learningrate: learning rate. Default value is 1e-3.
learningrate_decay: learning rate decay. Default value is 0.0.
beta1: first moment coefficient. Default value is 0.9.
beta2: second moment coefficient. Default value is 0.999.
epsilon: for numerical stability. Default value is 1e-8.
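As a rough illustration of how beta1, beta2, and epsilon enter the update, the following sketch implements one scalar Adam step as described in the paper linked above. The function name `adam_step` is hypothetical, learning rate decay is left out for brevity, and the library's implementation may differ in detail.

```python
import math


def adam_step(x, grad, m, v, t, learningrate=1e-3, beta1=0.9, beta2=0.999,
              epsilon=1e-8):
    """One Adam update on a scalar; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    x = x - learningrate * m_hat / (math.sqrt(v_hat) + epsilon)
    return x, m, v


# First step from x = 1.0 with gradient 2.0; moments start at zero.
x, m, v = adam_step(1.0, 2.0, 0.0, 0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `learningrate` regardless of the gradient's magnitude.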
Adamax
An implementation of Adamax: http://arxiv.org/pdf/1412.6980.pdf
Scala:
val optimMethod = new Adamax(learningRate = 0.002, beta1 = 0.9, beta2 = 0.999, Epsilon = 1e-8)
Parameters:
learningRate: learning rate
beta1: first moment coefficient
beta2: second moment coefficient
Epsilon: for numerical stability
Python:
optim_method = Adamax(learningrate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8)
Parameters:
learningrate: learning rate
beta1: first moment coefficient
beta2: second moment coefficient
epsilon: for numerical stability
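The sketch below shows one scalar Adamax step, the infinity-norm variant of Adam from the same paper. The name `adamax_step` is hypothetical, and adding epsilon to the denominator is an assumption made here to mirror the epsilon parameter above; the paper's formulation has no such term, and the library may place it differently.

```python
def adamax_step(x, grad, m, u, t, learningrate=0.002, beta1=0.9, beta2=0.999,
                epsilon=1e-8):
    """One Adamax update on a scalar; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad   # first moment estimate
    u = max(beta2 * u, abs(grad))          # exponentially weighted infinity norm
    # Only the first moment needs bias correction; epsilon placement is assumed.
    x = x - (learningrate / (1.0 - beta1 ** t)) * m / (u + epsilon)
    return x, m, u


# First step from x = 1.0 with gradient 2.0; state starts at zero.
x, m, u = adamax_step(1.0, 2.0, 0.0, 0.0, t=1)
```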
Adadelta
An implementation of Adadelta for SGD. It was proposed in "ADADELTA: An Adaptive Learning Rate Method": http://arxiv.org/abs/1212.5701
Scala:
val optimMethod = Adadelta(decayRate = 0.9, Epsilon = 1e-10)
Parameters:
decayRate: decay rate, also called the interpolation parameter rho
Epsilon: for numerical stability
Python:
optim_method = Adadelta(decayrate=0.9, epsilon=1e-10)
Parameters:
decayrate: decay rate, also called the interpolation parameter rho
epsilon: for numerical stability
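To show what the decay rate rho and epsilon control, here is a minimal scalar sketch of one Adadelta step following the paper's update rule. The function name `adadelta_step` and its state arguments are hypothetical and not part of the library API.

```python
import math


def adadelta_step(x, grad, acc_grad, acc_delta, decayrate=0.9, epsilon=1e-10):
    """One Adadelta update on a scalar; returns (x, acc_grad, acc_delta)."""
    # Running average of squared gradients.
    acc_grad = decayrate * acc_grad + (1.0 - decayrate) * grad * grad
    # Step size is the ratio of RMS of past updates to RMS of gradients.
    delta = -math.sqrt(acc_delta + epsilon) / math.sqrt(acc_grad + epsilon) * grad
    # Running average of squared updates.
    acc_delta = decayrate * acc_delta + (1.0 - decayrate) * delta * delta
    return x + delta, acc_grad, acc_delta


# First step from x = 1.0 with gradient 2.0; both accumulators start at zero.
x, acc_grad, acc_delta = adadelta_step(1.0, 2.0, 0.0, 0.0)
```

Note that no learning rate appears: the step size comes entirely from the two running averages, which is why epsilon matters on the first step when both are zero.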
Adagrad
An implementation of Adagrad. See the original paper: http://jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
Scala:
val optimMethod = new Adagrad(learningRate = 1e-3, learningRateDecay = 0.0, weightDecay = 0.0)
Parameters:
learningRate: learning rate
learningRateDecay: learning rate decay
weightDecay: weight decay
Python:
optim_method = Adagrad(learningrate=1e-3, learningrate_decay=0.0, weightdecay=0.0)
Parameters:
learningrate: learning rate
learningrate_decay: learning rate decay
weightdecay: weight decay
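The following sketch illustrates one scalar Adagrad step using the three parameters listed above. It is an illustrative approximation, not the library's code: the name `adagrad_step`, the `lr / (1 + step * decay)` schedule, and the small constant `eps` are all assumptions.

```python
import math


def adagrad_step(x, grad, accum, step, learningrate=1e-3, learningrate_decay=0.0,
                 weightdecay=0.0, eps=1e-10):
    """One Adagrad update on a scalar; returns (x, accum)."""
    # Decayed learning rate for this step (assumed schedule).
    clr = learningrate / (1.0 + step * learningrate_decay)
    g = grad + weightdecay * x   # L2 weight decay folded into the gradient
    accum = accum + g * g        # sum of all squared gradients so far
    # eps is a hypothetical guard against division by zero.
    x = x - clr * g / (math.sqrt(accum) + eps)
    return x, accum


# Two steps with the same gradient: the second step is smaller than the first.
x1, acc1 = adagrad_step(1.0, 2.0, 0.0, step=0)
x2, acc2 = adagrad_step(x1, 2.0, acc1, step=1)
```

Because the accumulator only grows, the effective step size shrinks over time even when the gradient stays constant.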
RMSprop
An implementation of RMSprop (Reference: http://arxiv.org/pdf/1308.0850v5.pdf, Sec 4.2)
Scala:
val optimMethod = new RMSprop(learningRate = 0.002, learningRateDecay = 0.0, decayRate = 0.99, Epsilon = 1e-8)
Parameters:
learningRate: learning rate
learningRateDecay: learning rate decay
decayRate: decay rate, also called rho
Epsilon: for numerical stability
Python:
optim_method = RMSprop(learningrate=0.002, learningrate_decay=0.0, decayrate=0.99, epsilon=1e-8)
Parameters:
learningrate: learning rate
learningrate_decay: learning rate decay
decayrate: decay rate, also called rho
epsilon: for numerical stability
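For comparison with Adagrad, here is a minimal scalar sketch of one RMSprop step in its commonly used uncentered form. The function name `rmsprop_step` is hypothetical, learning rate decay is omitted, and the library's implementation may differ from this form.

```python
import math


def rmsprop_step(x, grad, accum, learningrate=0.002, decayrate=0.99, epsilon=1e-8):
    """One RMSprop update on a scalar; returns (x, accum)."""
    # Exponential moving average of squared gradients (rho = decayrate).
    accum = decayrate * accum + (1.0 - decayrate) * grad * grad
    # Scale the step by the root of that average.
    x = x - learningrate * grad / (math.sqrt(accum) + epsilon)
    return x, accum


# First step from x = 1.0 with gradient 2.0; the accumulator starts at zero.
x, accum = rmsprop_step(1.0, 2.0, 0.0)
```

Unlike Adagrad's ever-growing sum, the moving average forgets old gradients, so the step size does not decay to zero on its own.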