Optimizers
Usage of optimizers
An optimizer is one of the two arguments required for compiling a model. You can either pass an optimizer by its string name, in which case its default parameters are used, or pass a configured optimizer instance, as the examples below show.
Scala:
model.compile(loss = "mean_squared_error", optimizer = "sgd")
Python:
model.compile(loss='mean_squared_error', optimizer='sgd')
Scala:
model.compile(loss = "mean_squared_error", optimizer = Adam())
Python:
model.compile(loss='mean_squared_error', optimizer=Adam())
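Passing an instance rather than a string lets you tune the optimizer's hyper-parameters. The snippet below is a minimal sketch that only reuses the Adam constructor documented later on this page; model is assumed to be the already-built model from the examples above.
Scala:
// Minimal sketch: configure Adam explicitly, then hand the instance to compile
// instead of the string name "adam".
val adam = new Adam(learningRate = 1e-4, learningRateDecay = 0.0,
  beta1 = 0.9, beta2 = 0.999, Epsilon = 1e-8)
model.compile(loss = "mean_squared_error", optimizer = adam)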
Available optimizers
SGD
A plain implementation of stochastic gradient descent (SGD) that provides the optimize method. After the optimization method is set when creating the Optimizer, the Optimizer calls it to update the weights at the end of each iteration.
Scala:
val optimMethod = SGD(learningRate = 1e-3, learningRateDecay = 0.0,
weightDecay = 0.0, momentum = 0.0, dampening = Double.MaxValue,
nesterov = false, learningRateSchedule = Default(),
learningRates = null, weightDecays = null)
Parameters:
learningRate: learning rate
learningRateDecay: learning rate decay
weightDecay: weight decay
momentum: momentum
dampening: dampening for momentum
nesterov: enables Nesterov momentum
learningRateSchedule: learning rate scheduler
learningRates: 1D tensor of individual learning rates
weightDecays: 1D tensor of individual weight decays
Python:
optim_method = SGD(learningrate=1e-3, learningrate_decay=0.0, weightdecay=0.0,
momentum=0.0, dampening=DOUBLEMAX, nesterov=False,
leaningrate_schedule=None, learningrates=None,
weightdecays=None)
Parameters:
learningrate: learning rate
learningrate_decay: learning rate decay
weightdecay: weight decay
momentum: momentum
dampening: dampening for momentum
nesterov: enables Nesterov momentum
leaningrate_schedule: learning rate scheduler
learningrates: 1D tensor of individual learning rates
weightdecays: 1D tensor of individual weight decays
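The parameters above plug into the standard SGD-with-momentum update. As a sketch of the usual formulation (not necessarily the exact implementation), with learning rate $\eta$, weight decay $\lambda$, momentum $\mu$, dampening $d$, and $g_t$ the gradient at the current weights $w_{t-1}$:
$$
\begin{aligned}
g_t &\leftarrow g_t + \lambda\, w_{t-1} \\
v_t &\leftarrow \mu\, v_{t-1} + (1 - d)\, g_t \\
w_t &= w_{t-1} - \eta\, (g_t + \mu\, v_t) \quad \text{(nesterov = true)} \\
w_t &= w_{t-1} - \eta\, v_t \quad \text{(nesterov = false)}
\end{aligned}
$$
learningRateDecay shrinks $\eta$ as iterations accumulate, learningRateSchedule lets a scheduler set $\eta$ per iteration, and learningRates / weightDecays supply per-parameter values of $\eta$ and $\lambda$.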
Adam
An implementation of Adam, a first-order gradient-based optimization method for stochastic objective functions: http://arxiv.org/pdf/1412.6980.pdf
Scala:
val optimMethod = new Adam(learningRate = 1e-3, learningRateDecay = 0.0, beta1 = 0.9, beta2 = 0.999, Epsilon = 1e-8)
Parameters:
learningRate: learning rate. Default value is 1e-3.
learningRateDecay: learning rate decay. Default value is 0.0.
beta1: first moment coefficient. Default value is 0.9.
beta2: second moment coefficient. Default value is 0.999.
Epsilon: for numerical stability. Default value is 1e-8.
Python:
optim_method = Adam(learningrate=1e-3, learningrate_decay=0.0, beta1=0.9, beta2=0.999, epsilon=1e-8)
Parameters:
learningrate: learning rate. Default value is 1e-3.
learningrate_decay: learning rate decay. Default value is 0.0.
beta1: first moment coefficient. Default value is 0.9.
beta2: second moment coefficient. Default value is 0.999.
epsilon: for numerical stability. Default value is 1e-8.
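For reference, the Adam update from the paper above is (a sketch; the implementation may differ in minor details): with gradient $g_t$ at step $t$ and learning rate $\eta$,
$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat m_t &= m_t / (1-\beta_1^t), \qquad \hat v_t = v_t / (1-\beta_2^t) \\
w_t &= w_{t-1} - \eta\, \hat m_t / (\sqrt{\hat v_t} + \epsilon)
\end{aligned}
$$
beta1 and beta2 are the decay rates of the first- and second-moment estimates, and epsilon keeps the denominator away from zero.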
Adamax
An implementation of Adamax, a variant of Adam based on the infinity norm: http://arxiv.org/pdf/1412.6980.pdf
Scala:
val optimMethod = new Adamax(learningRate = 0.002, beta1 = 0.9, beta2 = 0.999, Epsilon = 1e-8)
Parameters:
learningRate: learning rate
beta1: first moment coefficient
beta2: second moment coefficient
Epsilon: for numerical stability
Python:
optim_method = Adamax(learningrate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8)
Parameters:
learningrate: learning rate
beta1: first moment coefficient
beta2: second moment coefficient
epsilon: for numerical stability
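Adamax replaces Adam's second-moment estimate with an infinity-norm estimate. As a sketch, the update from the paper is (the paper's formulation has no $\epsilon$; the constructor exposes one for numerical stability):
$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
u_t &= \max(\beta_2\, u_{t-1},\, |g_t|) \\
w_t &= w_{t-1} - \frac{\eta}{1-\beta_1^t} \cdot \frac{m_t}{u_t + \epsilon}
\end{aligned}
$$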
Adadelta
An implementation of AdaDelta for SGD.
It has been proposed in ADADELTA: An Adaptive Learning Rate Method:
http://arxiv.org/abs/1212.5701.
Scala:
val optimMethod = Adadelta(decayRate = 0.9, Epsilon = 1e-10)
Parameters:
decayRate: decay rate, also called the interpolation parameter rho
Epsilon: for numerical stability
Python:
optim_method = AdaDelta(decayrate=0.9, epsilon=1e-10)
Parameters:
decayrate: decay rate, also called the interpolation parameter rho
epsilon: for numerical stability
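decayRate is the $\rho$ of the paper: the interpolation factor of the running averages of squared gradients and squared updates. As a sketch, the update proposed in the paper is:
$$
\begin{aligned}
E[g^2]_t &= \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2 \\
\Delta w_t &= -\,\frac{\sqrt{E[\Delta w^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}}\; g_t \\
E[\Delta w^2]_t &= \rho\, E[\Delta w^2]_{t-1} + (1-\rho)\, \Delta w_t^2 \\
w_t &= w_{t-1} + \Delta w_t
\end{aligned}
$$
Note that no explicit learning rate appears; the running average of past updates takes its place.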
Adagrad
An implementation of Adagrad. See the original paper: http://jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
Scala:
val optimMethod = new Adagrad(learningRate = 1e-3, learningRateDecay = 0.0, weightDecay = 0.0)
Parameters:
learningRate: learning rate
learningRateDecay: learning rate decay
weightDecay: weight decay
Python:
optim_method = Adagrad(learningrate=1e-3, learningrate_decay=0.0, weightdecay=0.0)
Parameters:
learningrate: learning rate
learningrate_decay: learning rate decay
weightdecay: weight decay
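Adagrad accumulates the squares of all past gradients and scales the step size per coordinate. As a sketch (a small smoothing term, written $\epsilon$ here, is typically added to avoid division by zero):
$$
\begin{aligned}
G_t &= G_{t-1} + g_t^2 \\
w_t &= w_{t-1} - \frac{\eta}{\sqrt{G_t} + \epsilon}\; g_t
\end{aligned}
$$
Here weightDecay corresponds to adding an L2 term $\lambda\, w_{t-1}$ to the gradient, and learningRateDecay shrinks $\eta$ over iterations.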
RMSprop
An implementation of RMSprop (Reference: http://arxiv.org/pdf/1308.0850v5.pdf, Sec 4.2)
Scala:
val optimMethod = new RMSprop(learningRate = 0.002, learningRateDecay = 0.0, decayRate = 0.99, Epsilon = 1e-8)
Parameters:
learningRate: learning rate
learningRateDecay: learning rate decay
decayRate: decay rate, also called rho
Epsilon: for numerical stability
Python:
optim_method = RMSprop(learningrate=0.002, learningrate_decay=0.0, decayrate=0.99, epsilon=1e-8)
Parameters:
learningrate: learning rate
learningrate_decay: learning rate decay
decayrate: decay rate, also called rho
epsilon: for numerical stability
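decayRate is the $\rho$ in the running average of squared gradients. In its commonly implemented form the update is (a sketch; the referenced paper's variant also subtracts a running mean of the gradient):
$$
\begin{aligned}
E[g^2]_t &= \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2 \\
w_t &= w_{t-1} - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon}\; g_t
\end{aligned}
$$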