
How to use adafactor in a standard TF training?

Open gai1995 opened this issue 5 years ago • 2 comments

Description

Hi, I want to use Adafactor to replace Adam in my code, but I do not use the T2T framework. Starting from the Google-released BERT fine-tuning code, I just copied the source of your Adafactor implementation and call it like this:

optimizer = AdafactorOptimizer()
tvars = tf.trainable_variables()
grads, = blabla...
train_op = optimizer.apply_gradients(list(zip(grads, tvars)), name='train_op')
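Spelled out, the pattern I am using looks roughly like this (a minimal sketch that imports Adafactor from tensor2tensor instead of my copied file, and uses a toy loss just so the snippet is self-contained; the real loss comes from the BERT fine-tuning graph):

import tensorflow as tf
from tensor2tensor.utils import adafactor

# Toy stand-in for the real BERT fine-tuning loss, so the snippet runs on its own.
w = tf.get_variable("w", shape=[4, 4])
loss = tf.reduce_mean(tf.square(w - 1.0))

tvars = tf.trainable_variables()
grads = tf.gradients(loss, tvars)

optimizer = adafactor.AdafactorOptimizer()
global_step = tf.train.get_or_create_global_step()
train_op = optimizer.apply_gradients(
    list(zip(grads, tvars)), global_step=global_step, name='train_op')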

But it does not seem to work: the loss does not decrease and the accuracy is very low. Also, I notice the memory usage is almost the same as with Adam. What is wrong here? Could anyone explain, or can Adafactor only be used inside the T2T framework?

######UPDATE#########
It works now, but the learning rate must be passed in manually. The problem is that Adafactor converges very slowly and performs worse than Adam, maybe because of the learning rate I chose. Is there any suggestion on how to fix this (how to choose a proper decay_rate and learning_rate)? Thanks!
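For reference, this is the kind of schedule I am experimenting with now, roughly following the inverse-sqrt learning rate and the 1 - t^(-0.8) second-moment decay suggested in the Adafactor paper (a rough sketch; warmup_steps and the 0.8 exponent are values I picked, not anything prescribed by T2T):

import tensorflow as tf

step = tf.cast(tf.train.get_or_create_global_step(), tf.float32)
warmup_steps = 10000.0  # made-up value, needs tuning
learning_rate = tf.math.rsqrt(tf.maximum(step, warmup_steps))

# Second-moment decay rate of the form 1 - (step + 1)^(-0.8), as in the paper.
decay_rate = 1.0 - tf.pow(step + 1.0, -0.8)

optimizer = AdafactorOptimizer(
    learning_rate=learning_rate,
    decay_rate=decay_rate,
    beta1=0.0)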

Environment information

OS: Ubuntu 16.04.6 LTS
Python: 3.5.2
TensorFlow: 1.15.0
tensor2tensor: the latest one


gai1995 avatar Jan 14 '20 11:01 gai1995

Same question. I have tried using tensor2tensor.utils.adafactor.AdafactorOptimizer with TensorFlow 1.15 to train an ALBERT model, but the loss did not decrease either.

lxylxyoo avatar Feb 15 '20 04:02 lxylxyoo

Same question. I even hit an error: AttributeError: 'AdafactorOptimizer' object has no attribute 'get_gradients'
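(My guess is that get_gradients is a Keras-optimizer method, while this AdafactorOptimizer subclasses tf.train.Optimizer, which only exposes compute_gradients / apply_gradients / minimize. A rough, self-contained sketch of driving it through the tf.train API instead, with a toy loss standing in for the real model:)

import tensorflow as tf
from tensor2tensor.utils import adafactor

# Toy loss just to exercise the optimizer end to end.
w = tf.get_variable("w", shape=[10], initializer=tf.zeros_initializer())
loss = tf.reduce_mean(tf.square(w - 1.0))

optimizer = adafactor.AdafactorOptimizer()
# minimize() wraps compute_gradients() + apply_gradients(); there is no get_gradients().
train_op = optimizer.minimize(
    loss, global_step=tf.train.get_or_create_global_step())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(5):
        _, loss_val = sess.run([train_op, loss])
        print(loss_val)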

shizhediao avatar Nov 01 '22 08:11 shizhediao