Not converging on custom dataset.
Hi, glad to see the impressive project.
I overrode trainer.py according to the README and training runs properly. However, the model doesn't seem to converge: the loss stays around 0.7 no matter how many epochs it trains for. BTW, I use LLaMA, not OPT.
As a comparison, if I train the same dataset with full fine-tuning, everything works fine and the loss drops below 0.1 almost immediately.
So is there some constraint that may cause training to fail?
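For context, the zeroth-order step I adapted from the paper looks roughly like this (a simplified sketch, not the exact MeZO trainer code; `zo_step`, `loss_fn`, and the default values are just illustrative):

```python
import torch

def zo_step(model, loss_fn, batch, lr=1e-6, eps=1e-3):
    # One simplified MeZO-style zeroth-order step (SPSA with a shared random seed).
    # Sketch only: the real trainer regenerates z from the stored seed instead of
    # keeping perturbations in memory, and also handles schedulers, weight decay, etc.
    params = [p for p in model.parameters() if p.requires_grad]
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        torch.manual_seed(seed)              # regenerate the same z each time
        for p in params:
            z = torch.randn_like(p)
            p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1)                          # theta + eps * z
        loss_plus = loss_fn(model, batch)
        perturb(-2)                          # theta - eps * z
        loss_minus = loss_fn(model, batch)
        perturb(+1)                          # restore theta

        projected_grad = (loss_plus - loss_minus) / (2 * eps)

        torch.manual_seed(seed)              # same z again for the update
        for p in params:
            z = torch.randn_like(p)
            p.data.add_(-lr * projected_grad * z)

    return loss_plus
```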
Hi,
Thanks for your interest in our project! Can you specify the setting (task, model size, and hyperparameters like steps, learning rate, eps, etc.)? Also, did you use our codebase and replace OPT with LLaMA, or did you copy the MeZO part into another codebase?
Thanks for the reply. I integrated your trainer into my own codebase (LLaMA-7B). lr: 2e-5, steps: 3000, eps: 1e-3.
Hi,
I believe the learning rate is too large. I suggest starting from LR=1e-6, EPS=1e-3, and tuning the hyperparameters with grid search. Also, the number of steps needed for convergence depends on the task; I'd suggest trying at least 5,000 steps. All the OPT experiments in our paper used 20,000 steps.
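Concretely, by grid search I mean something as simple as the sketch below (the values and the `run_mezo_training` wrapper are purely illustrative, not part of our codebase):

```python
# Purely illustrative (LR, EPS) grid; run_mezo_training is a hypothetical
# wrapper around whatever training entry point you already have, returning a
# dev-set metric.
def run_mezo_training(lr: float, eps: float, steps: int) -> float:
    raise NotImplementedError  # plug in your own training/eval call here

results = {}
for lr in [1e-7, 5e-7, 1e-6, 5e-6]:
    for eps in [1e-3, 1e-2]:
        results[(lr, eps)] = run_mezo_training(lr=lr, eps=eps, steps=5000)

best_lr, best_eps = max(results, key=results.get)
print("best:", best_lr, best_eps, results[(best_lr, best_eps)])
```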
@gaotianyu1350 how many steps were used in the paper when not using MeZO?
I'm running a similar experiment with LLaMA-7B and having trouble getting the model to converge (I can share results later today). I'm really curious, though, how many more steps were needed to fine-tune OPT.
Thanks!
FYI, I have done some experiments on a custom dataset with OPT-125m and LoRA, and I observed the same problem of the loss not going down. I had to resort to really high learning rates, and for the first time I see the loss going down. It's probably too high now, since the loss doesn't go down smoothly but rather stepwise. I am using the following (a rough sketch of the setup follows the list):
- lr: 4e-2, cosine LR schedule
- zo_eps: 5e-3
- batch size: 32
- ~300k trainable parameters
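A rough reconstruction of that setup (illustrative only; the rank and target modules here are assumptions that land near ~300k trainable parameters, not my exact code):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

lora_config = LoraConfig(
    r=8,                                   # 2 * 768 * 8 params per adapted matrix
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # 2 modules x 12 layers -> ~295k params
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # on the order of 3e5 trainable parameters
```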
The loss reported by DeepSpeed:
```
{'loss': 16.2535, 'learning_rate': 0.03999983126569921, 'epoch': 0.0}
{'loss': 15.5859, 'learning_rate': 0.039998481408375613, 'epoch': 0.0}
{'loss': 15.8855, 'learning_rate': 0.03999578178483493, 'epoch': 0.0}
{'loss': 15.832, 'learning_rate': 0.039991732577284014, 'epoch': 0.01}
{'loss': 16.0207, 'learning_rate': 0.0399863340590178, 'epoch': 0.01}
{'loss': 16.0691, 'learning_rate': 0.03997958659440085, 'epoch': 0.01}
{'loss': 15.9879, 'learning_rate': 0.03997149063884271, 'epoch': 0.01}
{'loss': 15.766, 'learning_rate': 0.03996204673876726, 'epoch': 0.01}
{'loss': 14.9773, 'learning_rate': 0.03995125553157573, 'epoch': 0.01}
{'loss': 7.8742, 'learning_rate': 0.03993911774560379, 'epoch': 0.01}
{'loss': 8.3316, 'learning_rate': 0.039925634200072314, 'epoch': 0.01}
{'loss': 7.8898, 'learning_rate': 0.039910805805032104, 'epoch': 0.02}
{'loss': 8.0785, 'learning_rate': 0.039894633561302496, 'epoch': 0.02}
{'loss': 8.0207, 'learning_rate': 0.039877118560403775, 'epoch': 0.02}
{'loss': 7.818, 'learning_rate': 0.03985826198448353, 'epoch': 0.02}
{'loss': 8.1402, 'learning_rate': 0.039838065106236845, 'epoch': 0.02}
{'loss': 7.884, 'learning_rate': 0.039816529288820436, 'epoch': 0.02}
{'loss': 7.8727, 'learning_rate': 0.039793655985760595, 'epoch': 0.02}
{'loss': 7.9, 'learning_rate': 0.039769446740855134, 'epoch': 0.02}
{'loss': 7.6941, 'learning_rate': 0.03974390318806917, 'epoch': 0.03}
{'loss': 7.8531, 'learning_rate': 0.039717027051424825, 'epoch': 0.03}
{'loss': 7.8488, 'learning_rate': 0.03968882014488491, 'epoch': 0.03}
{'loss': 7.8055, 'learning_rate': 0.03965928437223045, 'epoch': 0.03}
{'loss': 7.8672, 'learning_rate': 0.03962842172693222, 'epoch': 0.03}
{'loss': 7.9863, 'learning_rate': 0.03959623429201618, 'epoch': 0.03}
{'loss': 7.8461, 'learning_rate': 0.03956272423992289, 'epoch': 0.03}
{'loss': 7.9762, 'learning_rate': 0.03952789383236089, 'epoch': 0.04}
```
So it will take some time until it converges, but the loss is going down, which is nice.
@nousr You can refer to our Appendix D for the number of steps used in each experiment.
@lramming Can you specify which dataset this is? Note that there are two key points to make MeZO work: (1) always use prompts, and (2) train longer. All our OPT experiments use 20K steps, though you should expect to see significant performance improvement within 5K steps.
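For (1), "always using prompts" means cloze-style templates so the loss is computed on natural-language label words rather than a classifier head. The snippet below is an illustrative SST-2-style example, not the exact templates (those are listed in the paper's appendix):

```python
# Illustrative SST-2-style cloze prompt; the exact templates and label words
# differ per task and are given in the paper's appendix.
def build_prompt(sentence: str) -> str:
    return f"{sentence} It was"

label_words = {"positive": " great", "negative": " terrible"}

# Training/eval then scores the label-word continuation:
print(build_prompt("A charming, heartfelt film.") + label_words["positive"])
```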
Unfortunately, I can't comment on the dataset. In the end it did not converge but remained at a loss of ~7, which is a lot higher than training with normal optimisers. However, I am fairly certain that this is more a problem of choosing the correct hyperparameters than an issue with the actual algorithm. I did find that it is highly dependent on the choice of zo_eps and the learning rate; in some experiments going a bit higher with zo_eps actually improved how the loss went down. I also tried normalising the gradient before updating the parameters, which improved convergence in some cases but ultimately failed: either the loss hovered around some high value without going down, or it experienced a problem and went to 0.
I also suspect that the current implementation does not work well with DeepSpeed; you can switch off the DeepSpeed optimiser by removing the "optimizer" section of the DeepSpeed config and specifying "zero_force_ds_cpu_optimizer": false.
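Roughly the config change I mean (a minimal sketch; apart from dropping the "optimizer" section and setting the `zero_force_ds_cpu_optimizer` flag, the other keys are just placeholders, not my exact config):

```python
# Minimal DeepSpeed config sketch: no "optimizer" block, and the forced CPU
# optimizer disabled. Other keys are placeholder values.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "zero_optimization": {"stage": 2},
    "zero_force_ds_cpu_optimizer": False,
    # intentionally no "optimizer" section here
}

# e.g. passed to the HF Trainer via TrainingArguments(deepspeed=ds_config, ...)
```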
Doing a proper sweep across the hyperparameters that matter for the given dataset is probably a good idea; maybe I'll have time to implement this in the future.
The same issue for me on LLaMA-7B: the loss was not decreasing. I used LR=1e-6 and EPS=1e-3 with 8,600 steps.
Hi, not sure if there is any update, but I recently realized I gave a wrong hyperparameter in the README. For example, OPT-13B + SST-2 should use LR=1e-7 / EPS=1e-3. So I would suggest trying more hyperparameter tuning (especially the LR).