Not converging on custom dataset.
Hi, glad to see the impressive project.
I overrode trainer.py according to the README and training runs properly. However, the model doesn't seem to converge: the loss stays around 0.7 no matter how many epochs it trains for. BTW, I use LLaMA, not OPT.
As a comparison, if I train the same dataset with full fine-tuning, everything works fine and the loss drops below 0.1 almost immediately.
So is there some constraint that may cause training to fail?
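For context, the zeroth-order step I adapted from the paper looks roughly like this (a simplified sketch, not the exact MeZO trainer code; `zo_step`, `loss_fn`, and the default values are just illustrative):

```python
import torch

def zo_step(model, loss_fn, batch, lr=1e-6, eps=1e-3):
    # One simplified MeZO-style zeroth-order step (SPSA with a shared random seed).
    # Sketch only: the real trainer regenerates z from the stored seed instead of
    # keeping perturbations in memory, and also handles schedulers, weight decay, etc.
    params = [p for p in model.parameters() if p.requires_grad]
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        torch.manual_seed(seed)              # regenerate the same z each time
        for p in params:
            z = torch.randn_like(p)
            p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1)                          # theta + eps * z
        loss_plus = loss_fn(model, batch)
        perturb(-2)                          # theta - eps * z
        loss_minus = loss_fn(model, batch)
        perturb(+1)                          # restore theta

        projected_grad = (loss_plus - loss_minus) / (2 * eps)

        torch.manual_seed(seed)              # same z again for the update
        for p in params:
            z = torch.randn_like(p)
            p.data.add_(-lr * projected_grad * z)

    return loss_plus
```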
Hi,
Thanks for your interest in our project! Can you specify the setting (task, model size, and hyperparameters like steps, learning rate, eps, etc.)? Also, did you use our codebase and replace OPT with LLaMA, or did you copy the MeZO part into another codebase?
Thanks for the reply. I integrated your trainer into my own codebase (LLaMA-7B). lr: 2e-5, steps: 3000, eps: 1e-3.
Hi,
I believe the learning rate is too large. I suggest starting from LR=1e-6, EPS=1e-3, and tuning the hyperparameters with grid search. Also, the number of steps needed for convergence depends on the task; I'd suggest trying at least 5,000 steps. All the OPT experiments in our paper used 20,000 steps.
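Concretely, by grid search I mean something as simple as the sketch below (the values and the `run_mezo_training` wrapper are purely illustrative, not part of our codebase):

```python
# Purely illustrative (LR, EPS) grid; run_mezo_training is a hypothetical
# wrapper around whatever training entry point you already have, returning a
# dev-set metric.
def run_mezo_training(lr: float, eps: float, steps: int) -> float:
    raise NotImplementedError  # plug in your own training/eval call here

results = {}
for lr in [1e-7, 5e-7, 1e-6, 5e-6]:
    for eps in [1e-3, 1e-2]:
        results[(lr, eps)] = run_mezo_training(lr=lr, eps=eps, steps=5000)

best_lr, best_eps = max(results, key=results.get)
print("best:", best_lr, best_eps, results[(best_lr, best_eps)])
```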
@gaotianyu1350 how many steps were used in the paper when not using MeZO?
I'm running a similar experiment with LLaMA-7B and having trouble getting the model to converge (I can share results later today). I'm really curious, though, how many more steps were needed to fine-tune OPT.
Thanks!
FYI, I have done some experiments on a custom dataset with OPT-125m and LoRA, and I observed the same problem of the loss not going down. I had to resort to really high learning rates, and for the first time I see the loss going down. It's probably too high now, since the loss doesn't go down smoothly but rather stepwise. I am using the following (a rough sketch of the setup follows the list):
- lr: 4e-2, cosine LR schedule
- zo_eps: 5e-3
- batch size: 32
- ~300k trainable parameters
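A rough reconstruction of that setup (illustrative only; the rank and target modules here are assumptions that land near ~300k trainable parameters, not my exact code):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

lora_config = LoraConfig(
    r=8,                                   # 2 * 768 * 8 params per adapted matrix
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # 2 modules x 12 layers -> ~295k params
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # on the order of 3e5 trainable parameters
```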
The loss reported by DeepSpeed:
```
{'loss': 16.2535, 'learning_rate': 0.03999983126569921, 'epoch': 0.0}
{'loss': 15.5859, 'learning_rate': 0.039998481408375613, 'epoch': 0.0}
{'loss': 15.8855, 'learning_rate': 0.03999578178483493, 'epoch': 0.0}
{'loss': 15.832, 'learning_rate': 0.039991732577284014, 'epoch': 0.01}
{'loss': 16.0207, 'learning_rate': 0.0399863340590178, 'epoch': 0.01}
{'loss': 16.0691, 'learning_rate': 0.03997958659440085, 'epoch': 0.01}
{'loss': 15.9879, 'learning_rate': 0.03997149063884271, 'epoch': 0.01}
{'loss': 15.766, 'learning_rate': 0.03996204673876726, 'epoch': 0.01}
{'loss': 14.9773, 'learning_rate': 0.03995125553157573, 'epoch': 0.01}
{'loss': 7.8742, 'learning_rate': 0.03993911774560379, 'epoch': 0.01}
{'loss': 8.3316, 'learning_rate': 0.039925634200072314, 'epoch': 0.01}
{'loss': 7.8898, 'learning_rate': 0.039910805805032104, 'epoch': 0.02}
{'loss': 8.0785, 'learning_rate': 0.039894633561302496, 'epoch': 0.02}
{'loss': 8.0207, 'learning_rate': 0.039877118560403775, 'epoch': 0.02}
{'loss': 7.818, 'learning_rate': 0.03985826198448353, 'epoch': 0.02}
{'loss': 8.1402, 'learning_rate': 0.039838065106236845, 'epoch': 0.02}
{'loss': 7.884, 'learning_rate': 0.039816529288820436, 'epoch': 0.02}
{'loss': 7.8727, 'learning_rate': 0.039793655985760595, 'epoch': 0.02}
{'loss': 7.9, 'learning_rate': 0.039769446740855134, 'epoch': 0.02}
{'loss': 7.6941, 'learning_rate': 0.03974390318806917, 'epoch': 0.03}
{'loss': 7.8531, 'learning_rate': 0.039717027051424825, 'epoch': 0.03}
{'loss': 7.8488, 'learning_rate': 0.03968882014488491, 'epoch': 0.03}
{'loss': 7.8055, 'learning_rate': 0.03965928437223045, 'epoch': 0.03}
{'loss': 7.8672, 'learning_rate': 0.03962842172693222, 'epoch': 0.03}
{'loss': 7.9863, 'learning_rate': 0.03959623429201618, 'epoch': 0.03}
{'loss': 7.8461, 'learning_rate': 0.03956272423992289, 'epoch': 0.03}
{'loss': 7.9762, 'learning_rate': 0.03952789383236089, 'epoch': 0.04}
```
So it will take some time until it converges, but the loss is going down, which is nice.
@nousr You can refer to our Appendix D for the number of steps used in each experiment.
@lramming Can you specify which dataset this is? Note that there are two key points to make MeZO work: (1) always use prompts, and (2) train longer. All our OPT experiments use 20K steps, though you should expect to see significant performance improvement within 5K steps.
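For (1), "always using prompts" means cloze-style templates so the loss is computed on natural-language label words rather than a classifier head. The snippet below is an illustrative SST-2-style example, not the exact templates (those are listed in the paper's appendix):

```python
# Illustrative SST-2-style cloze prompt; the exact templates and label words
# differ per task and are given in the paper's appendix.
def build_prompt(sentence: str) -> str:
    return f"{sentence} It was"

label_words = {"positive": " great", "negative": " terrible"}

# Training/eval then scores the label-word continuation:
print(build_prompt("A charming, heartfelt film.") + label_words["positive"])
```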
Unfortunately, I can't comment on the dataset. In the end it did not converge but remained at a loss of ~7, which is a lot higher than training with normal optimisers. However, I am fairly certain that this is more a problem of choosing the correct hyperparameters than an issue with the actual algorithm. I did find that it is highly dependent on the choice of zo_eps and the learning rate; in some experiments going a bit higher with zo_eps actually improved how the loss went down. I also tried normalising the gradient before updating the parameters, which improved convergence in some cases but ultimately failed: either the loss hovered around some high value without going down, or it experienced a problem and went to 0.
I also suspect that the current implementation does not work well with DeepSpeed; you can switch off the DeepSpeed optimiser by removing the "optimizer" section of the DeepSpeed config and specifying "zero_force_ds_cpu_optimizer": false.
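Roughly the config change I mean (a minimal sketch; apart from dropping the "optimizer" section and setting the `zero_force_ds_cpu_optimizer` flag, the other keys are just placeholders, not my exact config):

```python
# Minimal DeepSpeed config sketch: no "optimizer" block, and the forced CPU
# optimizer disabled. Other keys are placeholder values.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "zero_optimization": {"stage": 2},
    "zero_force_ds_cpu_optimizer": False,
    # intentionally no "optimizer" section here
}

# e.g. passed to the HF Trainer via TrainingArguments(deepspeed=ds_config, ...)
```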
Doing a proper sweep across the hyperparameters that matter for the given dataset is probably a good idea; maybe I'll have time to implement this in the future.
The same issue for me on LLaMA-7B: the loss was not decreasing. I used LR=1e-6 and EPS=1e-3 with 8,600 steps.
Hi, not sure if there is any update, but I recently realized I gave a wrong hyperparameter in the README. For example, OPT-13B + SST-2 should use LR=1e-7 / EPS=1e-3. So I would suggest trying more hyperparameter tuning (especially the LR).