Issue with Recreating RAFT Llama-7b Lora Benchmarks
Hey, our team is trying to recreate the HH-RLHF benchmarks from the RAFT paper (RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment) with Llama-7b. The SFT step went fine, but our reward-modeling accuracy is noticeably lower than reported: we're getting ~71% vs. the ~79% in the paper. Our training setup is roughly the same:
- Paper: 8x A100 (40GB)
- Our setup: 8x A6000 (48GB)
Also, even with a bit more VRAM, we can't fit a batch size of 32 (8x4) on our GPUs, because the reward-modeling step loads both the chosen and rejected sequences of each pair at the same time, effectively halving the batch size that fits in memory (see the sketch below the list). Is this correct? We followed the RAFT paper closely:
- Used linear lr schedule
- 0.00002 learning rate for SFT
- 0.00003 learning rate for RM
- Same LoRA config (r=16, alpha=32, dropout=0.1)
- Batch size 32 for SFT
- Since we can't fit batch size 32 for RM, we tried 16 and 8.
- 1 epoch for both steps
Is gradient accumulation used? The paper doesn't mention it, but we found we got better accuracy with two steps of gradient accumulation, essentially keeping the effective batch size at 32. Any help would be greatly appreciated!
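For reference, here's roughly what we mean, as a minimal sketch (not the actual repo code; the function and batch-key names are hypothetical):

```python
# Minimal sketch of a pairwise reward-model training step (not the RAFT code);
# it shows why each preference pair costs two forward passes worth of memory.
import torch.nn.functional as F

def reward_pair_loss(model, batch):
    # Both the chosen and the rejected sequence of every pair are in memory at
    # once, so a per-device batch of B pairs behaves like 2*B sequences.
    r_chosen = model(input_ids=batch["chosen_input_ids"],
                     attention_mask=batch["chosen_attention_mask"]).logits.squeeze(-1)
    r_rejected = model(input_ids=batch["rejected_input_ids"],
                       attention_mask=batch["rejected_attention_mask"]).logits.squeeze(-1)
    # Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Effective batch size = GPUs * per-device pairs * accumulation steps,
# e.g. 8 * 2 * 2 = 32 pairs with two gradient-accumulation steps.
```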
Thanks for your interest!
Do you start the reward modeling from LLaMA-SFT-7B? In our experiments, if we start reward modeling from the original llama-7b, we indeed get 71.64%. But if we start from llama-sft-7b, we get 79.52%. A similar observation holds for llama-13b (85.27% vs. ~80%).
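Concretely, the only difference is the checkpoint the reward model is initialized from; a minimal sketch, assuming the usual HF sequence-classification reward-model setup (paths are placeholders):

```python
from transformers import AutoModelForSequenceClassification

# Starting the RM from the original base model (~71% in our experiments):
# rm = AutoModelForSequenceClassification.from_pretrained("path/to/llama-7b", num_labels=1)

# Starting the RM from the SFT checkpoint (~79%):
rm = AutoModelForSequenceClassification.from_pretrained("path/to/llama-sft-7b", num_labels=1)
```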
Thanks so much for the quick reply! I'm pretty sure we did SFT the model before we did RM.
Our SFT train loss was about 1.539 and the test loss was about 1.577. Does that sound right?
Also here are our W&B runs for the SFT & RM.
Here's the SFT training log: https://wandb.ai/jirayu/huggingface/runs/8cck88n9/overview?workspace=user-jirayu and here's the RM training log: https://wandb.ai/jirayu/huggingface/runs/b2b54rrz/overview?workspace=user-jirayu
Also would it be possible to share the checkpoint files for the LLaMA-SFT-7B or the reward model?
I think one potential issue we recently noticed is that the evaluation batch size should be set to 1. A batch size > 1 leads to a much lower evaluation accuracy. We are currently not sure whether this is a bug in the HF Trainer or something else.
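For reference, the evaluation accuracy itself is just the fraction of pairs where the chosen reward beats the rejected one, roughly as in this sketch (a hypothetical compute_metrics, not the repo's exact code); so if batch size matters, it is presumably padding in the batched forward pass rather than the metric itself:

```python
import numpy as np

def compute_pairwise_accuracy(eval_pred):
    # Assumes predictions hold one scalar reward for the chosen and one for the
    # rejected response of every evaluation pair.
    rewards_chosen, rewards_rejected = eval_pred.predictions
    accuracy = float(np.mean(rewards_chosen > rewards_rejected))
    return {"accuracy": accuracy}
```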
Hi @WeiXiongUST! I'm answering this on behalf of Max. We already set the evaluation batch size per device to 1 for this 71% accuracy run.
@WeiXiongUST Any updates?
Do you use full training or LoRA training? Line 45 of examples/reward_modeling decides the mode of training. For full training, you may need to use a much smaller learning rate (say, 5e-6).
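Roughly, the switch looks like this (a hedged sketch, not the actual examples/reward_modeling code; the checkpoint path and the use_lora flag are placeholders):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

use_lora = True  # hypothetical flag standing in for the choice made on line 45

# Reward model initialized from the SFT checkpoint (placeholder path).
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/llama-sft-7b", num_labels=1)

if use_lora:
    # LoRA config matching the (r=16, alpha=32, dropout=0.1) setup above.
    model = get_peft_model(model, LoraConfig(
        task_type=TaskType.SEQ_CLS, r=16, lora_alpha=32, lora_dropout=0.1))
    learning_rate = 3e-5   # LoRA RM learning rate discussed above
else:
    learning_rate = 5e-6   # much smaller LR for full training
```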
Thanks so much for following up @WeiXiongUST, we used LoRA for all the steps (SFT and RM).
Also, is LoRA used during the SFT training?
In our experiments, we use full training for SFT, and LoRA training for the reward model on LLaMA-7B.
We recently tried out open-llama-3b: full training for SFT over 2 epochs, block size 1024, learning rate 2e-5; then RM for 1 epoch, full training with learning rate 5e-6, which leads to 76% validation accuracy. However, in an earlier experiment, if we use lr = 2e-5 in RM with full training, the accuracy is <70%.
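Written as standard HF TrainingArguments, the recipe above is roughly (a sketch only; output paths are placeholders, and the block size of 1024 is applied at tokenization time rather than here):

```python
from transformers import TrainingArguments

# Full-training SFT on open-llama-3b: 2 epochs, lr 2e-5.
sft_args = TrainingArguments(
    output_dir="sft_open_llama_3b",   # placeholder
    num_train_epochs=2,
    learning_rate=2e-5,
)

# Full-training RM for 1 epoch: lr 5e-6 gave 76% accuracy, while 2e-5 gave <70%.
rm_args = TrainingArguments(
    output_dir="rm_open_llama_3b",    # placeholder
    num_train_epochs=1,
    learning_rate=5e-6,
)
```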
I guess it is because the learning rate in your SFT stage may not be suitable.
Thanks so much, we'll look into this!
Hi @WeiXiongUST! We went back and trained the SFT model with full training and used it to train the reward model with LoRA, both following the hyperparameters in the RAFT paper (pg. 14). We still achieved only 72% final eval accuracy. We would appreciate it if you could take a look at the loss/performance curves in both of our runs to see if anything looks suspicious. We just started training the RM with full training, but we would still love to understand why our RM can't reach 78% accuracy with the same parameters and LoRA setup described in the paper. Thanks!!
Training log for SFT: https://wandb.ai/jirayu/huggingface/runs/a2tbxhkw Training log for RM: https://wandb.ai/jirayu/huggingface/runs/0n6327rw
To make sure, are you using 512 or 1024 for block size when training SFT and RM on Llama-7b with HH-RLHF?
It seems that your evaluation loss is higher than in our results. For instance, for the open-llama-3b experiment, we can achieve an evaluation loss of ~0.49. We also use block size 1024 for llama-7b.
We just uploaded an RM based on open-llama-3b, which achieves an evaluation accuracy of 75.48%; you may check it out.
https://huggingface.co/weqweasdas/hh_rlhf_rm_open_llama_3b/blob/main/README.md
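A minimal sketch of scoring a response with it, assuming it follows the standard single-logit sequence-classification format (see the README above for the exact prompt format):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "weqweasdas/hh_rlhf_rm_open_llama_3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Score a placeholder dialogue; a higher scalar output means a more preferred response.
text = "Human: How do I bake bread? Assistant: Start with flour, water, yeast and salt..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0].item()
print(reward)
```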