direct-preference-optimization

llama7B issue

Open JiuhaiChen opened this issue 2 years ago • 17 comments

Hi, I am trying to run the SFT step using 4 A100 80GB GPUs, and it fails with:

starting 4 processes for FSDP training
setting RLIMIT_NOFILE soft limit to 1048576 from 1048576
/opt/conda/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Bus error

Even if I use batch_size=1, the issue is still there. Also, do you have a finished model trained on Anthropic HH with LLaMA-7B?

JiuhaiChen avatar Sep 25 '23 17:09 JiuhaiChen

I saw the same error even when running the demo script for the Pythia-2.8B model, using 8 A100 40GB GPUs on Google Cloud:

python -u train.py model=pythia28 datasets=[hh] loss=sft exp_name=anthropic_dpo_pythia28 gradient_accumulation_steps=2 batch_size=64 eval_batch_size=32 trainer=FSDPTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16

What I got is

building policy
starting 8 processes for FSDP training
setting RLIMIT_NOFILE soft limit to 1048576 from 1048576
/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Bus error (core dumped)

The warning itself seems negligible, as it is just a multiprocessing hiccup, but the core dump is a serious problem. This codebase looks pretty clean and I like it a lot. Could you please check what is going on, so users can gain confidence in its robustness?

Emerald01 avatar Sep 30 '23 05:09 Emerald01

@Emerald01 have you figured out the problem?

JiuhaiChen avatar Oct 03 '23 19:10 JiuhaiChen

It's not an exact replication of the FSDP version, but I recently reimplemented DPO with QLoRA on the LLaMA-7B model. The model is already available on Hugging Face: https://huggingface.co/abaheti95/dpo_qlora_hh. Here is the corresponding implementation: https://github.com/abaheti95/LoL-RL/blob/main/dpo_qlora_llama_hh.py
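For reference, the core DPO objective is the same regardless of whether the backbone is trained with FSDP (as in this repo) or with QLoRA. A minimal sketch of the loss, roughly following the paper's notation (tensor names are illustrative placeholders):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument is a 1-D tensor of summed per-sequence log-probs."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    logits = chosen_logratios - rejected_logratios
    # DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))
    losses = -F.logsigmoid(beta * logits)
    # "Implicit rewards" that back the chosen/rejected/margin metrics
    chosen_rewards = beta * chosen_logratios.detach()
    rejected_rewards = beta * rejected_logratios.detach()
    return losses.mean(), chosen_rewards, rejected_rewards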

abaheti95 avatar Oct 04 '23 16:10 abaheti95

Looks good, thanks! But I wonder about the performance gap between the FSDP (full fine-tuning) version and QLoRA?

JiuhaiChen avatar Oct 04 '23 17:10 JiuhaiChen

I'm not sure either. I would also love to know whether there is a big difference. Let me know if you notice anything.

abaheti95 avatar Oct 04 '23 19:10 abaheti95

@abaheti95 Thank you for your work. I think Hugging Face has published a DPO trainer that supports QLoRA: https://github.com/huggingface/trl/blob/main/trl/trainer/dpo_trainer.py

By any chance, could you comment on any obvious differences? If you know their implementation and yours has particular optimizations, improvements, or bug fixes, I would like to try it.

Another question: I think QLoRA mostly compresses the base model from its original 32 bits down to 4 bits, but it uses DDP, so each GPU still holds the entire model. FSDP, on the other hand, shards a single model across multiple devices. If I understand correctly, with 8 GPUs the per-GPU memory reduction is roughly the same, about 1/8, whether it comes from direct bit quantization or from sharding the model; FSDP might suffer more from communication cost among devices, though. My question is: is it possible to combine both, i.e. QLoRA + FSDP, so we get 4- or 8-bit quantization while also sharding the model across devices? With 8 GPUs that would be roughly 8 × 8 = 64x, i.e. about 1/64 of the original weight memory per GPU.
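To make the memory intuition concrete, here is the rough weight-only arithmetic (activations, gradients, and optimizer state are ignored, and in practice they matter a lot; these numbers cover the weights alone):

# Back-of-the-envelope weight-memory numbers behind the 1/8 vs 1/64 intuition.
params = 7e9                      # approx. LLaMA-7B parameter count
full_fp32 = params * 4            # 32-bit weights, one full copy
per_gpu_fsdp = full_fp32 / 8      # FSDP shards the fp32 weights across 8 GPUs
four_bit = params * 0.5           # 4-bit quantized weights, replicated per GPU under DDP
four_bit_sharded = four_bit / 8   # hypothetical 4-bit + FSDP combination

for name, nbytes in [("fp32, single GPU", full_fp32), ("fp32 + FSDP over 8", per_gpu_fsdp),
                     ("4-bit + DDP", four_bit), ("4-bit + FSDP over 8", four_bit_sharded)]:
    print(f"{name:>20}: {nbytes / 2**30:5.1f} GiB of weights per GPU")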

Emerald01 avatar Oct 04 '23 20:10 Emerald01

I tried their trainer first but noticed very slow training. I asked the TRL team about it and they gave some pointers on how to speed it up: https://github.com/huggingface/trl/issues/729. By that time I already had my custom training loop ready, so I just went forward with it. I could finish 1 epoch of HH-RLHF in about 1.5 days using two 48GB GPUs. In my experiments I was able to keep two 4-bit models (reference and target) on the two GPUs simultaneously without any memory errors.
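Concretely, keeping the two 4-bit models on separate devices looks roughly like this (a sketch using transformers' 4-bit loading via bitsandbytes, not the exact code from dpo_qlora_llama_hh.py; the model name is just illustrative):

# Sketch: load the trainable policy and the frozen reference as 4-bit models,
# one per GPU. Assumes a transformers version with load_in_4bit support and
# bitsandbytes installed.
from transformers import AutoModelForCausalLM

model_name = "huggyllama/llama-7b"
policy = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, device_map={"": 0})
reference = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, device_map={"": 1})
reference.eval()
for p in reference.parameters():
    p.requires_grad_(False)   # the reference model stays frozen in DPO
# (LoRA adapters would then be attached to `policy` for training.)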

Personally, I'm not very familiar with DeepSpeed and FSDP sharding and multi-GPU training. I tried DeepSpeed earlier but struggled too much to make it work with a custom training loop.

abaheti95 avatar Oct 04 '23 23:10 abaheti95

Awesome work, I will check this out to see how it works soon! BTW, besides the obvious speed issues, I also noticed there are discussions about how well DPO converges: in some cases the reward for the chosen responses also decreases, or the margin between chosen and rejected stays small (https://github.com/huggingface/trl/issues/800). I am not sure whether you observed similar patterns in your code. Not having run the DPO algorithm myself yet, I can't tell whether this is an issue with the TRL codebase or a general problem with DPO.

Emerald01 avatar Oct 06 '23 19:10 Emerald01

In my DPO training attempt I also saw that the margin was increasing while both the chosen and rejected rewards decreased.
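That pattern is actually consistent with how these quantities are defined: the logged "rewards" are beta times the log-prob ratio against the reference model, so both can drift negative while the margin (which is what the loss actually pushes on) keeps growing. A tiny illustration with made-up numbers:

# Made-up numbers: both implicit rewards decrease, yet the margin increases.
beta = 0.1
chosen_logratios = [0.0, -2.0, -5.0]     # log pi_theta - log pi_ref, chosen
rejected_logratios = [0.0, -6.0, -15.0]  # log pi_theta - log pi_ref, rejected

for c, r in zip(chosen_logratios, rejected_logratios):
    chosen_reward, rejected_reward = beta * c, beta * r
    print(f"chosen={chosen_reward:+.2f}  rejected={rejected_reward:+.2f}  "
          f"margin={chosen_reward - rejected_reward:+.2f}")
# chosen: +0.00 -> -0.20 -> -0.50, rejected: +0.00 -> -0.60 -> -1.50,
# margin: +0.00 -> +0.40 -> +1.00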

abaheti95 avatar Oct 06 '23 19:10 abaheti95

In my training attempt (full-model fine-tuning with LLaMA on the HH dataset), the training loss is not going down in the SFT stage, and in DPO all outputs are NaN.

JiuhaiChen avatar Oct 06 '23 20:10 JiuhaiChen

In my training attempt (full-model fine-tuning with LLaMA on the HH dataset), the training loss is not going down in the SFT stage, and in DPO all outputs are NaN.

Same problem here.

hzy312 avatar Oct 29 '23 11:10 hzy312

Hi @JiuhaiChen, I also ran into this issue. Did you solve it?

chchenhui avatar Apr 11 '24 14:04 chchenhui

@JiuhaiChen Have you solved the bus error problem?

Yanfors avatar May 15 '24 06:05 Yanfors

@Emerald01

I have the same issue; has the core dump problem still not been solved?

ro-ko avatar Jun 25 '24 02:06 ro-ko

★---> train stats after 160512 examples: {'rewards_train/chosen': 'nan', 'rewards_train/rejected': 'nan', 'rewards_train/accuracies': '0', 'rewards_train/margins': 'nan', 'logps_train/rejected': 'nan', 'logps_train/chosen': 'nan', 'loss/train': 'nan', 'examples_per_second': '5.3122', 'grad_norm': 'nan', 'counters/examples': 160512, 'counters/updates': 5016}
★---> train stats after 160544 examples: {'rewards_train/chosen': 'nan', 'rewards_train/rejected': 'nan', 'rewards_train/accuracies': '0', 'rewards_train/margins': 'nan', 'logps_train/rejected': 'nan', 'logps_train/chosen': 'nan', 'loss/train': 'nan', 'examples_per_second': '5.3297', 'grad_norm': 'nan', 'counters/examples': 160544, 'counters/updates': 5017}
[... the same all-NaN train stats repeat every 32 examples up to 160800 examples / 5025 updates ...]
★---> train stats after 160800 examples: {'rewards_train/chosen': 'nan', 'rewards_train/rejected': 'nan', 'rewards_train/accuracies': '0', 'rewards_train/margins': 'nan', 'logps_train/rejected': 'nan', 'logps_train/chosen': 'nan', 'loss/train': 'nan', 'examples_per_second': '5.4887', 'grad_norm': 'nan', 'counters/examples': 160800, 'counters/updates': 5025}
Finished generating 1 epochs on train split
writing checkpoint to .cache/root/anthropic_dpo_pythia69_2024-09-22_09-38-57_157738/LATEST/policy.pt...
[rank0]:[2024-09-22 18:41:34,834] torch.distributed.fsdp._debug_utils: [WARNING] FSDP _optim_state_dict() profiling: defaultdict(<class 'float'>, {'preprocessing': 0.012136668432503939, 'preprocessing_with_comm': 0.042172754649072886, 'state_converting': 13.85691294586286, <Type.ALL: 'all'>: 13.912817124743015})
writing checkpoint to .cache/root/anthropic_dpo_pythia69_2024-09-22_09-38-57_157738/LATEST/optimizer.pt...
writing checkpoint to .cache/root/anthropic_dpo_pythia69_2024-09-22_09-38-57_157738/LATEST/scheduler.pt...

This is the output: the losses are just NaN. What is wrong here?
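For reference, one generic way to narrow this down is to find where the NaNs first appear; something like this works with plain PyTorch and is not specific to this codebase (all names in the commented calls are placeholders):

import torch

# Slower, but makes autograd raise inside the backward op that first produces NaN/Inf.
torch.autograd.set_detect_anomaly(True)

def check_finite(name, tensor):
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"non-finite values detected in {name}")

# Inside the training step (placeholder names):
# check_finite("policy_logits", policy_logits)
# check_finite("reference_logps", reference_logps)
# loss.backward()
# for n, p in policy.named_parameters():
#     if p.grad is not None:
#         check_finite(f"grad/{n}", p.grad)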

Alan-D-Chen avatar Sep 23 '24 02:09 Alan-D-Chen