ChenDRAG

Results: 18 comments of ChenDRAG

Same problem here!

It seems that full fine-tuning has this problem, while LoRA doesn't. Could you share the YAML training configuration? Also, how many GPUs are you using? ![image](https://github.com/huggingface/alignment-handbook/assets/40993476/184babce-d75c-420c-808d-6ced6cbb765b)

Sorry, I did not encounter this problem. Are you using the official binarized dataset? What is your base model? Though I don't think either matters that much.

8 A40 cards. My new experiments also encounter this problem. ![image](https://github.com/huggingface/alignment-handbook/assets/40993476/46b95d22-4919-49a8-80d0-8d6befb6ad77) Difference between the two configurations: the previous run had batch size 4, gradient accumulation 2, 8 cards, lr 1e-7; the new run has batch size 8...
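A quick sanity check here is the effective batch size (per-device batch * gradient accumulation * number of cards). A minimal sketch; the new run's accumulation and card count are cut off above, so they are left for you to fill in:

```python
# Effective batch size = per-device batch * gradient accumulation steps * number of GPUs.
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    return per_device * grad_accum * num_gpus

# Previous run: batch size 4, accumulation 2, 8 cards.
print(effective_batch_size(per_device=4, grad_accum=2, num_gpus=8))  # 64

# New run: batch size 8; plug in the accumulation and card count from the new
# config (truncated above) to check whether the effective batch size still matches.
```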

@alvarobartt Thanks a lot for your kind help! However, in `scripts`, the instructions to reproduce the experiments are:

```
# Full training with ZeRO-3 on 8 GPUs
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file...
```

P.S. I tried `CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml` and it still reports an OOM error on 8 x 46 GB cards.

> DeepSpeed ZeRO-3 will shard the model over several GPUs; this should resolve the OOM issues you see. Note we tested on A100 80GB GPUs, so you may need to...
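For anyone wondering why plain DDP (`multi_gpu.yaml`) runs out of memory here, a rough back-of-the-envelope sketch, assuming a 7B policy plus a frozen reference model for DPO, bf16 weights and gradients, and fp32 AdamW moments and master weights (activations excluded, so the real number is even higher):

```python
# Rough per-GPU memory for full-parameter DPO of a 7B model without any sharding.
params = 7e9
gib = 1024 ** 3

policy_weights    = params * 2   # bf16 policy weights
reference_weights = params * 2   # frozen bf16 reference model kept for the DPO loss
gradients         = params * 2   # bf16 gradients for the policy
adam_moments      = params * 8   # two fp32 AdamW moments per parameter
master_weights    = params * 4   # fp32 master copy of the policy weights

total = policy_weights + reference_weights + gradients + adam_moments + master_weights
print(f"~{total / gib:.0f} GiB per GPU without sharding")  # roughly 117 GiB

# ZeRO-3 shards the weights, gradients, and optimizer states across the 8 GPUs,
# which is what brings the per-GPU footprint back under the ~46 GB available here.
```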

@LiCHH @ma-xu @Kumbong It seems the generation seed is replicated for each class: `recon_B3HW = var.autoregressive_infer_cfg(B=B, label_B=label_B, cfg=cfg, top_k=900, top_p=0.96, g_seed=seed, more_smooth=more_smooth)`. `g_seed` should be different. I...
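A minimal sketch of the fix I have in mind, reusing the `var`, `cfg`, and `more_smooth` objects from the call above (`class_labels` and `base_seed` are placeholders I'm introducing here):

```python
# Give each class its own generation seed instead of reusing one g_seed,
# so the sampled images are not identical across classes.
base_seed = 0
recons = []
for i, label_B in enumerate(class_labels):  # class_labels: one label tensor per class
    recon_B3HW = var.autoregressive_infer_cfg(
        B=label_B.shape[0], label_B=label_B, cfg=cfg,
        top_k=900, top_p=0.96,
        g_seed=base_seed + i,               # a different seed per class
        more_smooth=more_smooth,
    )
    recons.append(recon_B3HW)
```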