DPO Trainer crashes on multi-GPU setup!
System Info
Kaggle Notebook with 2x T4 GPUs. Link to the Kaggle notebook: https://www.kaggle.com/code/augustmurr/dpo-issue-recreation. The issue does not occur when the model is loaded on a single GPU (for example "cuda:0"), but then the trainer only uses one GPU, which is very inefficient.
Who can help?
@muellerzr @pacman100
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Link to Kaggle notebook: https://www.kaggle.com/code/augustmurr/dpo-issue-recreation
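For reference, a minimal sketch of the kind of setup that triggers the crash. The model name, dataset, and hyperparameters are placeholders, and the exact `DPOTrainer` signature varies across trl versions; the Kaggle notebook above is the authoritative reproduction.

```python
# Hypothetical minimal reproduction; the Kaggle notebook is the real one.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "gpt2"  # placeholder; the notebook uses a different model

# device_map="auto" shards the layers across both T4s, so during the
# forward pass an activation produced on cuda:0 can meet weights on cuda:1.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Tiny in-memory preference dataset with the columns DPOTrainer expects.
dataset = Dataset.from_dict({
    "prompt": ["What is 2 + 2?"],
    "chosen": [" 4."],
    "rejected": [" 5."],
})

trainer = DPOTrainer(
    model=model,
    args=TrainingArguments(
        output_dir="dpo-out",
        per_device_train_batch_size=1,
        remove_unused_columns=False,
    ),
    train_dataset=dataset,
    tokenizer=tokenizer,  # newer trl versions take processing_class= instead
)
trainer.train()  # crashes with the RuntimeError below once both GPUs are used
```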
Expected behavior
Training should run across both GPUs. Instead, it crashes with:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
This looks to be an issue with how the weights are loaded with `device_map="auto"` rather than with the trainer, possibly to do with `_no_split_modules`.
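A quick way to check whether the model was actually sharded (a sketch; "gpt2" stands in for whatever model the notebook loads):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2", device_map="auto")  # placeholder model

# hf_device_map records where each submodule was placed. If the values mix
# 0 and 1, the model was split across both GPUs, which is exactly the
# situation that produces the "tensors on different devices" error.
print(model.hf_device_map)
```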
cc @younesbelkada @ArthurZucker
Hi @August-murr!
I agree with what @amyeroberts said: I think you are loading the model with `device_map="auto"`. To perform multi-GPU training correctly, please refer to this comment: https://github.com/huggingface/accelerate/issues/1840#issuecomment-1683105994, and let us know how it goes!
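For anyone landing here later, the gist of that linked comment, sketched with the same placeholder model as above: give each process its own full copy of the model on its own GPU, rather than sharding one copy across both, and let the trainer handle data-parallel training:

```python
from accelerate import PartialState
from transformers import AutoModelForCausalLM

# Each DDP process loads the entire model onto its own GPU
# (process 0 -> cuda:0, process 1 -> cuda:1), instead of letting
# device_map="auto" shard a single copy across both GPUs.
device_map = {"": PartialState().process_index}
model = AutoModelForCausalLM.from_pretrained("gpt2", device_map=device_map)  # placeholder model
```

Then launch the training script with something like `accelerate launch --num_processes 2 train_dpo.py` (script name hypothetical), so that one process is spawned per GPU.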
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.