Anandamoy Bandyopadhyay comments

Results: 5 comments of Anandamoy Bandyopadhyay

Training with Accelerator Fails. RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:7! (when checking argument for argument index in method wrapper__index_select)

I ran the following code (`bloom-accelerate-trainer-minimal.py`) on a setup of 8 A4500 GPUs with 20 GB of VRAM each: ``` import argparse import os from transformers import AdamW, get_linear_schedule_with_warmup from datasets import load_dataset...

Training with Accelerator Fails. RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:7! (when checking argument for argument index in method wrapper__index_select)

> @muellerzr the problem is in the forward though ;-) And it should work for training as long as there is no offload. There isn't CPU offload as far as...

Training with Accelerator Fails. RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:7! (when checking argument for argument index in method wrapper__index_select)

Did you mean `model.hf_device_map`? There is no attribute `_hf_device_map`. The output of `print_rank0(model.hf_device_map)` is simply `{'': 7}`, which is perhaps not correct. I set `device_map="balanced"` and `num_processes =...
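For readers unfamiliar with the device map format: a map whose only key is the empty string assigns the whole model to a single device, so `{'': 7}` means nothing was split across GPUs. A minimal sketch of that check follows, assuming the map is a plain dict of module name to device. The helper names `devices_used` and `is_sharded` and the module names in the second example are hypothetical, chosen for illustration only.

```python
def devices_used(device_map):
    """Return the distinct devices named in a module-to-device mapping."""
    # Devices may be ints (GPU indices) or strings such as "cpu";
    # compare them as strings so mixed types still sort cleanly.
    return sorted({str(device) for device in device_map.values()})


def is_sharded(device_map):
    """True when the mapping spreads modules over more than one device."""
    return len(devices_used(device_map)) > 1


# The map reported in the comment: the root module ('') sits on GPU 7.
reported = {"": 7}

# A hypothetical balanced map, with illustrative module names only.
balanced_example = {"embeddings": 0, "block.0": 0, "block.1": 1, "head": 1}

print(devices_used(reported), is_sharded(reported))
print(devices_used(balanced_example), is_sharded(balanced_example))
```

With the reported map, the check shows a single device, which matches the suspicion that the model was not balanced across the 8 GPUs.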

Training with Accelerator Fails. RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:7! (when checking argument for argument index in method wrapper__index_select)

> Oh the problem is quite clear then, the process only sees GPU 7. I think it all stems from the fact that you use `num_processes=2` in your accelerate config....

ValueError in confusion_matrix_tf2.py

@harshit-777 No, I abandoned this codebase long ago. Try scripting your confusion matrix code using the bbox coordinates from the model outputs.