Sylvain Gugger
I can't reproduce on my side as the prompt yielded in the batches makes Accelerate fail: since this is an iterable dataset, `dispatch_batches` is activated by default and the...
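For context on the default mentioned above, dispatching can be toggled explicitly. A minimal sketch, assuming the `dispatch_batches` argument of `Accelerator` available in Accelerate versions of that era (the `MyStream` dataset is a made-up placeholder):

```python
from accelerate import Accelerator
from torch.utils.data import DataLoader, IterableDataset

# For an IterableDataset, Accelerate turns dispatch_batches on by default:
# batches are built on the main process and broadcast to the others.
# Passing dispatch_batches=False makes each process iterate its own copy.
accelerator = Accelerator(dispatch_batches=False)

class MyStream(IterableDataset):  # placeholder streaming dataset
    def __iter__(self):
        yield from range(8)

loader = accelerator.prepare(DataLoader(MyStream(), batch_size=2))
```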
Could you share a minimal sample of code reproducing the error please?
@muellerzr the problem is in the forward though ;-) And it should work for training as long as there is no offload.
@ananda1996ai First note that you cannot use data parallel in conjunction with model parallelism, so `num_processes` in your config needs to be 1. I cannot reproduce the error, could you...
Oh the problem is quite clear then, the process only sees GPU 7. I think it all stems from the fact that you use `num_processes=2` in your accelerate config.
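As an illustration of the fix suggested above, the relevant fields in the YAML written by `accelerate config` (typically under `~/.cache/huggingface/accelerate/`) would look roughly like this; the surrounding fields are assumptions, only `num_processes` is the point:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: NO   # no data parallelism when splitting one model across GPUs
num_processes: 1       # model parallelism needs a single process
```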
Yes it does. Since you're not describing the problem you encountered, I'm not sure how we can help. You still need to have enough RAM to load the checkpoint shards...
Yes but it will be very slow unless you have a very fast hard drive. You will also need to limit the RAM used by the first models (since Accelerate...
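A minimal sketch of limiting the memory used by the first model, via the `max_memory` and `offload_folder` arguments of `from_pretrained`; the checkpoint name and the memory caps here are placeholder assumptions, not values from the thread:

```python
from transformers import AutoModelForCausalLM

# Cap how much memory this model may claim on each device so later models
# still fit; weights beyond the caps are offloaded to disk (slow on HDDs).
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-large-model",              # placeholder checkpoint
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "20GiB"},  # per-device caps (assumed values)
    offload_folder="offload",                 # directory for disk offload
)
```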
That's a problem between PyTorch and xformers. You should report the issue on their repos :-)