Accelerate not working when setting a subset of GPUs as visible CUDA devices

Open MrRobot2211 opened this issue 1 year ago • 5 comments

System Info

/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.27.0
- Platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
- Python version: 3.11.6
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.0 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 125.63 GB
- GPU type: NVIDIA GeForce RTX 4090
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: MULTI_GPU
	- mixed_precision: no
	- use_cpu: False
	- debug: False
	- num_processes: 2
	- machine_rank: 0
	- num_machines: 1
	- rdzv_backend: static
	- same_network: False
	- main_training_function: main
	- downcast_bf16: False
	- tpu_use_cluster: False
	- tpu_use_sudo: False

I have one RTX 3090 and two RTX 4090 GPUs.

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

I have this training function:

def train_ddp_accelerate(CFG, fold_id, train, output_path):
    accelerator = Accelerator(split_batches=True, mixed_precision='fp16')
    # accelerator = Accelerator(mixed_precision='fp16')
    set_seed(CFG.seed)

    device = accelerator.device  # torch.device(CFG.device)

    train_path_label, val_path_label, _, _ = get_path_label(fold_id, train_all)
    train_transform, val_transform = get_transforms(CFG)
    
    train_dataset = HMSHBACSpecDataset(**train_path_label, transform=train_transform)
    val_dataset = HMSHBACSpecDataset(**val_path_label, transform=val_transform)
    
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=CFG.batch_size,pin_memory=True, num_workers=4, shuffle=True, drop_last=True)
    val_loader = torch.utils.data.DataLoader(
        val_dataset, batch_size=CFG.batch_size,pin_memory=True, num_workers=4, shuffle=False, drop_last=False)
    
    model = HMSHBACSpecModel(
        model_name=CFG.model_name, pretrained=True, num_classes=6, in_channels=1)
    # model = torch.nn.parallel.DataParallel(model, device_ids=[0, 1, 2])

    optimizer = optim.AdamW(params=model.parameters(), lr=CFG.lr, weight_decay=CFG.weight_decay)
    scheduler = lr_scheduler.OneCycleLR(
        optimizer=optimizer, epochs=CFG.max_epoch,
        pct_start=0.0, steps_per_epoch=len(train_loader),
        max_lr=CFG.lr, div_factor=25, final_div_factor=4.0e-01
    )

    loss_func = KLDivLossWithLogits()
    loss_func.to(device)
    # loss_func = torch.nn.parallel.DataParallel(loss_func, device_ids=[0, 1, 2])
    loss_func_val = KLDivLossWithLogits()
    loss_func_val.to(device)
    # loss_func_val = torch.nn.parallel.DataParallel(loss_func_val, device_ids=[0, 1, 2])

    # Send everything through `accelerator.prepare`
    train_loader, val_loader, model, optimizer,scheduler = accelerator.prepare(
        train_loader, val_loader, model, optimizer,scheduler
    )

    best_val_loss = 1.0e+09
    best_epoch = 0
    train_loss = 0
    # Training loop
    for epoch in range(1, CFG.max_epoch + 1):
        epoch_start = time()
    
        model.train()
        for batch in train_loader:
            #batch = to_device(batch, device)
            x, t = batch["data"], batch["target"]
                
            optimizer.zero_grad()
            with accelerator.autocast():
                y = model(x)
                loss = loss_func(y, t)
            accelerator.backward(loss)
            optimizer.step()
            if not accelerator.optimizer_step_was_skipped:
                scheduler.step()
            train_loss += loss.detach()
            
        train_loss /= len(train_loader)

        # Evaluate
        model.eval()
        correct = 0
        val_loss=0
        with torch.no_grad():
            for batch in val_loader:
                x, t = batch["data"], batch["target"]
                # x = to_device(x, device)
                y = model(x)
                val_loss += loss_func_val(y, t).detach()

        val_loss /= len(val_loader)

        accelerator.wait_for_everyone()
        total_val_loss = accelerator.reduce(val_loss).cpu()
        total_train_loss = accelerator.reduce(train_loss).cpu()
        if val_loss < best_val_loss:
            best_epoch = epoch
            best_val_loss = val_loss
            # print("save model")
            if accelerator.is_main_process:
                accelerator.save_model(model, str(output_path) + f'snapshot_epoch_{epoch}')

        # reduced_tensor = accelerator.reduce(process_tensor, reduction="sum")

        elapsed_time = time() - epoch_start
        accelerator.wait_for_everyone()
        if accelerator.is_main_process:
            print(
                f"[epoch {epoch}] train loss: {total_train_loss: .6f}, val loss: {total_val_loss: .6f}, elapsed_time: {elapsed_time: .3f}")
        
        accelerator.wait_for_everyone()    
        if epoch - best_epoch > CFG.es_patience:
            if accelerator.is_main_process:
                print("Early Stopping!")
            accelerator.wait_for_everyone()  
            break
            
        train_loss = 0

    # print(f'Accuracy: {100. * correct / len(val_loader.dataset)}')
    accelerator.end_training()
    accelerator.clear()

When I launch it like this, it runs as expected:

import os 

os.environ["NCCL_P2P_DISABLE"]="1"

for fold_id in FOLDS[3:]:
    output_path = Path(f"fold{fold_id}")
    output_path.mkdir(exist_ok=True)
    print(f"[fold{fold_id}]")
    notebook_launcher(train_ddp_accelerate, args=(CFG, fold_id, train, output_path), num_processes=3,mixed_precision='fp16')

But when I launch it like this:

import os 
os.environ['CUDA_DEVICE_ORDER']="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
os.environ["NCCL_P2P_DISABLE"]="1"


for fold_id in FOLDS[3:]:
    output_path = Path(f"fold{fold_id}")
    output_path.mkdir(exist_ok=True)
    print(f"[fold{fold_id}]")
    notebook_launcher(train_ddp_accelerate, args=(CFG, fold_id, train, output_path), num_processes=2,mixed_precision='fp16')

I get:

---------------------------------------------------------------------------
ProcessRaisedException                    Traceback (most recent call last)
File ~/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/accelerate/launchers.py:200, in notebook_launcher(function, args, num_processes, mixed_precision, use_port, master_addr, node_rank, num_nodes)
    199 try:
--> 200     start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
    201 except ProcessRaisedException as e:

File ~/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/multiprocessing/spawn.py:197, in start_processes(fn, args, nprocs, join, daemon, start_method)
    196 # Loop on join until it returns True or raises an exception.
--> 197 while not context.join():
    198     pass

File ~/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/multiprocessing/spawn.py:158, in ProcessContext.join(self, timeout)
    157 msg += original_trace
--> 158 raise ProcessRaisedException(msg, error_index, failed_process.pid)

ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 315, in _lazy_init
    queued_call()
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 183, in _check_capability
    capability = get_device_capability(d)
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 439, in get_device_capability
    prop = get_device_properties(device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 457, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1704987288773/work/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=2, num_gpus=

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
    fn(i, *args)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/accelerate/utils/launch.py", line 570, in __call__
    self.launcher(*args)
  File "/tmp/ipykernel_1310472/2664963675.py", line 3, in train_ddp_accelerate
    accelerator = Accelerator(mixed_precision='fp16')
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/accelerate/accelerator.py", line 378, in __init__
    self.state = AcceleratorState(
                 ^^^^^^^^^^^^^^^^^
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/accelerate/state.py", line 771, in __init__
    PartialState(cpu, **kwargs)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/accelerate/state.py", line 236, in __init__
    torch.cuda.set_device(self.device)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 408, in set_device
    torch._C._cuda_setDevice(device)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 321, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1704987288773/work/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=2, num_gpus=

CUDA call was originally invoked at:

  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel/kernelapp.py", line 739, in start
    self.io_loop.start()
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/tornado/platform/asyncio.py", line 195, in start
    self.asyncio_loop.run_forever()
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/asyncio/base_events.py", line 607, in run_forever
    self._run_once()
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/asyncio/base_events.py", line 1922, in _run_once
    handle._run()
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 542, in dispatch_queue
    await self.process_one()
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 531, in process_one
    await dispatch(*args)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 437, in dispatch_shell
    await result
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel/ipkernel.py", line 359, in execute_request
    await super().execute_request(stream, ident, parent)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 775, in execute_request
    reply_content = await reply_content
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel/ipkernel.py", line 446, in do_execute
    res = shell.run_cell(
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel/zmqshell.py", line 549, in run_cell
    return super().run_cell(*args, **kwargs)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3051, in run_cell
    result = self._run_cell(
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3106, in _run_cell
    result = runner(coro)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
    coro.send(None)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3311, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3493, in run_ast_nodes
    if await self.run_code(code, result, async_=asy):
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_1310472/3735111654.py", line 18, in <module>
    import torch
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/__init__.py", line 1421, in <module>
    _C._initExtension(manager_path())
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 247, in <module>
    _lazy_call(_check_capability)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))



The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[31], line 6
      4 output_path.mkdir(exist_ok=True)
      5 print(f"[fold{fold_id}]")
----> 6 notebook_launcher(train_ddp_accelerate, args=(CFG, fold_id, train, output_path), num_processes=2,mixed_precision='fp16')

File ~/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/accelerate/launchers.py:210, in notebook_launcher(function, args, num_processes, mixed_precision, use_port, master_addr, node_rank, num_nodes)
    203                 raise RuntimeError(
    204                     "CUDA has been initialized before the `notebook_launcher` could create a forked subprocess. "
    205                     "This likely stems from an outside import causing issues once the `notebook_launcher()` is called. "
    206                     "Please review your imports and test them when running the `notebook_launcher()` to identify "
    207                     "which one is problematic and causing CUDA to be initialized."
    208                 ) from e
    209             else:
--> 210                 raise RuntimeError(f"An issue was found when launching the training: {e}") from e
    212 else:
    213     # No need for a distributed launch otherwise as it's either CPU, GPU or MPS.
    214     if is_mps_available():

RuntimeError: An issue was found when launching the training: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 315, in _lazy_init
    queued_call()
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 183, in _check_capability
    capability = get_device_capability(d)
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 439, in get_device_capability
    prop = get_device_properties(device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 457, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1704987288773/work/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=2, num_gpus=

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
    fn(i, *args)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/accelerate/utils/launch.py", line 570, in __call__
    self.launcher(*args)
  File "/tmp/ipykernel_1310472/2664963675.py", line 3, in train_ddp_accelerate
    accelerator = Accelerator(mixed_precision='fp16')
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/accelerate/accelerator.py", line 378, in __init__
    self.state = AcceleratorState(
                 ^^^^^^^^^^^^^^^^^
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/accelerate/state.py", line 771, in __init__
    PartialState(cpu, **kwargs)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/accelerate/state.py", line 236, in __init__
    torch.cuda.set_device(self.device)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 408, in set_device
    torch._C._cuda_setDevice(device)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 321, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1704987288773/work/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=2, num_gpus=

CUDA call was originally invoked at:

  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel/kernelapp.py", line 739, in start
    self.io_loop.start()
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/tornado/platform/asyncio.py", line 195, in start
    self.asyncio_loop.run_forever()
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/asyncio/base_events.py", line 607, in run_forever
    self._run_once()
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/asyncio/base_events.py", line 1922, in _run_once
    handle._run()
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 542, in dispatch_queue
    await self.process_one()
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 531, in process_one
    await dispatch(*args)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 437, in dispatch_shell
    await result
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel/ipkernel.py", line 359, in execute_request
    await super().execute_request(stream, ident, parent)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 775, in execute_request
    reply_content = await reply_content
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel/ipkernel.py", line 446, in do_execute
    res = shell.run_cell(
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/ipykernel/zmqshell.py", line 549, in run_cell
    return super().run_cell(*args, **kwargs)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3051, in run_cell
    result = self._run_cell(
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3106, in _run_cell
    result = runner(coro)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
    coro.send(None)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3311, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3493, in run_ast_nodes
    if await self.run_code(code, result, async_=asy):
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_1310472/3735111654.py", line 18, in <module>
    import torch
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/__init__.py", line 1421, in <module>
    _C._initExtension(manager_path())
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 247, in <module>
    _lazy_call(_check_capability)
  File "/home/felipe/anaconda3/envs/cuda_12.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))

Expected behavior

It should run on 2 GPUs with 2 processes, just as it does on 3 GPUs with 3 processes.

MrRobot2211 avatar Feb 18 '24 04:02 MrRobot2211

Hi @MrRobot2211, could you try running it without setting os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"? I want to check whether that line is what causes the issue. Thanks!

SunMarc avatar Feb 23 '24 16:02 SunMarc

Hello, it does run to completion without setting CUDA visible devices (it runs on all 3 GPUs). If you can point me to a code sample or tutorial that you are confident should run identically both ways, I am happy to try that.

MrRobot2211 avatar Feb 24 '24 23:02 MrRobot2211

Thanks! I will investigate why this is happening. If you could share a minimal reproducer, that would help me a lot in fixing this issue.

SunMarc avatar Feb 26 '24 16:02 SunMarc

@MrRobot2211 what happens if you set CUDA_VISIBLE_DEVICES via os.environ before any import of torch/accelerate?

IIRC this needs to happen first because torch does some CUDA setup on import.
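
For example, the failing launch cell from the reproduction above would be reordered so that the os.environ assignments run before the first torch or accelerate import anywhere in the notebook. A minimal sketch, reusing CFG, FOLDS, train, and train_ddp_accelerate exactly as defined in the reproduction:

import os

# Must run before the first `import torch` / `import accelerate` in the
# notebook, otherwise the set of visible CUDA devices is already fixed.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
os.environ["NCCL_P2P_DISABLE"] = "1"

from pathlib import Path
import torch
from accelerate import notebook_launcher

for fold_id in FOLDS[3:]:
    output_path = Path(f"fold{fold_id}")
    output_path.mkdir(exist_ok=True)
    print(f"[fold{fold_id}]")
    notebook_launcher(train_ddp_accelerate, args=(CFG, fold_id, train, output_path),
                      num_processes=2, mixed_precision='fp16')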

muellerzr avatar Feb 26 '24 19:02 muellerzr

@muellerzr yep, that did it, thank you. Incidentally, I was also able to get rid of os.environ["NCCL_P2P_DISABLE"] = "1" by creating a get_dataloader function.
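
A hypothetical get_dataloader helper along those lines (the exact code is not shown in this thread; the signature and the CFG fields are assumed from the reproduction above) could look like:

def get_dataloader(dataset, CFG, is_train=True):
    # Builds a DataLoader with the same settings as in the reproduction,
    # so the loaders can be constructed inside the launched function.
    return torch.utils.data.DataLoader(
        dataset,
        batch_size=CFG.batch_size,
        pin_memory=True,
        num_workers=4,
        shuffle=is_train,
        drop_last=is_train,
    )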

MrRobot2211 avatar Feb 27 '24 02:02 MrRobot2211

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Mar 22 '24 15:03 github-actions[bot]