Not possible to use the notebook_launcher on a cluster of A6000 series
System Info
Copy-and-paste the text below in your GitHub issue
- `Accelerate` version: 0.12.0
- Platform: Linux-5.4.0-105-generic-x86_64-with-debian-buster-sid
- Python version: 3.7.13
- Numpy version: 1.21.5
- PyTorch version (GPU?): 1.12.0 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: bf16
- use_cpu: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- main_process_ip: None
- main_process_port: None
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
- downcast_bf16: False
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
from accelerate import Accelerator, notebook_launcher
from accelerate.utils import set_seed

def training_loop(mixed_precision="bf16", seed: int = 42, batch_size: int = 64):
    set_seed(seed)
    accelerator = Accelerator(mixed_precision=mixed_precision)

args = ("bf16", 42, 64)
notebook_launcher(training_loop, args, num_processes=8)
Expected behavior
I expect training to start with autocasting to bfloat16 - instead, I get the error: "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method"
AFAICT, this is due to the "is_bf16_available" function in accelerate.utils, which calls "torch.cuda.is_available()" and "torch.cuda.is_bf16_supported()".
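To illustrate the interaction, here is a minimal sketch outside of Accelerate (just an illustration of the failure mode, not the library's code; it assumes a CUDA machine with at least two GPUs):

import torch
from torch.multiprocessing.spawn import start_processes

def worker(rank):
    # any CUDA call in a forked child hits the same RuntimeError
    torch.cuda.set_device(rank)

# current_device() triggers torch.cuda._lazy_init() in the parent process,
# so the forked children below cannot re-initialize CUDA themselves
torch.cuda.current_device()
start_processes(worker, nprocs=2, start_method="fork")  # fails; "spawn" avoids it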
@anshradh can you please provide the full stack trace it gave you?
Launching training on 8 GPUs.
---------------------------------------------------------------------------
ProcessRaisedException Traceback (most recent call last)
/tmp/ipykernel_1090735/2038238995.py in <module>
1 args = ("bf16", 42, 64)
----> 2 notebook_launcher(training_loop, args, num_processes=8)
2 frames
/opt/conda/lib/python3.7/site-packages/accelerate/launchers.py in notebook_launcher(function, args, num_processes, use_fp16, mixed_precision, use_port)
125
126 print(f"Launching training on {num_processes} GPUs.")
--> 127 start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
128
129 else:
/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
196
197 # Loop on join until it returns True or raises an exception.
--> 198 while not context.join():
199 pass
200
/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
158 msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
159 msg += original_trace
--> 160 raise ProcessRaisedException(msg, error_index, failed_process.pid)
161
162
ProcessRaisedException:
-- Process 6 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/opt/conda/lib/python3.7/site-packages/accelerate/utils/launch.py", line 72, in __call__
self.launcher(*args)
File "/tmp/ipykernel_1090735/990520615.py", line 5, in training_loop
accelerator = Accelerator(mixed_precision=mixed_precision)
File "/opt/conda/lib/python3.7/site-packages/accelerate/accelerator.py", line 295, in __init__
self.native_amp = is_bf16_available(True)
File "/opt/conda/lib/python3.7/site-packages/accelerate/utils/imports.py", line 83, in is_bf16_available
return torch.cuda.is_bf16_supported()
File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 93, in is_bf16_supported
return torch.cuda.get_device_properties(torch.cuda.current_device()).major >= 8 and cuda_maj_decide
File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 482, in current_device
_lazy_init()
File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 208, in _lazy_init
"Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
And just to confirm, we're trying to launch bf16 training on 8 GPUs and not TPUs right? :)
Yup, 8 GPUs!
@anshradh I'm not noticing this on two A100s. Can you verify that the following minimal code works for you in a freshly reset Jupyter notebook instance? I've broken it up by cells (which shouldn't matter):
from accelerate import Accelerator, notebook_launcher
from accelerate.utils import set_seed

def training_loop():
    set_seed(42)
    accelerator = Accelerator(mixed_precision="bf16")
    print("Hello There!")

notebook_launcher(training_loop, (), num_processes=2)
(replace 2 with n, your number of GPUs; you can also just try 2)
Nope, still breaks strangely enough (same error and stack trace).
What GPU are you using in Colab? IIRC out of the box they only have single-GPU instances. (I'm trying to make my test env match yours as closely as humanly possible.)
It's actually connected to a local runtime which consists of 8 A100s.
Sorry 8 A6000s, not A100s!
Hi @anshradh, I'm still working on getting resources to test this out, but I think I found the cause. Can we try one more time with Accelerate installed from my bugfix branch?
pip install git+https://github.com/huggingface/accelerate@bf16-stas-method
Hopefully it works, thanks!
Hmm it still seems to be failing on the minimal example - I get the following stack trace:
ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/opt/conda/lib/python3.7/site-packages/accelerate/utils/launch.py", line 89, in __call__
self.launcher(*args)
File "/tmp/ipykernel_2267126/3328339086.py", line 3, in training_loop
accelerator = Accelerator(mixed_precision="bf16")
File "/opt/conda/lib/python3.7/site-packages/accelerate/accelerator.py", line 251, in __init__
**kwargs,
File "/opt/conda/lib/python3.7/site-packages/accelerate/state.py", line 158, in __init__
torch.cuda.set_device(self.device)
File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 314, in set_device
torch._C._cuda_setDevice(device)
File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 208, in _lazy_init
"Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
@anshradh actually this seems to be entirely unrelated to bf16. Does it work when setting the mixed precision to none (the default) or fp16?
Nope, same stack trace for both of those cases.
I'll go ahead and reclassify this as a feature request then and slightly modify the title, since it's really a feature request for running the notebook_launcher on the A6000 series. For now I'd recommend launching through the command line instead (not the best answer, I know, but that should work until we can get this in!).
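For reference, that route would look something like this (train.py here is just a placeholder for wherever your training function lives):
accelerate config
accelerate launch train.py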
Sounds good, thanks for your help!
No problem! Last question: what provider are you using for renting the 8 A6000s, so I can look into it? :)
https://www.runpod.io/
Just to make sure, does it break also if you don't use Colab? (Sometimes colab does some very weird things, so if we can test in native Jupyter that'd be nice :) )
Nope, unfortunately no luck with Jupyter either.
Thanks @anshradh, I've gotten into a pod and can recreate the bug. Working on a fix!
This is indeed a torch backend issue; you can track its progress here: https://github.com/pytorch/pytorch/issues/85841
Oh interesting, thanks for diving into this!