Not possible to use the notebook_launcher on a cluster of A6000 series
System Info
Copy-and-paste the text below in your GitHub issue
- `Accelerate` version: 0.12.0
- Platform: Linux-5.4.0-105-generic-x86_64-with-debian-buster-sid
- Python version: 3.7.13
- Numpy version: 1.21.5
- PyTorch version (GPU?): 1.12.0 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: bf16
- use_cpu: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- main_process_ip: None
- main_process_port: None
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
- downcast_bf16: False
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
from accelerate import Accelerator, notebook_launcher
from accelerate.utils import set_seed

def training_loop(mixed_precision="bf16", seed: int = 42, batch_size: int = 64):
    set_seed(seed)
    accelerator = Accelerator(mixed_precision=mixed_precision)

args = ("bf16", 42, 64)
notebook_launcher(training_loop, args, num_processes=8)
Expected behavior
I expect training to start with autocasting to bfloat16 - instead, I get the error: "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method"
AFAICT, this is due to the "is_bf16_available" function in accelerate.utils, which calls "torch.cuda.is_available()" and "torch.cuda.is_bf16_supported()".
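To illustrate the interaction, here is a minimal sketch outside of Accelerate (just an illustration of the failure mode, not the library's code; it assumes a CUDA machine with at least two GPUs):

import torch
from torch.multiprocessing.spawn import start_processes

def worker(rank):
    # any CUDA call in a forked child hits the same RuntimeError
    torch.cuda.set_device(rank)

# current_device() triggers torch.cuda._lazy_init() in the parent process,
# so the forked children below cannot re-initialize CUDA themselves
torch.cuda.current_device()
start_processes(worker, nprocs=2, start_method="fork")  # fails; "spawn" avoids it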
@anshradh can you please provide the full stack trace it gave you?
Launching training on 8 GPUs.
---------------------------------------------------------------------------
ProcessRaisedException Traceback (most recent call last)
/tmp/ipykernel_1090735/2038238995.py in <module>
1 args = ("bf16", 42, 64)
----> 2 notebook_launcher(training_loop, args, num_processes=8)
2 frames
/opt/conda/lib/python3.7/site-packages/accelerate/launchers.py in notebook_launcher(function, args, num_processes, use_fp16, mixed_precision, use_port)
125
126 print(f"Launching training on {num_processes} GPUs.")
--> 127 start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
128
129 else:
/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
196
197 # Loop on join until it returns True or raises an exception.
--> 198 while not context.join():
199 pass
200
/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
158 msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
159 msg += original_trace
--> 160 raise ProcessRaisedException(msg, error_index, failed_process.pid)
161
162
ProcessRaisedException:
-- Process 6 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/opt/conda/lib/python3.7/site-packages/accelerate/utils/launch.py", line 72, in __call__
self.launcher(*args)
File "/tmp/ipykernel_1090735/990520615.py", line 5, in training_loop
accelerator = Accelerator(mixed_precision=mixed_precision)
File "/opt/conda/lib/python3.7/site-packages/accelerate/accelerator.py", line 295, in __init__
self.native_amp = is_bf16_available(True)
File "/opt/conda/lib/python3.7/site-packages/accelerate/utils/imports.py", line 83, in is_bf16_available
return torch.cuda.is_bf16_supported()
File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 93, in is_bf16_supported
return torch.cuda.get_device_properties(torch.cuda.current_device()).major >= 8 and cuda_maj_decide
File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 482, in current_device
_lazy_init()
File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 208, in _lazy_init
"Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
And just to confirm, we're trying to launch bf16 training on 8 GPUs and not TPUs right? :)
Yup, 8 GPUs!
@anshradh I'm not noticing this on two A100s. Can you verify that the following minimal code works for you in a freshly reset Jupyter notebook instance? I've broken it up by cells (which shouldn't matter):
from accelerate import Accelerator, notebook_launcher
from accelerate.utils import set_seed

def training_loop():
    set_seed(42)
    accelerator = Accelerator(mixed_precision="bf16")
    print("Hello There!")

notebook_launcher(training_loop, (), num_processes=2)
(replace 2 with n, your number of GPUs; you can also just try 2)
Nope, still breaks strangely enough (same error and stack trace).
What GPU are you using in Colab? IIRC out of the box they only have single-GPU instances. (I'm trying to make my test env match yours as closely as humanly possible.)
It's actually connected to a local runtime which consists of 8 A100s.
Sorry 8 A6000s, not A100s!
Hi @anshradh, I'm still working on getting resources to test this out, but I think I found the cause. Can we try one more time with Accelerate installed from my bugfix branch?
pip install git+https://github.com/huggingface/accelerate@bf16-stas-method
Hopefully it works, thanks!
Hmm it still seems to be failing on the minimal example - I get the following stack trace:
ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/opt/conda/lib/python3.7/site-packages/accelerate/utils/launch.py", line 89, in __call__
self.launcher(*args)
File "/tmp/ipykernel_2267126/3328339086.py", line 3, in training_loop
accelerator = Accelerator(mixed_precision="bf16")
File "/opt/conda/lib/python3.7/site-packages/accelerate/accelerator.py", line 251, in __init__
**kwargs,
File "/opt/conda/lib/python3.7/site-packages/accelerate/state.py", line 158, in __init__
torch.cuda.set_device(self.device)
File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 314, in set_device
torch._C._cuda_setDevice(device)
File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 208, in _lazy_init
"Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
@anshradh actually this seems to be entirely unrelated to bf16. Does it work when setting the mixed precision to none (the default) or fp16?
Nope, same stack trace for both of those cases.
I'll go ahead and reclassify this as a feature request then and slightly modify the title, since it's really a feature request for running the notebook_launcher on the A6000 series. For now I'd recommend launching through the command line instead (not the best answer, I know, but that should work until we can get this in!).
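For reference, that route would look something like this (train.py here is just a placeholder for wherever your training function lives):
accelerate config
accelerate launch train.py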
Sounds good, thanks for your help!
No problem! Last question: what provider are you using for renting the 8 A6000s, so I can look into it? :)
https://www.runpod.io/
Just to make sure, does it break also if you don't use Colab? (Sometimes colab does some very weird things, so if we can test in native Jupyter that'd be nice :) )
Nope, unfortunately no luck with Jupyter either.
Thanks @anshradh, I've gotten into a pod and can recreate the bug. Working on a fix!
This is indeed a torch backend issue; you can track its progress here: https://github.com/pytorch/pytorch/issues/85841
Oh interesting, thanks for diving into this!