fsdp_qlora
RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase
Hello everyone!
First, thank you for this implementation!
Unfortunately, I hit an issue when running this: RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.
I debugged it a bit and it seems that PEFT v0.9.0 breaks it. The previous release, PEFT v0.8.2, works fine. The fix is either to downgrade PEFT or to move all the peft imports in train.py inside the functions where they are used, like this: https://github.com/geronimi73/fsdp_qlora/tree/fix_ProcessExitedException
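For anyone unfamiliar with the "move the imports inside the functions" workaround, here is a minimal sketch of the pattern. It is not the code from the linked branch: `worker` and its `module_name` parameter are hypothetical names for illustration, and `importlib.import_module` stands in for a plain `from peft import ...` statement so that the snippet runs even without peft installed.

```python
import importlib
import sys


def worker(rank: int, module_name: str = "peft"):
    """Hypothetical per-process entry point, similar in role to fsdp_main.

    The import happens here, at call time, instead of at the top of the
    script. A child started with the "spawn" method re-runs the top level
    of the main script, so anything imported there is imported again in
    every child before the worker function is even called.
    """
    # In train.py this would be an ordinary import statement placed
    # inside the function body (assumption: e.g. `from peft import ...`).
    module = importlib.import_module(module_name)
    return rank, module.__name__


def is_loaded(module_name: str) -> bool:
    """Report whether a module has already been imported in this process."""
    return module_name in sys.modules
```

Calling `worker(0, module_name="colorsys")` imports the standard-library module `colorsys` only at that point; before the call, `is_loaded("colorsys")` is normally false in a fresh interpreter.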
I'm not sure whether I am doing something wrong, or how nobody else has noticed this, since PEFT 0.9 was released two weeks ago. Any ideas what might be wrong?
command:
python train.py \
--model_name models/llama2-7b \
--gradient_accumulation_steps 4 \
--batch_size 8 \
--context_length 512 \
--precision bf16 \
--train_type full \
--use_gradient_checkpointing true \
--use_cpu_offload false \
--use_activation_cpu_offload false \
--log_to wandb \
--dataset alpaca
Note: models/llama2-7b is meta-llama/Llama-2-7b-hf
stacktrace:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/g/fsdp_qlora/train.py", line 939, in <module>
def main(
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/home/g/fsdp_qlora/train.py", line 1010, in main
mp.spawn(fsdp_main,
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
World size: 2
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/g/fsdp_qlora/train.py", line 939, in <module>
def main(
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/home/g/fsdp_qlora/train.py", line 1010, in main
mp.spawn(fsdp_main,
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Traceback (most recent call last):
File "/home/g/fsdp_qlora/train.py", line 939, in <module>
def main(
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/home/g/fsdp_qlora/train.py", line 1010, in main
mp.spawn(fsdp_main,
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 148, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1
pip list:
accelerate 0.28.0
bitsandbytes 0.43.0
fastcore 1.5.29
flash-attn 2.5.6
hqq 0.1.5
peft 0.9.0
torch 2.2.1
transformers 4.38.2
2x 3090, CUDA Version: 12.2
I am unable to replicate this issue on both fresh python 3.11 and 3.10 environments with PEFT 0.9.0.
accelerate 0.27.2
bitsandbytes 0.43.0
datasets 2.18.0
hqq 0.1.5
hqq-aten 0.0.0
huggingface-hub 0.21.4
llama-recipes 0.0.1
peft 0.9.0
safetensors 0.4.2
tokenizers 0.15.2
torch 2.2.1+cu121
transformers 4.38.2
I am unable to replicate this issue on both fresh python 3.11 and 3.10 environments
Yes, clean environment works for me as well.
Spent 2hrs uninstalling stuff and I found the combination that breaks it: autoawq==0.2.2 + peft==0.9.0. It's something in autoawq that breaks it. Autoawq imports were introduced in peft 0.9.0, that's why peft 0.8.2 works.
You can try to reproduce this weirdness by
- installing autoawq==0.2.2 and peft==0.9.0 and running train.py
- or just installing/upgrading to peft==0.9.0 and adding import awq to the other imports in train.py
no idea what's going on
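To check quickly whether an environment has the combination reported above, something like the following sketch may help. The helper names `installed_version` and `has_suspect_combo` are made up for illustration; the pinned versions are just the two mentioned in this thread, and other versions may or may not be affected.

```python
from importlib import metadata

# Versions reported in this thread as breaking train.py when installed together.
SUSPECT = {"autoawq": "0.2.2", "peft": "0.9.0"}


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def has_suspect_combo(versions=None) -> bool:
    """True only if every package in SUSPECT is present at the pinned version.

    `versions` can be passed explicitly (name -> version string or None);
    by default the current environment is inspected.
    """
    if versions is None:
        versions = {name: installed_version(name) for name in SUSPECT}
    return all(versions.get(name) == pinned for name, pinned in SUSPECT.items())


if __name__ == "__main__":
    print("suspect combination installed:", has_suspect_combo())
```

Running the file as a script prints whether both packages are installed at exactly those versions in the current environment.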
@warner-benjamin I was able to replicate this: installing autoawq==0.2.2 broke a previously working system. @geronimi73 if you open a PR with your fix we can include it just in case others hit this same issue?