LLaMA-Factory
cannot use pure_bf16 with zero3 cpu offload
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
I'm trying to do full SFT on Mixtral 8x22B using two 8×A100 (80GB) instances. On the first try I used pure_bf16 with ZeRO-3 but got a GPU OOM. Then I switched to ZeRO-3 with CPU offload, but I get:
```text
Traceback (most recent call last):
  File "/opt/conda/envs/ptca/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/ptca/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/run.py", line 816, in
```
Expected behavior
No response
System Info
No response
Others
No response
Could you try using the bf16 + pure_bf16 params together?
@hiyouga Can you be more specific: can --pure_bf16 be used together with --bf16? And should I use CPU offload too?
Yep, use both of the params.
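For concreteness, a minimal sketch of that flag combination in the style of the linked multi_node.sh (model, dataset, and rendezvous values are placeholders, not a verified recipe):

```bash
# Sketch only: bf16 + pure_bf16 together, launched via torch.distributed.run
# as in examples/full_multi_gpu/multi_node.sh. All values are placeholders.
python -m torch.distributed.run \
    --nproc_per_node 8 --nnodes 2 --node_rank "$RANK" \
    --master_addr "$MASTER_ADDR" --master_port "$MASTER_PORT" \
    src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path mistralai/Mixtral-8x22B-v0.1 \
    --dataset alpaca_en \
    --template mistral \
    --finetuning_type full \
    --output_dir saves/mixtral-8x22b/full/sft \
    --bf16 \
    --pure_bf16
```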
Thanks, I used --pure_bf16 and --bf16 together with the ds3_cpu_offload DeepSpeed config, but I still get the same error. I'm using this command: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/full_multi_gpu/multi_node.sh Is it because I use torch.distributed.run? How can I convert it to deepspeed?
pure_bf16 does not currently support DeepSpeed, so use either one of the two, not both.
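Putting that together: with a DeepSpeed config, keep --bf16 only and swap torch.distributed.run for the deepspeed launcher, roughly like this (a sketch only; hostnames are placeholders, and DeepSpeed's multi-node launcher assumes passwordless SSH between nodes):

```bash
# Sketch only: multi-node launch via the deepspeed launcher instead of
# torch.distributed.run. Run from the head node; hostnames are placeholders.
cat > hostfile <<'EOF'
node-0 slots=8
node-1 slots=8
EOF

deepspeed --hostfile hostfile --num_nodes 2 --num_gpus 8 \
    src/train_bash.py \
    --deepspeed ds3_cpu_offload.json \
    --stage sft \
    --do_train \
    --model_name_or_path mistralai/Mixtral-8x22B-v0.1 \
    --finetuning_type full \
    --output_dir saves/mixtral-8x22b/full/sft \
    --bf16
```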