LLaMA-Factory
cannot use pure_bf16 with zero3 cpu offload
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
I'm trying to do full SFT on Mixtral 8x22B using two 8×A100 (80GB) instances. On the first try I used pure_bf16 with ZeRO-3 but got a GPU OOM. Then I switched to ZeRO-3 with CPU offload, but I get:
```text
Traceback (most recent call last):
  File "/opt/conda/envs/ptca/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/ptca/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/run.py", line 816, in
```
Expected behavior
No response
System Info
No response
Others
No response
Could you try using the bf16 + pure_bf16 params together?
@hiyouga Can you be more specific: can --pure_bf16 be used together with --bf16? And should I use CPU offload too?
Yep, use both of the params.
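For concreteness, a minimal sketch of that flag combination in the style of the linked multi_node.sh (model, dataset, and rendezvous values are placeholders, not a verified recipe):

```bash
# Sketch only: bf16 + pure_bf16 together, launched via torch.distributed.run
# as in examples/full_multi_gpu/multi_node.sh. All values are placeholders.
python -m torch.distributed.run \
    --nproc_per_node 8 --nnodes 2 --node_rank "$RANK" \
    --master_addr "$MASTER_ADDR" --master_port "$MASTER_PORT" \
    src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path mistralai/Mixtral-8x22B-v0.1 \
    --dataset alpaca_en \
    --template mistral \
    --finetuning_type full \
    --output_dir saves/mixtral-8x22b/full/sft \
    --bf16 \
    --pure_bf16
```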
Thanks, I used --pure_bf16 and --bf16 together with the ds3_cpu_offload DeepSpeed config, but I still get the same error. I'm using this command: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/full_multi_gpu/multi_node.sh Is it because I use torch.distributed.run? How can I convert it to deepspeed?
pure_bf16 does not currently support DeepSpeed, so use either one of the two, not both.
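Putting that together: with a DeepSpeed config, keep --bf16 only and swap torch.distributed.run for the deepspeed launcher, roughly like this (a sketch only; hostnames are placeholders, and DeepSpeed's multi-node launcher assumes passwordless SSH between nodes):

```bash
# Sketch only: multi-node launch via the deepspeed launcher instead of
# torch.distributed.run. Run from the head node; hostnames are placeholders.
cat > hostfile <<'EOF'
node-0 slots=8
node-1 slots=8
EOF

deepspeed --hostfile hostfile --num_nodes 2 --num_gpus 8 \
    src/train_bash.py \
    --deepspeed ds3_cpu_offload.json \
    --stage sft \
    --do_train \
    --model_name_or_path mistralai/Mixtral-8x22B-v0.1 \
    --finetuning_type full \
    --output_dir saves/mixtral-8x22b/full/sft \
    --bf16
```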