[Chatllama]: MultiGPU support for training
I'm trying to train the actor model (BLOOM 1.5B) on a multi-GPU setup (3x V100s). When I observe GPU usage, only GPU:0 is utilized, and I run out of memory if I increase the batch_size.
Could you add multi-GPU support using HuggingFace's Accelerate, to make it possible to train larger models with a larger batch size?
Thank you
Hi @TejaGollapudi, thank you very much for reaching out. We are currently working on supporting the Accelerate library. You can follow the updates directly on PR #233.
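For anyone unfamiliar with what the Accelerate integration would look like, here is a minimal sketch of the standard HF Accelerate training pattern. The `model`, `optimizer`, and `train_loader` objects are placeholders, not chatllama's actual classes, so treat this as an illustration of the mechanism rather than the code in the PR.

```python
# Minimal sketch of the usual HF Accelerate pattern; all objects below are
# stand-ins, not chatllama's actual actor/trainer classes.
import torch
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(16, 2)  # stand-in for the actor model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,))),
    batch_size=8,
)

# prepare() moves everything to the right device on each process and wraps
# the model for distributed training across all available GPUs.
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

for inputs, labels in train_loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```

A script written this way is started with `accelerate launch` (or `deepspeed`) rather than plain `python`, which is what spawns one process per GPU.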
I added Accelerate to the code as in #233, but got an error:
```
Traceback (most recent call last):
  File "/nvmessd0/nebullvm/apps/accelerate/chatllama/artifacts/main.py", line 3, in
```
@leonselina We will be releasing support for Accelerate very soon! We are currently testing the code and will keep you updated when we merge the code!
When will this multi-GPU support be available? Really looking forward to it.
Also looking forward to it!
Hi everyone @bin123apple @balcklive @TejaGollapudi, you can try PR #306, where DeepSpeed and Accelerate should be working fine. Keep in mind to launch the training with `deepspeed artifacts/main.py ...` or `accelerate launch ...` instead of using `python`. If you have any other problems on the matter, let me know!
Hi @PierpaoloSorbellini, I trained LLaMA 7B with DeepSpeed but got the error "MP=1 but world size is 2". How can I train LLaMA 7B on multiple GPUs? Because of the VRAM limits, maybe I should use model parallelism instead of data parallelism for multi-GPU training. Thanks :)
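For context, here is a hedged sketch of where that error message typically comes from in the original LLaMA example loader (this may not be chatllama's exact code): the model-parallel size is inferred from the number of checkpoint shards, and it must equal the number of launched processes. The 7B checkpoint ships as a single shard, so launching two processes against it trips the assertion.

```python
# Sketch of the MP / world-size check in the native LLaMA checkpoint loader
# (an assumption about the code path, shown only to explain the error text).
import glob
import os

def checkpoint_model_parallel_size(ckpt_dir: str) -> int:
    """Number of model-parallel shards (consolidated.*.pth) in a LLaMA checkpoint dir."""
    return len(glob.glob(os.path.join(ckpt_dir, "consolidated.*.pth")))

def assert_world_size_matches(ckpt_dir: str) -> None:
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    mp = checkpoint_model_parallel_size(ckpt_dir)  # 7B has a single shard, so MP = 1
    assert world_size == mp, f"MP={mp} but world size is {world_size}"
```

So with the native 7B weights you would either launch a single process, reshard the checkpoint, or rely on a different parallelism strategy (e.g. DeepSpeed ZeRO) instead of the built-in model parallelism.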
@PierpaoloSorbellini Hey, I tried LLaMA in HF format and used DeepSpeed with --num_gpus=2. The model was loaded twice, and both copies ended up on the rank-0 GPU, which caused a CUDA OOM.
Do you have any ideas on how to fix this?
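One common cause of this symptom is that each spawned process builds the model without being pinned to its own device, so every replica lands on cuda:0. A hedged sketch of the usual fix (the checkpoint path is a placeholder, and this is not necessarily how chatllama constructs the model):

```python
# Sketch: pin each DeepSpeed/Accelerate rank to its own GPU via LOCAL_RANK
# before creating or moving the model, so rank 1 loads onto cuda:1 rather
# than piling onto cuda:0.
import os
import torch
from transformers import AutoModelForCausalLM

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

model = AutoModelForCausalLM.from_pretrained("/path/to/llama-7b-hf")  # hypothetical path
model = model.to(torch.device("cuda", local_rank))
```

If the 7B weights still don't fit per GPU even with correct placement, that points back to needing ZeRO offload/partitioning or model parallelism rather than plain data parallelism.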