
[Chatllama]: MultiGPU support for training

Open TejaGollapudi opened this issue 1 year ago • 8 comments

I'm trying to train the actor model (BLOOM 1.5B) on a multi-GPU setup (3 V100s). When I observe GPU usage, only GPU 0 is utilized, and I run out of memory if I increase the batch_size.

Could you add multi-GPU support using Hugging Face's Accelerate to facilitate training larger models with a larger batch size?

Thank you
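
For reference, a minimal sketch of the kind of Accelerate integration being requested (the toy model, optimizer, and dataset below are placeholders, not ChatLLaMA's actual actor-training code):

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Accelerator detects the available GPUs and the launcher configuration.
accelerator = Accelerator()

# Stand-ins for the real actor model and RLHF dataset.
model = torch.nn.Linear(128, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(256, 128), torch.randn(256, 1))
dataloader = DataLoader(dataset, batch_size=32)

# prepare() wraps the objects so that, when launched on multiple GPUs,
# each process gets its own model replica and its own shard of the batches.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    accelerator.backward(loss)  # replaces loss.backward() so gradients sync across processes
    optimizer.step()

Run with "accelerate launch script.py" (after "accelerate config") so one process is spawned per GPU; running it with plain "python" keeps everything on a single device.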

TejaGollapudi avatar Mar 11 '23 22:03 TejaGollapudi

Hi @TejaGollapudi, thank you very much for reaching out. We are currently working on supporting the Accelerate library. You can follow the updates directly on PR #233.

diegofiori avatar Mar 12 '23 09:03 diegofiori

I added Accelerate in the code as in #233, but got this error:

Traceback (most recent call last):
  File "/nvmessd0/nebullvm/apps/accelerate/chatllama/artifacts/main.py", line 3, in <module>
    from chatllama.rlhf.actor import ActorTrainer
  File "/home/spzq/.local/lib/python3.10/site-packages/chatllama/rlhf/actor.py", line 17, in <module>
    from chatllama.rlhf.config import ConfigActor
  File "/home/spzq/.local/lib/python3.10/site-packages/chatllama/rlhf/config.py", line 71, in <module>
    class ConfigActor:
  File "/usr/lib/python3.10/dataclasses.py", line 1187, in dataclass
    return wrap(cls)
  File "/usr/lib/python3.10/dataclasses.py", line 1178, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash,
  File "/usr/lib/python3.10/dataclasses.py", line 1027, in _process_class
    _init_fn(all_init_fields,
  File "/usr/lib/python3.10/dataclasses.py", line 548, in _init_fn
    raise TypeError(f'non-default argument {f.name!r} '
TypeError: non-default argument 'device' follows default argument
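
For context, this TypeError comes from Python's dataclass rules rather than from Accelerate itself: once one field has a default value, every field declared after it must also have one. A minimal sketch of the problem and a fix (field names and defaults are illustrative, not the actual ConfigActor definition):

from dataclasses import dataclass

# Broken: a non-default field declared after a defaulted one raises the
# TypeError from the traceback, at class-definition time.
#
# @dataclass
# class ConfigActor:
#     model: str = "bigscience/bloom-1b7"  # has a default
#     device: str                          # no default -> TypeError
#
# Fixed: declare non-default fields first (or give 'device' a default).
@dataclass
class ConfigActor:
    device: str
    model: str = "bigscience/bloom-1b7"

print(ConfigActor(device="cuda:0"))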

leonselina avatar Mar 14 '23 03:03 leonselina

@leonselina We will be releasing support for Accelerate very soon! We are currently testing the code and will keep you updated when we merge the code!

PierpaoloSorbellini avatar Mar 14 '23 09:03 PierpaoloSorbellini

When will this multi-GPU support be available? Really looking forward to it.

balcklive avatar Mar 17 '23 02:03 balcklive

Also looking forward to it!

bin123apple avatar Mar 17 '23 03:03 bin123apple

Hi everyone @bin123apple @balcklive @TejaGollapudi, you can try PR #306, where DeepSpeed and Accelerate should be working fine. Keep in mind to launch the training with "deepspeed artifacts/main.py .." or "accelerate launch" instead of using "python". If you have any other problems on the matter, let me know!
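
As a quick sanity check (not part of the repo), a script started with "accelerate launch" or "deepspeed" sees one process per GPU through the rank environment variables, whereas plain "python" starts a single process that only ever uses GPU 0:

import os
import torch

# Both "accelerate launch" and "deepspeed" set these variables per process;
# with plain "python" they are absent and everything stays on one GPU.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)

print(f"process {local_rank} of {world_size}, using GPU {local_rank}")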

PierpaoloSorbellini avatar Apr 03 '23 14:04 PierpaoloSorbellini

Hi @PierpaoloSorbellini, I trained Llama 7B with DeepSpeed but got the error "MP=1 but world size is 2". How can I train Llama 7B on multiple GPUs? Because of VRAM limits, maybe I should use model parallelism instead of data parallelism for multi-GPU training. Thanks :)
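
One possible way to fit a 7B model across two GPUs when using the Hugging Face checkpoint format is to shard the layers with a device map at load time. This is a hedged sketch of that approach, not ChatLLaMA's supported training path, and the checkpoint path is a placeholder:

import torch
from transformers import AutoModelForCausalLM

# device_map="auto" lets Accelerate place different layers on different GPUs,
# a simple form of model parallelism that reduces per-GPU VRAM usage.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-7b-hf",      # placeholder path to the converted HF weights
    torch_dtype=torch.float16,  # half precision halves the weight footprint
    device_map="auto",          # split layers across the available GPUs
)
print(model.hf_device_map)      # shows which layers landed on which device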

leonselina avatar Apr 04 '23 07:04 leonselina

@PierpaoloSorbellini Hey, I tried Llama in HF format and used DeepSpeed with --num_gpus=2. The model was loaded twice, and both copies were placed on the rank-0 GPU, which caused a CUDA OOM.

[screenshot attached]

Do you have any ideas on how to fix this problem?
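
For context, a common cause of this symptom is that each launched process moves its copy of the model to the default CUDA device ("cuda" resolves to GPU 0), so rank 0 ends up holding every copy. A minimal sketch of the usual fix, with a stand-in model rather than the actual Llama loading code:

import os
import torch
import torch.nn as nn

# The launcher sets LOCAL_RANK per process; pin each process to its own GPU
# before creating or moving the model.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = nn.Linear(1024, 1024)           # stand-in for the real model
model = model.to(f"cuda:{local_rank}")  # one copy per GPU instead of all on GPU 0
print(f"rank {local_rank}: model on {next(model.parameters()).device}")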

laozhanghahaha avatar Apr 04 '23 10:04 laozhanghahaha