Enabling model parallelism (training 30b on 2x 3090s and beyond)

Open kooshi opened this issue 1 year ago • 15 comments

Does what it says on the tin: multi-GPU users can now choose to use their GPUs for faster training (DDP) or bigger models (MP).

This required a minor change to the transformers library. It has been merged: PR. Just update by reinstalling the transformers module.

This also serves as a workaround for #8
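
For anyone wondering how the DDP/MP choice is made: roughly, finetune.py falls back to model parallelism whenever no distributed launcher has set WORLD_SIZE. A sketch of the logic (variable names approximate the current script; treat it as illustrative):

import os
import torch

gpus = torch.cuda.device_count()  # MP only matters when this is > 1
world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1             # DDP is assumed when a launcher (e.g. torchrun) sets WORLD_SIZE > 1
device_map = "auto"               # MP: let accelerate spread the layers over all visible GPUs
if ddp:
    device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)}  # DDP: one full copy per process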

kooshi avatar Mar 23 '23 05:03 kooshi

Perhaps it doesn't work on Windows? Fresh install of this fork, with MICRO_BATCH_SIZE = 64, only uses a single GPU, even though gpus = torch.cuda.device_count() correctly detects 2 GPUs. ddp remains False because WORLD_SIZE is not set. Setting PIPE_CHUNKS = 1 gives an error:

Traceback (most recent call last):
  File "E:\LLaMA-train\alpaca-lora\finetune.py", line 142, in <module>
    from torch.distributed.pipeline.sync import Pipe
  File "E:\LLaMA-train\installer_files\env\lib\site-packages\torch\distributed\pipeline\sync\__init__.py", line 9, in <module>
    from .pipe import Pipe, WithDevice
  File "E:\LLaMA-train\installer_files\env\lib\site-packages\torch\distributed\pipeline\sync\pipe.py", line 13, in <module>
    from torch.distributed.rpc import RRef
ImportError: cannot import name 'RRef' from 'torch.distributed.rpc' (E:\LLaMA-train\installer_files\env\lib\site-packages\torch\distributed\rpc\__init__.py)

Here are some relevant libs (transformers is also installed from the correct fork git+https://github.com/kooshi/transformers.git@llama-parallelism):

tokenizers              0.13.2
torch                   2.0.0
torchaudio              2.0.0
torchvision             0.15.0
tqdm                    4.65.0
transformers            4.28.0.dev0

HideLord avatar Mar 23 '23 15:03 HideLord

Hm, maybe the RPC bit isn't supported on Windows. Keep PIPE_CHUNKS = 0, and try manually setting max_memory as below.

device_map = "auto" is supposed to distribute the model across gpus, but for some reason, when loading in 8bit, it simply doesn't. You can force the behavior with max_memory.

Each gpu should have roughly (size of model) / gpus. So for 30b at 8bit on 2gpus it should be max_memory={0: "15GB", 1: "15GB"} for 13b, max_memory={0: "7GB", 1: "7GB"} and so on. It's unfortunate that you need to fiddle with it manually... maybe I'll look into why 8bit causes device_map to be ignored tonight.
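
If you'd rather compute that split than hard-code it, a hypothetical helper like this works (not part of finetune.py; pass the result as max_memory to from_pretrained):

import torch

def even_max_memory(model_size_gb: int) -> dict:
    # split an estimated on-GPU model size evenly across all visible GPUs
    n = torch.cuda.device_count()
    return {i: f"{model_size_gb // n}GB" for i in range(n)}

# e.g. 30B in 8-bit is roughly 30 GB of weights:
# even_max_memory(30) -> {0: "15GB", 1: "15GB"} on two GPUs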

Since you're on Windows, use MSI Afterburner or something similar to monitor VRAM usage and make sure the model is loaded across both GPUs before training begins. For anyone on Linux, use nvtop.

kooshi avatar Mar 23 '23 17:03 kooshi

That kind of worked. (see attached file)

HideLord avatar Mar 23 '23 18:03 HideLord

Yeah, that's what you should expect to see. Naive model parallelism can't keep both GPUs busy at the same time; execution just bounces back and forth between them. But it does mean you can load larger models or use larger batch sizes. Some newer frameworks have fancy tricks to reduce the inefficiency, but this is the simplest case. See here for more info: https://pytorch.org/docs/stable/pipeline.html
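
For reference, the Pipe API from that page looks roughly like this as a toy, illustrative example (not the code in this PR; it needs the RPC framework initialized, which is exactly the import that failed on Windows above):

import os
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe requires the RPC framework even in a single-process setup
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)

# toy two-stage model, one stage per GPU
stage0 = nn.Linear(1024, 1024).to("cuda:0")
stage1 = nn.Linear(1024, 1024).to("cuda:1")
model = Pipe(nn.Sequential(stage0, stage1), chunks=4)  # split each mini-batch into 4 micro-batches

x = torch.randn(8, 1024, device="cuda:0")
out = model(x).local_value()  # forward returns an RRef; unwrap it to get the tensor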

The Pipeline should be one of those tricks, but I don't think I fully implemented it. I'll need to play with that some more later.

But for now, yeah, looks like it's working as expected for you.

kooshi avatar Mar 23 '23 19:03 kooshi

I found the root cause of the device_map="auto" discrepancy in the transformers repo, so I'm going to mark this PR as a draft until I get that fixed and merged.

kooshi avatar Mar 24 '23 16:03 kooshi

Is it possible to support DeepSpeed stage 3 parameter partitioning to fit a large model across multiple GPUs? Would it be faster than the current naive (non-overlapping) pipeline parallelism?

sgsdxzy avatar Mar 25 '23 18:03 sgsdxzy

I'm not sure; I'll need to look into DeepSpeed more. I played with it for a minute and I think it didn't support 8-bit. I'll add it to my list of things to look at, because better parallelism would be awesome. I mostly know how to get full pipelining working, but DeepSpeed would be more valuable.

kooshi avatar Mar 25 '23 19:03 kooshi

How are you planning to implement full pipelining? I searched for examples and docs, and everything seems to lead to modifying the LLaMA implementation in transformers, which I would consider a last resort.

sgsdxzy avatar Mar 25 '23 19:03 sgsdxzy

Correct. It's not trivial, but it's not terrible either. I started working on a quick and dirty experiment of it the other day. It's in the llama-parallelism branch of my transformers fork.

I stopped when I realized I also need to batch the inputs into microbatches in a single tensor. I was also using the pipeline for a little more than what it was designed for, so it was breaking in weird ways. Please do take a look if you're interested though, be warned it's very hacky and broken.

kooshi avatar Mar 25 '23 23:03 kooshi

Hi,

Thanks for this work!

I'm experimenting with multiple configs to find the best match for my use cases. Linux, 2x 3090. I'm able to train 7B and 13B on both of them with DDP.

I'm now trying to train the 30b, but I keep getting OOM.

  • transformers was upgraded
  • WORLD_SIZE is 1, and I made sure DDP was off
  • 2 GPUs detected
  • I tried to force 15GB/15GB as max_memory

Still, during "Loading checkpoint shards:" it breaks with OOM, having filled the first GPU up while the second one is almost unused.

Any idea what I could do wrong?

AngainorDev avatar Mar 26 '23 13:03 AngainorDev

@AngainorDev how did you force max_memory? I edited finetune.py line 78 to be

model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    device_map=device_map,
    max_memory={0: "11776MB", 1: "11776MB", 2: "11776MB"},
)

And I can train 30B on 3x 2080Ti 22G with micro_batch_size=16. But one epoch would take >30h because naive model-parallel training is very inefficient.

sgsdxzy avatar Mar 26 '23 14:03 sgsdxzy

@AngainorDev I just pushed a change that references my fork of transformers. I was hoping they would merge the PR in quickly, but since they're a company, it seems like they won't get to it till Monday. To install it,

git pull
pip uninstall transformers
pip install -r transformers.txt

With that, you won't need a hard-coded max_memory; you can just use the "auto" device map for a perfect distribution.
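
For reference, the model load in finetune.py should then reduce to something like this (a sketch; the path is a placeholder for your converted weights):

from transformers import LlamaForCausalLM

base_model = "/path/to/llama-30b-hf"  # placeholder
model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,   # requires bitsandbytes
    device_map="auto",   # with the patched transformers, this balances the 8-bit model across GPUs
)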

kooshi avatar Mar 26 '23 14:03 kooshi

Correct. It's not trivial, but it's not terrible either. I started working on a quick and dirty experiment of it the other day. It's in the llama-parallelism branch of my transformers fork.

I stopped when I realized I also need to batch the inputs into microbatches in a single tensor. I was also using the pipeline for a little more than what it was designed for, so it was breaking in weird ways. Please do take a look if you're interested though, be warned it's very hacky and broken.

I find DeepSpeed pipeline parallelism very promising: you just need to change the input and output of each layer to a tuple of tensors, and DeepSpeed can do the rest for you, including micro-batching, etc. It has much more relaxed constraints than PyTorch Pipe: you don't need to express the model as an nn.Sequential (just a list of Python callables), each layer does not need to be an nn.Module (any Python callable works), and the input/output can be a tuple of tensors, not limited to one tensor. Because LLaMA has only one layer type, LlamaDecoderLayer, I think it could be relatively easy to wrap the layer in something that simply packs and unpacks its parameters as tuples. Are you interested in implementing this? I might try to do it as well, but I am new to ML (I just installed torch weeks ago), so it might take me a long time before it can work.
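
A rough sketch of that wrapper idea (illustrative only: DecoderLayerPipe is a hypothetical name, the attention mask handling is simplified, and position ids / KV caching are omitted):

import torch.nn as nn
from deepspeed.pipe import PipelineModule

class DecoderLayerPipe(nn.Module):
    # hypothetical wrapper: pack/unpack LlamaDecoderLayer inputs and outputs as tensor tuples
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, inputs):
        hidden_states, attention_mask = inputs  # tuple handed over from the previous stage
        hidden_states = self.layer(hidden_states, attention_mask=attention_mask)[0]
        return hidden_states, attention_mask    # tuple handed to the next stage

# stages = [embed_fn] + [DecoderLayerPipe(l) for l in llama.model.layers] + [norm_and_head_fn]
# pipe = PipelineModule(layers=stages, num_stages=2, loss_fn=loss_fn)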

sgsdxzy avatar Mar 26 '23 14:03 sgsdxzy

I just updated to git+https://github.com/kooshi/transformers.git@balanced_memory_8bit

how did you force max_memory? I edited finetune.py line 78 to be

I used max_memory={0: "15GB", 1: "15GB"},

This seems to have no effect: GPU 0 takes it all and I get an OOM at 70% of model loading.

AngainorDev avatar Mar 26 '23 14:03 AngainorDev

Sounds like it's not even seeing the second gpu as available or something. Make sure CUDA_VISIBLE_DEVICES is set correctly.

kooshi avatar Mar 26 '23 17:03 kooshi

Yeah,

But torch.cuda.device_count() correctly detects the 2 GPUs. CUDA_VISIBLE_DEVICES was not set; I explicitly set it to CUDA_VISIBLE_DEVICES=0,1, no change.

The second GPU gets a bit of VRAM used when running, around 1GB. Both are successfully used with DDP on smaller models.

AngainorDev avatar Mar 26 '23 19:03 AngainorDev

My second PR for transformers was merged in, so now the only things required to use model parallelism are reinstalling transformers and merging the few lines left in this PR. I'm not sure what's going on in @AngainorDev's case, because it's behaving as if it's ignoring both the proven manual max_memory fix and the updated load_in_8bit logic in transformers. I have to imagine something is configured incorrectly or is somehow overriding the correct behavior.

@AngainorDev my next suggestion would be to attempt a clean slate. Set up a brand new conda environment, install the latest supported libraries, and run this code, unmodified, just to see if it can work at all before changes.

This PR is ready to be merged.

kooshi avatar Mar 27 '23 15:03 kooshi

Thanks for the follow-up. Agreed, something could be broken in my setup; I'll start from a clean one next time I try. Thanks!

AngainorDev avatar Mar 27 '23 15:03 AngainorDev

I already used this update to train with MP and it works well! I trained a 13B model on 2x 3090 with cutoff len 512 and batch size 24.

KohakuBlueleaf avatar Mar 28 '23 09:03 KohakuBlueleaf

I successfully finetuned the 30B model on multiple GPUs with pipeline parallelism, but when I set load_in_8bit=False it causes a RuntimeError:

  File "/home/usr/project/alpaca-lora/finetune.py", line 288, in <module>
    fire.Fire(train)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/usr/project/alpaca-lora/finetune.py", line 255, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/media/data/6/usr/tmp/transformers/src/transformers/trainer.py", line 1636, in train
    return inner_training_loop(
  File "/media/data/6/usr/tmp/transformers/src/transformers/trainer.py", line 1903, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/trainer.py", line 2649, in training_step
    loss = self.compute_loss(model, inputs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/trainer.py", line 2681, in compute_loss
    outputs = model(**inputs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/peft/peft_model.py", line 530, in forward
    return self.base_model(
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 614, in forward
    layer_outputs = decoder_layer(
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 309, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 209, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/peft/tuners/lora.py", line 350, in forward
    result += self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

I hope for help, thanks!

AAAZSF avatar Mar 29 '23 09:03 AAAZSF

I successfully finetuned the 30B model on multiple GPUs with pipeline parallelism, but when I set load_in_8bit=False it causes a RuntimeError (traceback above).

Can you fit the fp16 model in your VRAM? It seems you don't have enough VRAM and some layers are being put on the CPU.

sgsdxzy avatar Mar 29 '23 09:03 sgsdxzy

I successfully finetuned the 30B model on multiple GPUs with pipeline parallelism, but when I set load_in_8bit=False it causes a RuntimeError (traceback above).

Can you fit the fp16 model in your VRAM? It seems you don't have enough VRAM and some layers are being put on the CPU.

Sorry, I forgot to say that I set load_in_8bit=False on the 7B model. I tested the 7B fp16 model on 2x 24G GPUs, so I think memory is enough. More detail while running: nvidia-smi (screenshot attached) and the device map:

>>> model.hf_device_map
{'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 1, 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 1, 'model.layers.24': 1, 'model.layers.25': 1, 'model.layers.26': 1, 'model.layers.27': 1, 'model.layers.28': 1, 'model.layers.29': 1, 'model.layers.30': 1, 'model.layers.31': 1, 'model.norm': 1, 'lm_head': 1}

AAAZSF avatar Mar 29 '23 10:03 AAAZSF

Make sure you have something like model.parallized = True set (check the changed files), or your model will blow up.

And this error is not caused by OOM.
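
For reference, the flags KohakuBlueleaf means look roughly like this in finetune.py (attribute names are my best reading of the changed files; verify against the diff):

import torch

# snippet from around the model setup (ddp and model are defined earlier in the script)
if not ddp and torch.cuda.device_count() > 1:
    # stops the HF Trainer from wrapping the already-sharded model in its own DataParallel
    model.is_parallelizable = True
    model.model_parallel = True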

KohakuBlueleaf avatar Mar 29 '23 14:03 KohakuBlueleaf

Hi,

Thanks for this work!

I'm experimenting multiple configs to find the best matches for my use cases. Linux, 2x3090. I'm able to train 7b and 13b on both of them with ddp.

I'm now trying to train the 30b, but I keep getting OOM.

  • transformers was upgraded
  • world_size to 1, I made sure ddp was off.
  • 2 gpus detected
  • I tried to force 15GB/15GB as max_memory

Still, while "Loading checkpoint shards:" it breaks with OOM, having filled the first GPU up, second one is almost unused.

Any idea what I could do wrong?

Are you using torchrun?

RunhuiWang avatar Apr 14 '23 19:04 RunhuiWang

Does what it says on the tin: multi-GPU users can now choose to use their GPUs for faster training (DDP) or bigger models (MP).

Could you provide a command line example that uses model parallelism on multiple GPUs? I have tried

CUDA_VISIBLE_DEVICES=0,1 python finetune.py --base_model '/data/980pro2tb/LLAMA-hf/30B' --data_path 'yahma/alpaca-cleaned' --output_dir './lora-alpaca'

The model was split across the two GPUs about evenly, but I got the error "../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [30,0,0], thread: [96,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed."

If I train a smaller model on a single GPU, the error doesn't show up. DDP also works well on 2 GPUs.

I have also tried different versions of CUDA, NVIDIA drivers, transformers, bitsandbytes, and LLaMA models (even converted from the original weights), but the error is still there.

RunhuiWang avatar Apr 14 '23 20:04 RunhuiWang

Yeah... this is new. It was also reported here: https://github.com/huggingface/transformers/issues/22546

One guy there noticed the only difference was his driver version: https://github.com/huggingface/transformers/issues/22546#issuecomment-1498348442

I haven't seen it yet, but I haven't been training recently. I may have some time to check it out this weekend, but it's likely beyond my knowledge.

kooshi avatar Apr 14 '23 22:04 kooshi

Thanks for pointing me to that thread. I forgot to mention that I was using 4090s. I had also checked that thread earlier and tried his driver version, but no luck on the 4090s. MP works well on 2x 3090s though.

RunhuiWang avatar Apr 14 '23 22:04 RunhuiWang

Yeah... this is new. It was also reported here: huggingface/transformers#22546

One guy there noticed the only difference was his driver version: huggingface/transformers#22546 (comment)

I haven't seen it yet, but I haven't been training recently. I may have some time to check it out this weekend, but it's likely beyond my knowledge.

I think this issue might be related to Ubuntu 22.04. I downgraded my system to Ubuntu 20.04 and everything works fine. Thanks a lot for your effort on this project!

RunhuiWang avatar Apr 18 '23 21:04 RunhuiWang

Does what it says on the tin: multi-GPU users can now choose to use their GPUs for faster training (DDP) or bigger models (MP).

So it's faster training (DDP) "or" bigger models (MP). I have been searching for ways to do DDP "and" MP together, but no luck so far. Neither DeepSpeed nor torchrun gives a clear clue.

kongbohu avatar Apr 25 '23 16:04 kongbohu

So it's faster training (DDP) "or" bigger models (MP). I have been searching for ways to do DDP "and" MP together, but no luck so far. Neither DeepSpeed nor torchrun gives a clear clue.

DeepSpeed does support MP, but it seems to be only for the inference part -- I hope someone can correct me if I'm wrong.

kongbohu avatar Apr 25 '23 16:04 kongbohu