alpaca-lora
Enabling model parallelism (training 30b on 2x 3090s and beyond)
Does what it says on the tin. Multi-GPU users can now choose to use their GPUs for faster training (DDP) or bigger models (MP).
This required a minor change to the transformers library. It has been merged: PR. Just update by reinstalling the transformers module.
This also serves as a workaround for #8
Perhaps it doesn't work on Windows?
Fresh install of this fork, with MICRO_BATCH_SIZE = 64, only uses a single GPU, even though gpus = torch.cuda.device_count() correctly detects 2 GPUs. ddp remains False because WORLD_SIZE is not set.
Setting PIPE_CHUNKS = 1 gives an error:
Traceback (most recent call last):
File "E:\LLaMA-train\alpaca-lora\finetune.py", line 142, in <module>
from torch.distributed.pipeline.sync import Pipe
File "E:\LLaMA-train\installer_files\env\lib\site-packages\torch\distributed\pipeline\sync\__init__.py", line 9, in <module>
from .pipe import Pipe, WithDevice
File "E:\LLaMA-train\installer_files\env\lib\site-packages\torch\distributed\pipeline\sync\pipe.py", line 13, in <module>
from torch.distributed.rpc import RRef
ImportError: cannot import name 'RRef' from 'torch.distributed.rpc' (E:\LLaMA-train\installer_files\env\lib\site-packages\torch\distributed\rpc\__init__.py)
Here are some relevant libs (transformers is also installed from the correct fork, git+https://github.com/kooshi/transformers.git@llama-parallelism):
tokenizers 0.13.2
torch 2.0.0
torchaudio 2.0.0
torchvision 0.15.0
tqdm 4.65.0
transformers 4.28.0.dev0
Hm, maybe the rpc bit isn't supported on Windows. Keep PIPE_CHUNKS = 0, and try manually setting max_memory as below.
device_map = "auto" is supposed to distribute the model across GPUs, but for some reason, when loading in 8bit, it simply doesn't. You can force the behavior with max_memory.
Each GPU should have roughly (size of model) / (number of GPUs). So for 30b at 8bit on 2 GPUs it should be max_memory={0: "15GB", 1: "15GB"}, for 13b, max_memory={0: "7GB", 1: "7GB"}, and so on. It's unfortunate that you need to fiddle with it manually... maybe I'll look into why 8bit causes device_map to be ignored tonight.
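For reference, the load call with that override looks something like this inside finetune.py (a sketch using the 30b / 2-GPU numbers above; swap in your own sizes):

from transformers import LlamaForCausalLM

# Force an even 8-bit split across two GPUs; "15GB" per card follows the
# (size of model) / (number of GPUs) rule of thumb for 30b.
model = LlamaForCausalLM.from_pretrained(
    base_model,                 # path or hub id of the base LLaMA weights
    load_in_8bit=True,
    device_map="auto",
    max_memory={0: "15GB", 1: "15GB"},
)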
Since you're on Windows, use MSI Afterburner or something similar to monitor the VRAM usage to make sure the model is loaded across both before training begins. For anyone on Linux, use nvtop.
That kind of worked.
Yeah, that's what you should expect to see. Model parallelism can't load both GPUs all the time. It just goes back and forth between them, but it means you can load larger models or use larger batch sizes. Some new frameworks have some fancy tricks to reduce the inefficiency, but this is just the simplest case. See here for more info: https://pytorch.org/docs/stable/pipeline.html
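For reference, the basic pattern from that doc looks roughly like this (a toy sketch adapted from the PyTorch example, not this repo's code; note that Pipe needs RPC initialized, which is also why the RRef import fails on Windows above):

import os
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)       # Pipe uses RRef internally, so RPC is required even single-process

fc1 = nn.Linear(16, 8).cuda(0)                     # first stage on GPU 0
fc2 = nn.Linear(8, 4).cuda(1)                      # second stage on GPU 1
model = Pipe(nn.Sequential(fc1, fc2), chunks=8)    # each batch is split into 8 micro-batches

out = model(torch.rand(32, 16).cuda(0)).local_value()  # forward returns an RRef to the output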
The Pipeline should be one of those tricks, but I don't think I fully implemented it. I'll need to play with that some more later.
But for now, yeah, looks like it's working as expected for you.
I found the root cause of the device_map auto discrepancy in the transformers repo, so I'm going to mark this as a draft until I get that fixed and merged.
Is it possible to support DeepSpeed stage 3 parameter partitioning to fit a large model across multiple GPUs? Would it be faster than the current naive (non-overlapping) pipeline parallelism?
I'm not sure, I'll need to look into Deepspeed more. I had played with it for a minute and I think it didn't support 8bit. I'll add it to my list of things to look at, because better parallelism would be awesome. I mostly know how to get full pipelining working, but Deepspeed would be more valuable.
How are you planning to implement the full pipeline? I searched for examples and docs, and I think they all lead to modifying the implementation of LLaMA in transformers, which I would consider a last resort.
Correct. It's not trivial, but it's not terrible either. I started working on a quick and dirty experiment of it the other day. It's in the llama-parallelism branch of my transformers fork.
I stopped when I realized I also need to batch the inputs into microbatches in a single tensor. I was also using the pipeline for a little more than what it was designed for, so it was breaking in weird ways. Please do take a look if you're interested though, be warned it's very hacky and broken.
Hi,
Thanks for this work!
I'm experimenting with multiple configs to find the best matches for my use cases. Linux, 2x3090. I'm able to train 7b and 13b on both of them with ddp.
I'm now trying to train the 30b, but I keep getting OOM.
- transformers was upgraded
- world_size to 1, I made sure ddp was off.
- 2 gpus detected
- I tried to force 15GB/15GB as max_memory
Still, during "Loading checkpoint shards:" it breaks with OOM, having filled up the first GPU while the second one is almost unused.
Any idea what I could do wrong?
@AngainorDev how did you force max_memory? I edited finetune.py line 78 to be
model = LlamaForCausalLM.from_pretrained(
base_model,
load_in_8bit=True,
device_map=device_map,
max_memory={0: "11776MB", 1: "11776MB", 2: "11776MB"}
)
And I can train 30B on 2080Ti 22G x 3 with micro_batch_size=16. But one epoch would take >30h because naive model parallel training is very inefficient.
@AngainorDev I just pushed a change that references my fork of transformers. I was hoping they would merge the PR in quickly, but since they're a company, it seems like they won't get to it till Monday. To install it,
git pull
pip uninstall transformers
pip install -r transformers.txt
With that, you won't need to use a hard coded max_memory, and you can just use "auto" device map for a perfect distribution.
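In other words, after updating, the plain call is enough (sketch, same assumptions as the earlier snippet):

model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    device_map="auto",   # with the fixed 8-bit loading logic this now balances across both GPUs
)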
I find DeepSpeed pipeline parallelism very promising: you just need to change the input and output of each layer to a tuple of tensors, and DeepSpeed can do the rest for you, including micro-batching, etc. It has much more relaxed constraints than PyTorch Pipe: you don't need to express the model as an nn.Sequential (just a list of Python callables), each layer does not need to be an nn.Module (any Python callable works), and the input/output can be a tuple of tensors, not limited to one tensor. Because LLaMA has only one layer type, LlamaDecoderLayer, I think it could be relatively easy to wrap the layer in a wrapper that simply packs and unpacks parameters as tuples, as sketched below. Are you interested in implementing this? I might try to do it as well, but I am new to ML (I just installed torch weeks ago) so it might take me a long time before it can work.
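To make that concrete, the kind of wrapper I have in mind looks roughly like this (untested sketch; DecoderLayerPipe is a made-up name, and the real LlamaDecoderLayer takes more arguments than shown):

import torch.nn as nn

class DecoderLayerPipe(nn.Module):
    # Adapt one LlamaDecoderLayer to DeepSpeed's pipeline interface,
    # which passes a single tuple of tensors between stages.
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, inputs):
        hidden_states, attention_mask = inputs          # unpack the tuple from the previous stage
        out = self.layer(hidden_states, attention_mask=attention_mask)
        return (out[0], attention_mask)                 # repack so the next stage sees the same tuple shape

The wrapped layers could then be handed to deepspeed.pipe.PipelineModule as a plain list, but again, I haven't tried it.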
I just updated to git+https://github.com/kooshi/transformers.git@balanced_memory_8bit
how did you force max_memory? I edited finetune.py line 78 to be
I used max_memory={0: "15GB", 1: "15GB"}, but this seems to have no effect: GPU 0 takes it all and OOMs at 70% of model loading.
Sounds like it's not even seeing the second gpu as available or something. Make sure CUDA_VISIBLE_DEVICES is set correctly.
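For example, a quick check from inside the same environment should list both cards (trivial sketch):

import torch

print(torch.cuda.device_count())                     # should print 2
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, round(props.total_memory / 1024**3, 1), "GiB")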
Yeah, but torch.cuda.device_count() correctly detects the 2 GPUs. CUDA_VISIBLE_DEVICES was not set; I explicitly set it to CUDA_VISIBLE_DEVICES=0,1, no change.
The second one gets a bit of VRAM used when running, around 1GB. Both are successfully used with ddp on smaller models.
My second PR for transformers was merged in, so now the only thing required to use model parallelism is reinstalling transformers and merging the few lines left. I'm not sure what's going on with @AngainorDev, because in that case it's behaving as if it's ignoring both proven fixes: the manual max_memory and the updated load_in_8bit logic in transformers. I have to imagine something is configured incorrectly or is somehow overriding the correct behavior.
@AngainorDev my next suggestion would be to attempt a clean slate. Set up a brand new conda environment, install the latest supported libraries, and run this code, unmodified, just to see if it can work at all before changes.
This PR is ready to be merged.
Thanks for the follow up. Agreed, something could be broken in my setup; I'll start from a clean one next time I try, thanks!
I already use this update to train with MP and it works well! I train a 13B model on 2x3090 with cutoff len 512 + batch size 24.
I successfully finetuned the 30b model on multiple GPUs with pipeline parallelism. But when I set load_in_8bit=False, it causes a RuntimeError:
File "/home/usr/project/alpaca-lora/finetune.py", line 288, in <module>
fire.Fire(train)
File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/usr/project/alpaca-lora/finetune.py", line 255, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/media/data/6/usr/tmp/transformers/src/transformers/trainer.py", line 1636, in train
return inner_training_loop(
File "/media/data/6/usr/tmp/transformers/src/transformers/trainer.py", line 1903, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/media/data/6/usr/tmp/transformers/src/transformers/trainer.py", line 2649, in training_step
loss = self.compute_loss(model, inputs)
File "/media/data/6/usr/tmp/transformers/src/transformers/trainer.py", line 2681, in compute_loss
outputs = model(**inputs)
File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/peft/peft_model.py", line 530, in forward
return self.base_model(
File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/media/data/6/usr/tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 765, in forward
outputs = self.model(
File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/media/data/6/usr/tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 614, in forward
layer_outputs = decoder_layer(
File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/media/data/6/usr/tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 309, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/media/data/6/usr/tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 209, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/peft/tuners/lora.py", line 350, in forward
result += self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling
File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
I hope for help, thanks!
Can you fit the fp16 model in your VRAM? It seems you don't have enough VRAM and some layers are put on the CPU.
Sorry, I forgot to say that I set load_in_8bit=False in the 7b model. I'm testing the 7b fp16 model on 2x24G GPUs, so I think memory is enough.
More detail during running:
nvidia-smi
>>> model.hf_device_map
{'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 1, 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 1, 'model.layers.24': 1, 'model.layers.25': 1, 'model.layers.26': 1, 'model.layers.27': 1, 'model.layers.28': 1, 'model.layers.29': 1, 'model.layers.30': 1, 'model.layers.31': 1, 'model.norm': 1, 'lm_head': 1}
Make sure you have something like model.parallized = True set (check the changed files) or your model will blow up. And this error is not caused by OOM.
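From memory, the relevant lines in the changed finetune.py look something like the following; the attribute names here are recalled, not copied, so verify them against the actual diff:

# Tell the HF Trainer the model is already split across GPUs,
# so it doesn't try to wrap it in DataParallel on top of that.
if not ddp and torch.cuda.device_count() > 1:
    model.is_parallelizable = True
    model.model_parallel = True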
I'm now trying to train the 30b, but I keep getting OOM. While "Loading checkpoint shards:" it breaks with OOM, having filled up the first GPU while the second one is almost unused.
Are you using torchrun?
Does what it says on the tin. Multi-GPU users can now choose to use their GPUs for faster training (DDP) or bigger models (MP).
This required a minor change to the transformers library. It has been merged: PR. Just update by reinstalling the transformers module.
This also serves as a workaround for #8
Could you provide a command line example that uses model parallelism on multiple GPU? I have tried
CUDA_VISIBLE_DEVICES=0,1 python finetune.py --base_model '/data/980pro2tb/LLAMA-hf/30B' --data_path 'yahma/alpaca-cleaned' --output_dir './lora-alpaca'
The model was split into two GPUs about evenly, but I got the error "../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [30,0,0], thread: [96,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed."
If I train a smaller model on a single GPU, then the error won't show up. DDP also works well on 2 GPUs.
I have also tried different versions of CUDA, nvidia-drivers, transformers, bitsandbytes, and llama models (even converted from original weights) but this error is still here.
Yeah... this is new. It was also reported here: https://github.com/huggingface/transformers/issues/22546
One guy there noticed the only difference was his driver version: https://github.com/huggingface/transformers/issues/22546#issuecomment-1498348442
I haven't seen it yet, but I haven't been training recently. I may have some time to check it out this weekend, but it's likely beyond my knowledge.
Thanks for pointing me to that thread. I forgot to mention that I was using 4090s. I have also checked that thread earlier and tried his driver version, but no luck on 4090s. MP works well on 2x 3090s though.
I think this issue might be related to Ubuntu 22.04. I downgraded my system to Ubuntu 20.04 and everything works fine. Thanks a lot for your effort in this project!
So this gives us faster training (DDP) "OR" bigger models (MP). I have been searching for ways to get DDP "AND" MP together, but no luck so far. Neither DeepSpeed nor torchrun gives a clear clue.
DeepSpeed does support MP, but it seems only for inference. I hope someone can correct me if I'm wrong.