Int8 training example error with multi-gpu - flan-t5
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
Hi @djaym7, can you please share a reproducible script? Thanks!
Rename to temp.ipynb.
To reproduce, option 2 is to comment out the os.environ['CUDA_VISIBLE_DEVICES'] = '0' line in this example https://github.com/huggingface/peft/blob/main/examples/int8_training/Finetune_flan_t5_large_bnb_peft.ipynb and run it on a multi-GPU instance.
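For reference, a minimal sketch of that change, based on the loading call in the linked notebook:

# os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # <- the notebook line to comment out, so all GPUs stay visible

from transformers import AutoModelForSeq2SeqLM

# with several visible GPUs, device_map="auto" shards the 8-bit model across them,
# which is what leads to the cuda:0 / cuda:1 error reported above
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-large",
    load_in_8bit=True,
    device_map="auto",
)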
Any update on this? @djaym7 did you manage to fix it?
No I didn't
Hey @djaym7 @macabdul9, in a multi-GPU setup you need to add the following:
setattr(model, 'model_parallel', True)
setattr(model, 'is_parallelizable', True)
right after creating the model. This should solve your issue.
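For context, a minimal sketch of where those lines go, assuming the 8-bit flan-t5 loading call from the linked notebook; the two flags tell Trainer that the model is already split across GPUs, so it skips the torch.nn.DataParallel wrapping that raises the cuda:0 / cuda:1 error:

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-large",
    load_in_8bit=True,
    device_map="auto",
)
# mark the model as model-parallel right after creating it
setattr(model, "model_parallel", True)
setattr(model, "is_parallelizable", True)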
@younesbelkada @pacman100 the above-mentioned change is not working. I am getting the following error:
AttributeError: 'T5Stack' object has no attribute 'first_device'.
Can you check once?
@pacman100 I believe this is because T5 still uses the old parallelize API. Could you try adding just setattr(model, 'model_parallel', True) instead of the 2 lines?
@younesbelkada still the same error after adding just setattr(model, 'model_parallel', True)
Hi @djaym7 @Shreyans92, I managed to reproduce this; https://github.com/huggingface/transformers/pull/22532 and another PR that I will share soon will fix these bugs.
Hi @djaym7 @Shreyans92 @macabdul9 ,
Now, if you install transformers from source, T5 multi-GPU should work!
pip install git+https://github.com/huggingface/transformers.git
The script I ran is below:
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model_id = "google/flan-t5-base"

# load in 8-bit and let Accelerate shard the model across all visible GPUs
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = prepare_model_for_int8_training(model)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",  # flan-t5 is an encoder-decoder model
)
model = get_peft_model(model, config)

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

trainer = Trainer(
    model=model,
    train_dataset=data["train"],
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=3,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
    ),
    # mlm=False makes the labels a copy of the input_ids -- fine for a quick smoke test
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # avoid warnings with gradient checkpointing; re-enable for inference
trainer.train()
Note that there is no longer any need to call
setattr(model, 'model_parallel', True)
setattr(model, 'is_parallelizable', True)
@younesbelkada thanks for the fix. I tried running the same and have one observation: when I run flan-t5-xxl 8-bit finetuning on a g5.8xlarge, it shows 27 hrs to complete an epoch, but when I move to a g5.12xlarge it shows 35 hrs for one epoch on the same dataset. I thought multi-GPU training would be faster. Is this expected?
Hi @Shreyans92, I think it's because CPU offloading might be enabled for some reason. Could you try to print model.hf_device_map?
Hi @younesbelkada, here is the device_map:
{'shared': 0, 'decoder.embed_tokens': 0, 'encoder.embed_tokens': 0, 'encoder.block.0': 0, 'encoder.block.1': 0, 'encoder.block.2': 0, 'encoder.block.3': 0, 'encoder.block.4': 0, 'encoder.block.5': 0, 'encoder.block.6': 0, 'encoder.block.7': 0, 'encoder.block.8': 0, 'encoder.block.9': 0, 'encoder.block.10': 0, 'encoder.block.11': 0, 'encoder.block.12': 1, 'encoder.block.13': 1, 'encoder.block.14': 1, 'encoder.block.15': 1, 'encoder.block.16': 1, 'encoder.block.17': 1, 'encoder.block.18': 1, 'encoder.block.19': 1, 'encoder.block.20': 1, 'encoder.block.21': 1, 'encoder.block.22': 1, 'encoder.block.23': 1, 'encoder.final_layer_norm': 1, 'encoder.dropout': 1, 'decoder.block.0': 1, 'decoder.block.1': 1, 'decoder.block.2': 2, 'decoder.block.3': 2, 'decoder.block.4': 2, 'decoder.block.5': 2, 'decoder.block.6': 2, 'decoder.block.7': 2, 'decoder.block.8': 2, 'decoder.block.9': 2, 'decoder.block.10': 2, 'decoder.block.11': 2, 'decoder.block.12': 2, 'decoder.block.13': 2, 'decoder.block.14': 3, 'decoder.block.15': 3, 'decoder.block.16': 3, 'decoder.block.17': 3, 'decoder.block.18': 3, 'decoder.block.19': 3, 'decoder.block.20': 3, 'decoder.block.21': 3, 'decoder.block.22': 3, 'decoder.block.23': 3, 'decoder.final_layer_norm': 3, 'decoder.dropout': 3, 'lm_head': 3}
Looks like everything is on GPU only.
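For anyone else checking this, a small sketch for confirming that nothing is offloaded; in an Accelerate device map, a module value of 'cpu' or 'disk' would indicate offloading, while integers are GPU indices:

# list any modules that were placed on CPU or disk instead of a GPU
offloaded = [name for name, dev in model.hf_device_map.items() if dev in ("cpu", "disk")]
print("offloaded modules:", offloaded or "none")
print("devices in use:", set(model.hf_device_map.values()))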
Hello @Shreyans92,
Multi-GPU in this case is a form of model parallelism in which different layers live on different GPUs, also called naive pipelining: only one GPU computes at any given moment while the others idle, hence "naive". This lets you fit big models, but at the cost of efficiency. The communication overhead of copying data/activations between devices, plus the fact that only one GPU is computing while the others idle, is the likely cause of the slowdown you are observing compared to having the entire model on the single 24GB GPU of the g5.8xlarge.
If you can fit the model on a single GPU, you can do Distributed Data Parallelism instead by setting device_map = {"": device}, where device is the GPU index of each process. That way, num_gpus * batch_size samples are processed simultaneously and all GPUs are computing all the time; this is how multi-GPU can accelerate training. I think @younesbelkada can provide an example script, as he has successfully got this working.
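The example script mentioned above isn't in this thread, but a rough sketch of the device_map = {"": device} pattern might look like the following (the flan-t5-xxl checkpoint and the torchrun launch are assumptions), launched with one process per GPU, e.g. torchrun --nproc_per_node=4 train.py:

import os
from transformers import AutoModelForSeq2SeqLM

# torchrun sets LOCAL_RANK for each process (one process per GPU)
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# the whole 8-bit model lives on this process's GPU, so training runs as plain DDP, no pipelining
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xxl",
    load_in_8bit=True,
    device_map={"": local_rank},
)
# ...then prepare_model_for_int8_training / get_peft_model / Trainer as in the script above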
Hi @pacman100, I wonder if it is possible to do naive pipelining across 2 or more nodes (multi-host)?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
With Whisper: ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices
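One possible workaround for that check, assuming the 8-bit model fits on a single GPU (the checkpoint below is only an example): pin the whole model to one device so that hf_device_map contains a single GPU.

from transformers import WhisperForConditionalGeneration

# keep every module on GPU 0 so the model is not spread over multiple devices
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2",
    load_in_8bit=True,
    device_map={"": 0},
)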