Int8 training example error with multi-gpu - flan-t5
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
Hi @djaym7, can you please share a reproducible script? Thanks!
Rename to temp.ipynb.
To reproduce, option 2 is to comment out the os.environ['CUDA_VISIBLE_DEVICES'] = '0' line in this example https://github.com/huggingface/peft/blob/main/examples/int8_training/Finetune_flan_t5_large_bnb_peft.ipynb and run it on a multi-GPU instance.
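For reference, a minimal sketch of that change, based on the loading call in the linked notebook:

# os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # <- the notebook line to comment out, so all GPUs stay visible

from transformers import AutoModelForSeq2SeqLM

# with several visible GPUs, device_map="auto" shards the 8-bit model across them,
# which is what leads to the cuda:0 / cuda:1 error reported above
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-large",
    load_in_8bit=True,
    device_map="auto",
)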
Any update on this? @djaym7 did you manage to fix it?
No I didn't
Hey @djaym7 @macabdul9, in a multi-GPU setup you need to add the following:
setattr(model, 'model_parallel', True)
setattr(model, 'is_parallelizable', True)
right after creating the model. This should solve your issue.
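For context, a minimal sketch of where those lines go, assuming the 8-bit flan-t5 loading call from the linked notebook; the two flags tell Trainer that the model is already split across GPUs, so it skips the torch.nn.DataParallel wrapping that raises the cuda:0 / cuda:1 error:

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-large",
    load_in_8bit=True,
    device_map="auto",
)
# mark the model as model-parallel right after creating it
setattr(model, "model_parallel", True)
setattr(model, "is_parallelizable", True)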
@younesbelkada @pacman100 the above-mentioned change is not working. I am getting the following error:
AttributeError: 'T5Stack' object has no attribute 'first_device'.
Can you check once?
@pacman100 I believe this is because T5 still uses the old parallelize API. Could you try adding just setattr(model, 'model_parallel', True) instead of the 2 lines?
@younesbelkada still the same error after adding just setattr(model, 'model_parallel', True)
Hi @djaym7 @Shreyans92, I managed to reproduce this; https://github.com/huggingface/transformers/pull/22532 and another PR that I will share soon will fix these bugs.
Hi @djaym7 @Shreyans92 @macabdul9 ,
Now, if you install transformers from source, T5 multi-GPU should work!
pip install git+https://github.com/huggingface/transformers.git
The script I ran is below:
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model_id = "google/flan-t5-base"

# load in 8-bit and let Accelerate shard the model across all visible GPUs
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = prepare_model_for_int8_training(model)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",  # flan-t5 is an encoder-decoder model
)
model = get_peft_model(model, config)

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

trainer = Trainer(
    model=model,
    train_dataset=data["train"],
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=3,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
    ),
    # mlm=False makes the labels a copy of the input_ids -- fine for a quick smoke test
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # avoid warnings with gradient checkpointing; re-enable for inference
trainer.train()
Note that there is no longer any need to call
setattr(model, 'model_parallel', True)
setattr(model, 'is_parallelizable', True)
@younesbelkada thanks for the fix. I tried running the same and have one observation: when I run flan-t5-xxl 8-bit finetuning on a g5.8xlarge, it shows 27 hrs to complete an epoch, but when I move to a g5.12xlarge it shows 35 hrs for one epoch on the same dataset. I thought multi-GPU training would be faster. Is this expected?
Hi @Shreyans92, I think it's because CPU offloading might be enabled for some reason. Could you try to print model.hf_device_map?
Hi @younesbelkada, here is the device_map:
{'shared': 0, 'decoder.embed_tokens': 0, 'encoder.embed_tokens': 0, 'encoder.block.0': 0, 'encoder.block.1': 0, 'encoder.block.2': 0, 'encoder.block.3': 0, 'encoder.block.4': 0, 'encoder.block.5': 0, 'encoder.block.6': 0, 'encoder.block.7': 0, 'encoder.block.8': 0, 'encoder.block.9': 0, 'encoder.block.10': 0, 'encoder.block.11': 0, 'encoder.block.12': 1, 'encoder.block.13': 1, 'encoder.block.14': 1, 'encoder.block.15': 1, 'encoder.block.16': 1, 'encoder.block.17': 1, 'encoder.block.18': 1, 'encoder.block.19': 1, 'encoder.block.20': 1, 'encoder.block.21': 1, 'encoder.block.22': 1, 'encoder.block.23': 1, 'encoder.final_layer_norm': 1, 'encoder.dropout': 1, 'decoder.block.0': 1, 'decoder.block.1': 1, 'decoder.block.2': 2, 'decoder.block.3': 2, 'decoder.block.4': 2, 'decoder.block.5': 2, 'decoder.block.6': 2, 'decoder.block.7': 2, 'decoder.block.8': 2, 'decoder.block.9': 2, 'decoder.block.10': 2, 'decoder.block.11': 2, 'decoder.block.12': 2, 'decoder.block.13': 2, 'decoder.block.14': 3, 'decoder.block.15': 3, 'decoder.block.16': 3, 'decoder.block.17': 3, 'decoder.block.18': 3, 'decoder.block.19': 3, 'decoder.block.20': 3, 'decoder.block.21': 3, 'decoder.block.22': 3, 'decoder.block.23': 3, 'decoder.final_layer_norm': 3, 'decoder.dropout': 3, 'lm_head': 3}
Looks like everything is on GPU only.
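For anyone else checking this, a small sketch for confirming that nothing is offloaded; in an Accelerate device map, a module value of 'cpu' or 'disk' would indicate offloading, while integers are GPU indices:

# list any modules that were placed on CPU or disk instead of a GPU
offloaded = [name for name, dev in model.hf_device_map.items() if dev in ("cpu", "disk")]
print("offloaded modules:", offloaded or "none")
print("devices in use:", set(model.hf_device_map.values()))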
Hello @Shreyans92,
Multi-GPU in this case is a form of model parallelism in which different layers live on different GPUs, also called naive pipelining: only one GPU computes at any given moment while the others idle, hence "naive". This lets you fit big models, but at the cost of efficiency. The communication overhead of copying data/activations between devices, plus the fact that only one GPU is computing while the others idle, is the likely cause of the slowdown you are observing compared to having the entire model on the single 24GB GPU of the g5.8xlarge.
If you can fit the model on a single GPU, you can do Distributed Data Parallelism instead by setting device_map = {"": device}, where device is the GPU index of each process. That way, num_gpus * batch_size samples are processed simultaneously and all GPUs are computing all the time; this is how multi-GPU can accelerate training. I think @younesbelkada can provide an example script, as he has successfully got this working.
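The example script mentioned above isn't in this thread, but a rough sketch of the device_map = {"": device} pattern might look like the following (the flan-t5-xxl checkpoint and the torchrun launch are assumptions), launched with one process per GPU, e.g. torchrun --nproc_per_node=4 train.py:

import os
from transformers import AutoModelForSeq2SeqLM

# torchrun sets LOCAL_RANK for each process (one process per GPU)
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# the whole 8-bit model lives on this process's GPU, so training runs as plain DDP, no pipelining
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xxl",
    load_in_8bit=True,
    device_map={"": local_rank},
)
# ...then prepare_model_for_int8_training / get_peft_model / Trainer as in the script above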
Hi @pacman100, I wonder if it is possible to do naive pipelining across 2 or more nodes (multi-host)?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
With Whisper: ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices
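One possible workaround for that check, assuming the 8-bit model fits on a single GPU (the checkpoint below is only an example): pin the whole model to one device so that hf_device_map contains a single GPU.

from transformers import WhisperForConditionalGeneration

# keep every module on GPU 0 so the model is not spread over multiple devices
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2",
    load_in_8bit=True,
    device_map={"": 0},
)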