
Model loading is uneven across GPUs with AutoModelForCausalLM

Open abpani opened this issue 1 year ago • 8 comments

System Info

  • python 3.10.10
  • torch 2.3.1
  • transformers 4.43.2
  • optimum 1.17.1
  • auto_gptq 0.7.1
  • bitsandbytes 0.43.2
  • accelerate 0.33.0

Llama 3.1 8B Instruct gets loaded like this, so I can't even go beyond a batch size of 1 while fine-tuning. [Screenshot 2024-07-24 at 1:54:24 PM] [Screenshot 2024-07-24 at 1:57:39 PM]

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

import sys, gc, torch, random, os
import time
import numpy as np
import pandas as pd
from datasets import load_dataset, Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    GPTQConfig,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from trl import SFTTrainer
import wandb

wandb.init(mode = 'disabled')

CONTEXT_LENGTH = 8192
output_dir = "outputs_mi"

model_id = "./llama_models/Meta-Llama-3.1-8B-Instruct-gptq-4bit/"

# AutoTokenizer has no `max_seq_length` argument; `model_max_length` is the supported kwarg
tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length = CONTEXT_LENGTH)
tokenizer.add_eos_token = True
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token

# "auto" lets accelerate decide on which GPU each module is placed
model = AutoModelForCausalLM.from_pretrained(model_id, device_map = 'auto')

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)


config = LoraConfig(
    r = 64,
    lora_alpha = 64,
    target_modules=["k_proj","o_proj","q_proj","v_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout = 0,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # prints the trainable parameter count itself (returns None)

data_files = {"train": "full_label_train_data.csv", "test":"full_label_test_data.csv"}
dataset = load_dataset("csv", data_files = data_files)

print(dataset)

training_arguments = TrainingArguments(
    output_dir = output_dir,
    num_train_epochs = 100,
    overwrite_output_dir = True,
    per_device_train_batch_size = 1,
    per_device_eval_batch_size = 1,
    gradient_accumulation_steps = 4,
    optim = "paged_adamw_8bit",
    save_strategy = 'epoch',
    # save_steps = 500,
    warmup_ratio = 0.2,
    logging_steps = 2,
    learning_rate = 4e-4,
    # gradient_checkpointing=True,
    # gradient_checkpointing_kwargs={"use_reentrant": True},
    weight_decay = 0.001,
    fp16 = False,
    bf16 = True,
    max_steps= -1,
    max_grad_norm = 0.3, 
    group_by_length = True,
    lr_scheduler_type= "linear",
    use_cpu = False,
    report_to = "tensorboard",
    eval_strategy = "epoch"    
)

Expected behavior

I would like the model to be loaded evenly across GPUs so that I can fine-tune with a larger batch size.

abpani avatar Jul 24 '24 17:07 abpani

Have you tried playing with different parameters of the device_map?

You can read more about it and about customizing it here: https://huggingface.co/docs/transformers/big_models#accelerates-big-model-inference
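
For example, you can cap how much of each GPU the "auto" placement may use, which tends to spread the layers more evenly. A minimal sketch, not tested against your setup; the limits below are placeholders to adjust to your cards and to leave room for training state:

from transformers import AutoModelForCausalLM

model_id = "./llama_models/Meta-Llama-3.1-8B-Instruct-gptq-4bit/"  # path taken from the reproduction above

# max_memory caps what may be placed on each device; whatever does not fit spills to the next GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    max_memory={0: "10GiB", 1: "10GiB", 2: "10GiB", 3: "10GiB"},  # placeholder limits per GPU index
)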

cc @SunMarc I'm trying to find a doc that dives into the different attributes device_map can accept, but I'm not finding one in the transformers docs.

LysandreJik avatar Jul 26 '24 07:07 LysandreJik

Still the same issue. It shows different errors, like tensors loaded on different devices (cuda:0 and cuda:1), with device_map = {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 1, 'model.layers.12': 1, 'model.layers.13': 1, 'model.layers.14': 1, 'model.layers.15': 1, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 2, 'model.layers.20': 2, 'model.layers.21': 2, 'model.layers.22': 2, 'model.layers.23': 2, 'model.layers.24': 2, 'model.layers.25': 2, 'model.layers.26': 2, 'model.layers.27': 3, 'model.layers.28': 3, 'model.layers.29': 3, 'model.layers.30': 3, 'model.layers.31': 3, 'model.norm': 3, 'lm_head': 3}

abpani avatar Jul 26 '24 15:07 abpani

@LysandreJik You can find the details about the device map here: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/blob/main/model.safetensors.index.json

abpani avatar Jul 26 '24 15:07 abpani

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Aug 24 '24 08:08 github-actions[bot]

@LysandreJik I tried what you suggested; still the same issue in a multi-GPU environment.

abpani avatar Aug 25 '24 23:08 abpani

Hey @abpani, the final allocation looks very strange indeed. Can you try device_map = "sequential" and set max_memory? Also, what do you mean by "it shows different errors like loaded in different devices"? Could you share the traceback? Thanks!
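
For reference, that would look roughly like this (a sketch assuming four GPUs; the caps are placeholders and should leave headroom for activations, gradients, and optimizer state):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./llama_models/Meta-Llama-3.1-8B-Instruct-gptq-4bit/",
    device_map="sequential",  # fill GPU 0 up to its cap, then GPU 1, and so on
    max_memory={0: "12GiB", 1: "12GiB", 2: "12GiB", 3: "12GiB"},  # placeholder per-GPU caps
)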

SunMarc avatar Aug 26 '24 14:08 SunMarc

@SunMarc The funny thing is that it does not happen with Mistral models; loading is balanced for them. But with Qwen, Phi, and Llama it's still the same issue.

abpani avatar Aug 27 '24 18:08 abpani

Hey @abpani, the final allocation looks very strange indeed. Can you try device_map = "sequential" and set max_memory? Also, what do you mean by "it shows different errors like loaded in different devices"? Could you share the traceback? Thanks!

I don't have that currently, but the auto device_map should still work fine, since it works perfectly with all Mistral models.

abpani avatar Aug 27 '24 18:08 abpani

Might just be the _no_split_modules setting, or simply the sizes of the models.
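
For context, each architecture declares which blocks must stay on a single device. A quick way to inspect it (assuming `model` is the freshly loaded AutoModelForCausalLM from the reproduction, before the PEFT wrapping):

print(model._no_split_modules)  # Llama models report ["LlamaDecoderLayer"]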

ArthurZucker avatar Aug 28 '24 09:08 ArthurZucker

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Sep 22 '24 08:09 github-actions[bot]

Closing, as I believe you have the balanced option 🤗 Updating the _no_split_modules list is also possible. You can never split completely evenly, since the lm_head is a lot bigger as a single layer than, say, an MLP.
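
For anyone landing here later, a rough sketch of those two options (untested; the path is the one from the reproduction above, and the memory estimate from an empty fp16 model will over-state the footprint of a GPTQ checkpoint, so treat the computed map as a starting point):

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "./llama_models/Meta-Llama-3.1-8B-Instruct-gptq-4bit/"

# Option 1: ask accelerate to balance the layers across all GPUs instead of filling them greedily
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="balanced")

# Option 2: build the map yourself and control which module classes must not be split across devices
config = AutoConfig.from_pretrained(model_id)
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)
device_map = infer_auto_device_map(
    empty_model,
    no_split_module_classes=["LlamaDecoderLayer"],  # keep each decoder block on one GPU
)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device_map)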

ArthurZucker avatar Sep 27 '24 15:09 ArthurZucker