
GPU Allocation Issue (QLoRa + Llama3-8B-IT)

Open DONGRYEOLLEE1 opened this issue 1 year ago • 1 comment

System Info

peft: 0.10.1.dev0
accelerate: 0.30.0
bitsandbytes: 0.43.1
transformers: 4.39.3
GPU: A6000 * 2 (96GB)
nvidia-driver version: 535.171.04
cuda: 11.8

Who can help?

No response

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder
  • [x] My own task or dataset (give details below)

Reproduction

I was training a Llama3-8B-IT model with QLoRA. The training itself succeeded, but GPU memory was not allocated evenly across the two GPUs. Is this a version issue with peft or transformers, or an issue with the graphics driver? On a previous A100 * 8 server the model was spread evenly during training, so I don't know what causes it in this case.

This is my script.

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model

quantization_config = BitsAndBytesConfig(
    load_in_4bit = True, 
    bnb_4bit_compute_dtype = torch.bfloat16, 
    bnb_4bit_quant_type = "nf4", 
    bnb_4bit_use_double_quant = True
)

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
tok.pad_token_id = tok.eos_token_id
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config = quantization_config,
    device_map = 'auto'
)

data = load_dataset("...")

proc_data = data.map(process, remove_columns = data['train'].column_names)  # process() is my own preprocessing function; it builds the 'text' field tokenized below

tokenized_proc_data = proc_data.map(lambda x: tok(x['text'], truncation = True, max_length = 2048), batched = True)
tokenized_proc_data = tokenized_proc_data.remove_columns("text")

lora_config = LoraConfig(
    r = 16,
    lora_alpha = 32,
    lora_dropout = 0.01,
    target_modules = "all-linear"
)

model = get_peft_model(model, lora_config)

train_args_trainer = TrainingArguments(
    num_train_epochs = 3,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 2,
    learning_rate = 2e-8,
    logging_steps = 100,
    warmup_steps = 100,
    save_total_limit = 3,
    output_dir = "llama3-7b-4bit-lora-test2",
    optim = "paged_adamw_32bit",
    bf16 = True,
    report_to = "wandb",
    run_name = "llama3-7b-4bit-lora-test2",
    remove_unused_columns=False
)

model.is_parallelizable = True
model.model_parallel = True

trainer = Trainer(
    model = model,
    tokenizer = tok,
    args = train_args_trainer,
    train_dataset = tokenized_proc_data['train'],
    data_collator = DataCollatorForLanguageModeling(tok, mlm = False)
)

trainer.train()
This is the nvidia-smi output during training:

Wed May  8 06:54:12 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 30%   58C    P2             145W / 300W |  13224MiB / 49140MiB |     40%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 47%   71C    P2             221W / 300W |  32908MiB / 49140MiB |     73%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    207219      C   /data/envs/tt/bin/python                  13090MiB |
|    1   N/A  N/A    207219      C   /data/envs/tt/bin/python                  32774MiB |
+---------------------------------------------------------------------------------------+
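
The imbalance can also be checked from inside the training process; a quick diagnostic sketch (not part of the script above, purely illustrative):

import torch

# Print allocated / reserved memory for every visible GPU, in GiB.
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"cuda:{i}  allocated={allocated:.1f} GiB  reserved={reserved:.1f} GiB")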

Expected behavior

I want the memory usage to be allocated evenly across both GPUs.
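
A more even split can usually be requested explicitly when loading the model, for example with device_map = "balanced" and/or a max_memory map; a minimal sketch, where the per-GPU limits are only illustrative:

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config = quantization_config,
    device_map = "balanced",                  # ask accelerate to spread layers as evenly as possible
    max_memory = {0: "40GiB", 1: "40GiB"},    # illustrative per-GPU caps
)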

DONGRYEOLLEE1 · May 08 '24 07:05

Hmm, hard to say and I can't easily try to reproduce this. Do you already see strange behavior after loading the model, before starting training? If you try without PEFT, do you see the same issue (in case of not having enough memory without PEFT, you could e.g. turn off autograd on most of the layers to "simulate" parameter efficient fine-tuning)? If yes, this could be an accelerate issue.
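
As a rough sketch of those two checks (hypothetical code, assuming the plain transformers Llama implementation and, for the no-PEFT test, a model loaded without 4-bit quantization):

# Right after from_pretrained: see how accelerate distributed the modules.
print(model.hf_device_map)

# "Simulate" parameter-efficient fine-tuning without PEFT: freeze everything,
# then unfreeze only the last couple of transformer blocks before training.
for param in model.parameters():
    param.requires_grad = False
for layer in model.model.layers[-2:]:
    for param in layer.parameters():
        param.requires_grad = True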

BenjaminBossan · May 08 '24 08:05