
GPU Memory Imbalance and OOM Errors During Training

Open DONGRYEOLLEE1 opened this issue 1 year ago • 19 comments

System Info

- `Accelerate` version: 0.30.0
- Platform: Linux-6.5.0-27-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /data/envs/tt/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.2+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 125.62 GB
- GPU type: NVIDIA RTX A6000
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - use_cpu: False
        - debug: True
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - deepspeed_config: {'deepspeed_config_file': '/data/dev/', 'zero3_init_flag': True}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
        - dynamo_config: {'dynamo_backend': 'EAGER', 'dynamo_mode': 'default', 'dynamo_use_dynamic': False, 'dynamo_use_fullgraph': False}

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

I was training the Llama-3-8B-Instruct model with QLoRA. Training started successfully, but GPU memory was not allocated evenly, and as a result I hit an OOM error before completing even 100 steps. Checking GPU memory during training, the imbalance appeared to become even more severe; in my case, GPU 1 used more memory than GPU 0.

I have previously trained with evenly balanced memory on an 8×A100 server, so I am not sure why this is an issue here.

Below are the results of checking the GPU memory with nvidia-smi during training. The memory allocation imbalance is severe!

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 30%   58C    P2             145W / 300W |  13224MiB / 49140MiB |     40%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 47%   71C    P2             221W / 300W |  32908MiB / 49140MiB |     73%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    207219      C   /data/envs/tt/bin/python                  13090MiB |
|    1   N/A  N/A    207219      C   /data/envs/tt/bin/python                  32774MiB |
+---------------------------------------------------------------------------------------+
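For reference, a minimal sketch of how the same per-GPU usage can be checked programmatically from within the training process (this is separate from the training script below):

import torch

# Print allocated / reserved memory for each visible CUDA device (in MiB)
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**2
    reserved = torch.cuda.memory_reserved(i) / 1024**2
    print(f"GPU {i}: allocated {allocated:.0f} MiB, reserved {reserved:.0f} MiB")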

This is my script:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

# 4-bit NF4 quantization config for QLoRA
quantization_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_use_double_quant = True
)

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
tok.pad_token_id = tok.eos_token_id
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config = quantization_config,
    device_map = 'auto'
)

data = load_dataset("...")

proc_data = data.map(process, remove_columns = data['train'].column_names)

tokenized_proc_data = proc_data.map(lambda x: tok(x['text'], truncation = True, max_length = 2048), batched = True)
tokenized_proc_data = tokenized_proc_data.remove_columns("text")

lora_config = LoraConfig(
    r = 32,
    lora_alpha = 32,
    lora_dropout = 0.01,
    target_modules = "all-linear"
)

model = get_peft_model(model, lora_config)

args = TrainingArguments(
    num_train_epochs = 1,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    learning_rate = 2e-8,
    logging_steps = 100,
    warmup_steps = 100,
    save_steps = 100,
    save_total_limit = 3,
    output_dir = "llama3-Ko-test",
    optim = "paged_adamw_8bit",
    bf16 = True,
    report_to = "wandb",
    run_name = "llama3-Ko-test",
)

def formatting_func(x):
    return [x]

model.is_parallelizable = True
model.model_parallel = True

trainer = SFTTrainer(
    model = model,
    args = args,
    train_dataset = tokenized_proc_data['train'],
    formatting_func = formatting_func,
)

trainer.train()

Expected behavior

How can I resolve the GPU memory imbalance issue?

DONGRYEOLLEE1 avatar May 17 '24 08:05 DONGRYEOLLEE1

I changed only the model to Llama 2; although the memory imbalance issue still exists, training runs without OOM under the conditions below.

What is the issue with the Llama3 series models?

How on earth can I fix this issue?


MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # change a model
...

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 34%   60C    P2              93W / 300W |  35556MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 46%   74C    P2             289W / 300W |  46250MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    278782      C   python                                    35422MiB |
|    1   N/A  N/A    278782      C   python                                    45988MiB |
+---------------------------------------------------------------------------------------+

DONGRYEOLLEE1 avatar May 17 '24 13:05 DONGRYEOLLEE1

I believe this is directly related to PEFT/LoRA: when I did Llama-3 full fine-tuning (FFT) without it, I did not get a CUDA OOM on 2x4090s and usage was balanced (using FSDP). cc @SunMarc @Titus-von-Koeller @BenjaminBossan

muellerzr avatar May 17 '24 13:05 muellerzr

Hi @DONGRYEOLLEE1, this is most probably a PEFT issue. After loading the model, is the model distributed evenly across the 2 GPUs?
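For example, a minimal sketch of how the placement chosen by device_map="auto" can be inspected (model here is the loaded model from your script; hf_device_map is set when the model is loaded that way):

from collections import Counter

# hf_device_map is populated when the model is loaded with device_map="auto"
print(model.hf_device_map)
# Number of modules placed on each device
print(Counter(model.hf_device_map.values()))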

SunMarc avatar May 17 '24 16:05 SunMarc

@SunMarc

Hi @DONGRYEOLLEE1, this is most probably a PEFT issue. After loading the model, is the model distributed evenly across the 2 GPUs?

First of all, thank you very much for your reply.

The following shows the GPU memory status right after loading the Llama3 model.

| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 30%   39C    P8              27W / 300W |   2212MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 30%   42C    P8              33W / 300W |   3990MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    314290      C   /data/envs/llm_t/bin/python                2206MiB |
|    1   N/A  N/A    314290      C   /data/envs/llm_t/bin/python                3984MiB |
+---------------------------------------------------------------------------------------+

DONGRYEOLLEE1 avatar May 20 '24 01:05 DONGRYEOLLEE1

@muellerzr

I believe this is directly related to PEFT/LoRA: when I did Llama-3 full fine-tuning (FFT) without it, I did not get a CUDA OOM on 2x4090s and usage was balanced (using FSDP). cc @SunMarc @Titus-von-Koeller @BenjaminBossan

Could you let me know the version of peft you used for fine-tuning?

In my case, I used peft==0.10.0.

DONGRYEOLLEE1 avatar May 20 '24 01:05 DONGRYEOLLEE1

@DONGRYEOLLEE1 I did not use PEFT; that is what I meant by full fine-tuning with FSDP.

muellerzr avatar May 20 '24 08:05 muellerzr

I tried to reproduce but still have very little experience with DeepSpeed, so I may be doing something wrong. When I try to start the script with accelerate launch, I get:

ValueError: You can't train a model that has been loaded with device_map='auto' in any distributed mode
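(As a side note, a commonly used pattern for data-parallel QLoRA training is to avoid device_map='auto' and instead load the full quantized model onto each process's own GPU; a minimal sketch, assuming the script is started with accelerate launch:)

import torch
from accelerate import PartialState
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

# Each data-parallel process loads the whole 4-bit model onto its own GPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=quantization_config,
    device_map={"": PartialState().process_index},
)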

So @DONGRYEOLLEE1 did you just launch with python ...? If I do that, I also get imbalanced memory, but I'm not sure if this is using DS correctly.

when I did Llama-3 full fine-tuning (FFT) without it, I did not get a CUDA OOM on 2x4090s and usage was balanced.

Did you change anything else? As the model is bnb quantized, full fine-tuning should not work, right?

BenjaminBossan avatar May 21 '24 11:05 BenjaminBossan

@BenjaminBossan I needed CPU offloading to get it working, so quite slow but no bnb/quantization was used.

muellerzr avatar May 21 '24 11:05 muellerzr

@BenjaminBossan

I launched it in a Jupyter notebook rather than as a script with python ....

In the end, I solved the issue by using a DeepSpeed + QLoRA example.

I also tried things such as changing the versions of PEFT and Accelerate, but the memory imbalance issue still exists when training in a Jupyter notebook.

The following shows the GPU memory status when using the DS+QLoRA method. (batch_size = 2)

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 43%   67C    P2             212W / 300W |  13532MiB / 49140MiB |     84%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 47%   71C    P2             203W / 300W |  12328MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      4026      C   /data/envs/llm_test/bin/python            13526MiB |
|    1   N/A  N/A      4027      C   /data/envs/llm_test/bin/python            12322MiB |
+---------------------------------------------------------------------------------------+

DONGRYEOLLEE1 avatar May 22 '24 05:05 DONGRYEOLLEE1

In the end, I solved the issue by using a DeepSpeed + QLoRA example.

I also tried things such as changing the versions of PEFT and Accelerate, but the memory imbalance issue still exists when training in a Jupyter notebook.

Hmm, I'm confused, is the issue solved or not? :)

BenjaminBossan avatar May 22 '24 09:05 BenjaminBossan

In the end, I solved the issue by using a DeepSpeed + QLoRA example. I also tried things such as changing the versions of PEFT and Accelerate, but the memory imbalance issue still exists when training in a Jupyter notebook.

Hmm, I'm confused, is the issue solved or not? :)

Oh, this issue wasn't solved for my script.

DONGRYEOLLEE1 avatar May 29 '24 01:05 DONGRYEOLLEE1

Could you show us how you launch the script? Also, from the last nvidia-smi output you posted, memory usage is 13532MiB and 12328MiB. This looks rather fine to me, I wouldn't expect usage to be 100% identical. Or is that referring to something else?

BenjaminBossan avatar May 29 '24 09:05 BenjaminBossan

Could you show us how you launch the script? Also, from the last nvidia-smi output you posted, memory usage is 13532MiB and 12328MiB. This looks rather fine to me, I wouldn't expect usage to be 100% identical. Or is that referring to something else?

My training script is provided in the Reproduction section above.

  1. This is the state of the GPU shortly after the start of training.
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 30%   58C    P2             145W / 300W |  13224MiB / 49140MiB |     40%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 47%   71C    P2             221W / 300W |  32908MiB / 49140MiB |     73%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
  2. This is the state of the GPU just before OOM.
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 34%   60C    P2              93W / 300W |  35556MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 46%   74C    P2             289W / 300W |  46250MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
  3. The 13.3 GB / 12.4 GB figures reflect the GPU status during training with the DeepSpeed + QLoRA approach (it works well, without any imbalance!). While training with DeepSpeed did not show memory imbalance issues, running my script in a Jupyter notebook does result in such an imbalance.

DONGRYEOLLEE1 avatar May 31 '24 07:05 DONGRYEOLLEE1

My training script is provided in the Reproduction section above.

Yes, I mean how do you launch the training script exactly?

3. While training with DeepSpeed did not show memory imbalance issues, running my script in a Jupyter notebook does result in such an imbalance.

Thanks for clarifying. In that case, I don't think it's PEFT related. @muellerzr any idea why this could be? Is some setting not being passed correctly?

BenjaminBossan avatar May 31 '24 09:05 BenjaminBossan

I'd need to see the entire notebook / a full reproducer / how you are launching it with notebook_launcher. There could be some weird things going on with torch, perhaps; I can try to look into this a little.
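(For reference, launching training from a notebook with Accelerate usually goes through notebook_launcher; a minimal sketch, where training_function stands in for the actual training code:)

from accelerate import notebook_launcher

def training_function():
    # build the model and dataloaders, then run the training loop here
    ...

# Spawn one process per GPU from inside the notebook
notebook_launcher(training_function, args=(), num_processes=2)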

muellerzr avatar Jun 06 '24 14:06 muellerzr

I am also encountering this behaviour while trying to fine-tune Llama3-8B using QLoRA. However, in my case I'm not using DeepSpeed (at least there's no deepspeed_config parameter in my accelerate config file). My script is launched with python3.

Here's the output from nvidia-smi during training: [screenshot of nvidia-smi output]

Paul-Richmond avatar Jun 25 '24 15:06 Paul-Richmond

Hi @Paul-Richmond, could you print model.hf_device_map? The imbalance is quite strange since it only uses the second and the third GPU. Could you also share a minimal reproducer? Thanks!

SunMarc avatar Jun 25 '24 16:06 SunMarc

Hi @SunMarc, thanks for the quick reply! I'm running my script on an HPC cluster where I only request 2 GPUs from a node comprising 4 GPUs in total.

Here is a minimal reproducer script:

import os
from dotenv import load_dotenv
import wandb
import huggingface_hub
from datasets import load_dataset
from transformers import (AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          AutoModelForCausalLM,
                          BitsAndBytesConfig,
                          TrainingArguments,
                          Trainer,
                          )
from transformers.optimization import get_cosine_with_min_lr_schedule_with_warmup
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training


def create_labels(examples):
    examples["labels"] = examples["input_ids"].copy()
    return examples


def main():
    load_dotenv()
    HF_TOKEN = os.getenv("HUGGINGFACE_API_KEY")
    WANDB_TOKEN = os.getenv("WANDB_API_KEY")

    huggingface_hub.login(token=HF_TOKEN)
    wandb.login(key=WANDB_TOKEN)

    ds = load_dataset("yelp_review_full", split="train[:73047]")

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
    tokenizer.pad_token = tokenizer.eos_token
    tokenised_ds = ds.map(lambda examples: tokenizer(examples["text"],
                                                     padding="max_length",
                                                     max_length=720,
                                                     truncation=True),
                          batched=True,
                          remove_columns=ds.column_names)

    lm_dataset = tokenised_ds.map(create_labels, batched=True)
    train_dataset = lm_dataset

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    training_args = TrainingArguments(output_dir="hf",
                                      evaluation_strategy="no",
                                      per_device_train_batch_size=24,
                                      per_device_eval_batch_size=24,
                                      max_grad_norm=1.0,
                                      report_to="wandb",
                                      run_name="GPU_memory_imbalance",
                                      push_to_hub=False)

    quant_config = BitsAndBytesConfig(load_in_4bit=True,
                                      bnb_4bit_quant_type="nf4",
                                      bnb_4bit_quant_storage=None,
                                      bnb_4bit_compute_dtype="bfloat16",
                                      bnb_4bit_use_double_quant=True)

    lora_config = LoraConfig(r=8,
                             lora_alpha=32,
                             lora_dropout=0.05,
                             bias="none",
                             task_type="CAUSAL_LM",
                             target_modules=["up_proj",
                                             "down_proj",
                                             "gate_proj",
                                             "k_proj",
                                             "q_proj",
                                             "v_proj",
                                             "o_proj"])

    foundation_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B",
                                                            device_map="auto",
                                                            trust_remote_code=True,
                                                            attn_implementation="flash_attention_2",
                                                            quantization_config=quant_config
                                                            )
    print(f"foundation_model hf_device_map: {foundation_model.hf_device_map}")
    model = prepare_model_for_kbit_training(foundation_model)
    print(f"prepare_model_for_kbit_training hf_device_map: {model.hf_device_map}")
    model = get_peft_model(model, lora_config)
    print(f"get_peft_model hf_device_map: {model.hf_device_map}")

    optimizer = torch.optim.AdamW(model.parameters(),
                                  lr=0.0003,
                                  weight_decay=0.1,
                                  betas=(0.9, 0.95),
                                  eps=1.0e-05)

    lr_scheduler = get_cosine_with_min_lr_schedule_with_warmup(optimizer,
                                                               num_training_steps=9132,
                                                               num_warmup_steps=91,
                                                               num_cycles=0.5,
                                                               last_epoch=-1,
                                                               min_lr=0.1)

    trainer = Trainer(model=model,
                      args=training_args,
                      train_dataset=train_dataset,
                      eval_dataset=None,
                      data_collator=data_collator,
                      optimizers=(optimizer, lr_scheduler)
                      )
    print(f"trainer hf_device_map: {trainer.model.hf_device_map}")
    trainer.train()
    huggingface_hub.logout()
    wandb.finish()


if __name__ == "__main__":
    main()

The result from model.hf_device_map is as follows: [screenshot of the printed device map]. There does seem to be an imbalance, with 9 entries mapped to GPU 0 and 26 to GPU 1.
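(As an aside, one way to nudge device_map="auto" toward a more even split is to pass an explicit max_memory budget per GPU when loading; a minimal sketch reusing the quant_config defined in the reproducer above, with illustrative limits rather than tuned values:)

from transformers import AutoModelForCausalLM

# Per-GPU budgets below are illustrative; quant_config comes from the reproducer above
foundation_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    device_map="auto",
    max_memory={0: "35GiB", 1: "35GiB"},
    quantization_config=quant_config,
)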

The nvidia-smi output is as before, only now GPUs 0 and 1 are being used: [screenshot of nvidia-smi output]

Paul-Richmond avatar Jun 26 '24 10:06 Paul-Richmond

Thanks for the reproducer @Paul-Richmond ! I'll keep you updated !

SunMarc avatar Jun 26 '24 13:06 SunMarc

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 20 '24 15:07 github-actions[bot]