GPU Memory Imbalance and OOM Errors During Training
System Info
- `Accelerate` version: 0.30.0
- Platform: Linux-6.5.0-27-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /data/envs/tt/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.2+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 125.62 GB
- GPU type: NVIDIA RTX A6000
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- use_cpu: False
- debug: True
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- deepspeed_config: {'deepspeed_config_file': '/data/dev/', 'zero3_init_flag': True}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- dynamo_config: {'dynamo_backend': 'EAGER', 'dynamo_mode': 'default', 'dynamo_use_dynamic': False, 'dynamo_use_fullgraph': False}
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
I was training a Llama3-8B-Instruct model with QLoRA. Training starts successfully, but GPU memory is not allocated evenly across the two GPUs, and as a result I hit an OOM error before completing even 100 steps. Checking GPU memory during training, the imbalance appears to get even more severe; in my case, GPU 1 uses far more memory than GPU 0.
On a previous 8x A100 server the memory usage was balanced, so I am not sure what is different in this case.
Below is the nvidia-smi output captured during training.
The memory imbalance is severe!
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:1F:00.0 Off | Off |
| 30% 58C P2 145W / 300W | 13224MiB / 49140MiB | 40% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 Off | 00000000:8B:00.0 Off | Off |
| 47% 71C P2 221W / 300W | 32908MiB / 49140MiB | 73% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 207219 C /data/envs/tt/bin/python 13090MiB |
| 1 N/A N/A 207219 C /data/envs/tt/bin/python 32774MiB |
+---------------------------------------------------------------------------------------+
This is my script:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

quantization_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_use_double_quant = True
)

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
tok.pad_token_id = tok.eos_token_id

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config = quantization_config,
    device_map = 'auto'
)

data = load_dataset("...")
# `process` is my own preprocessing function (not shown)
proc_data = data.map(process, remove_columns = data['train'].column_names)
tokenized_proc_data = proc_data.map(lambda x: tok(x['text'], truncation = True, max_length = 2048), batched = True)
tokenized_proc_data = tokenized_proc_data.remove_columns("text")

lora_config = LoraConfig(
    r = 32,
    lora_alpha = 32,
    lora_dropout = 0.01,
    target_modules = "all-linear"
)
model = get_peft_model(model, lora_config)

args = TrainingArguments(
    num_train_epochs = 1,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    learning_rate = 2e-8,
    logging_steps = 100,
    warmup_steps = 100,
    save_steps = 100,
    save_total_limit = 3,
    output_dir = "llama3-Ko-test",
    optim = "paged_adamw_8bit",
    bf16 = True,
    report_to = "wandb",
    run_name = "llama3-Ko-test",
)

def formatting_func(x):
    # pass each example through unchanged (wrapped in a list)
    return [x]

model.is_parallelizable = True
model.model_parallel = True

trainer = SFTTrainer(
    model = model,
    args = args,
    train_dataset = tokenized_proc_data['train'],
    formatting_func = formatting_func,
)
trainer.train()
Expected behavior
How can I resolve the GPU memory imbalance issue?
When I change only the model to Llama2, the memory imbalance still exists, but training runs without OOM under the conditions below.
What is the issue with the Llama3 series models, and how can I fix it?
MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # only the model is changed
...
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:1F:00.0 Off | Off |
| 34% 60C P2 93W / 300W | 35556MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 Off | 00000000:8B:00.0 Off | Off |
| 46% 74C P2 289W / 300W | 46250MiB / 49140MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 278782 C python 35422MiB |
| 1 N/A N/A 278782 C python 45988MiB |
+---------------------------------------------------------------------------------------+
I believe this is directly related to PEFT/LoRA, as when I did llama-3-FFT w/o that I did not get a CUDA OOM on 2x4090's, and usage was balanced. (Using FSDP). cc @SunMarc @Titus-von-Koeller @BenjaminBossan
Hi @DONGRYEOLLEE1, this is most probably a peft issue. After loading the model, is the model distributed evenly across the 2 GPUs?
@SunMarc
Hi @DONGRYEOLLEE1, this is most probably a peft issue. After loading the model, is the model distributed evenly across the 2 GPUs?
First of all, thank you very much for your reply.
The following shows the GPU memory status right after loading the Llama3 model.
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:1F:00.0 Off | Off |
| 30% 39C P8 27W / 300W | 2212MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 Off | 00000000:8B:00.0 Off | Off |
| 30% 42C P8 33W / 300W | 3990MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 314290 C /data/envs/llm_t/bin/python 2206MiB |
| 1 N/A N/A 314290 C /data/envs/llm_t/bin/python 3984MiB |
+---------------------------------------------------------------------------------------+
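In case it helps, this is a minimal sketch of what I can run right after loading to inspect the placement that device_map='auto' chose (assuming model is the quantized Llama3 model loaded above):

from collections import Counter
import torch

# Which device each module was assigned to by device_map='auto'
print(model.hf_device_map)
print(Counter(model.hf_device_map.values()))  # number of modules per device

# Memory actually allocated by PyTorch on each visible GPU
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i}: {torch.cuda.memory_allocated(i) / 1024**3:.2f} GiB allocated")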
@muellerzr
I believe this is directly related to PEFT/LoRA, as when I did llama-3-FFT w/o that I did not get a CUDA OOM on 2x4090's, and usage was balanced. (Using FSDP). cc @SunMarc @Titus-von-Koeller @BenjaminBossan
Could you let me know the version of peft you used for fine-tuning?
In my case, I used peft==0.10.0.
@DONGRYEOLLEE1 I did not use PEFT, hence what I meant by full fine-tuning with FSDP.
I tried to reproduce but still have very little experience with DeepSpeed, so I may be doing something wrong. When I try to start the script with accelerate launch, I get:
ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode
So @DONGRYEOLLEE1 did you just launch with python ...? If I do that, I also get imbalanced memory, but I'm not sure if this is using DS correctly.
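(As an aside: as far as I know, one common workaround for that ValueError when launching with accelerate launch is to put the whole quantized model on each process's GPU instead of letting device_map='auto' shard it across GPUs. A rough sketch, reusing MODEL_ID and quantization_config from the reproduction above:)

from accelerate import PartialState
from transformers import AutoModelForCausalLM

# Each process loads a full copy of the 4-bit model onto its own GPU;
# the distributed backend then only does data parallelism.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config = quantization_config,
    device_map = {"": PartialState().process_index},
)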
when I did llama-3-FFT w/o that I did not get a CUDA OOM on 2x4090's, and usage was balanced.
Did you change anything else? As the model is bnb quantized, full fine-tuning should not work, right?
@BenjaminBossan I needed CPU offloading to get it working, so quite slow but no bnb/quantization was used.
@BenjaminBossan
I just launched it from a Jupyter notebook instead of running the script with python ....
In the end, I solved the issue by using DeepSpeed + QLoRA instead.
I also tried changing the versions of PEFT and Accelerate, but the memory imbalance issue still exists when training from the Jupyter notebook.
The following shows the GPU memory status when using the DS+QLoRA method. (batch_size = 2)
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:1F:00.0 Off | Off |
| 43% 67C P2 212W / 300W | 13532MiB / 49140MiB | 84% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 Off | 00000000:8B:00.0 Off | Off |
| 47% 71C P2 203W / 300W | 12328MiB / 49140MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 4026 C /data/envs/llm_test/bin/python 13526MiB |
| 1 N/A N/A 4027 C /data/envs/llm_test/bin/python 12322MiB |
+---------------------------------------------------------------------------------------+
In the end, I solved the issue by using DeepSpeed + QLoRA instead.
I also tried changing the versions of PEFT and Accelerate, but the memory imbalance issue still exists when training from the Jupyter notebook.
Hmm, I'm confused, is the issue solved or not? :)
In the end, I solved the issue by using DeepSpeed + QLoRA instead. I also tried changing the versions of PEFT and Accelerate, but the memory imbalance issue still exists when training from the Jupyter notebook.
Hmm, I'm confused, is the issue solved or not? :)
Oh, the issue wasn't solved for my original script.
Could you show us how you launch the script? Also, from the last nvidia-smi output you posted, memory usage is 13532MiB and 12328MiB. This looks rather fine to me, I wouldn't expect usage to be 100% identical. Or is that referring to something else?
Could you show us how you launch the script? Also, from the last nvidia-smi output you posted, memory usage is 13532MiB and 12328MiB. This looks rather fine to me, I wouldn't expect usage to be 100% identical. Or is that referring to something else?
My training script is provided in the reproduction section above.
- This is the state of the GPU shortly after the start of training.
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:1F:00.0 Off | Off |
| 30% 58C P2 145W / 300W | 13224MiB / 49140MiB | 40% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 Off | 00000000:8B:00.0 Off | Off |
| 47% 71C P2 221W / 300W | 32908MiB / 49140MiB | 73% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
- This is the state of the GPU just before OOM.
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:1F:00.0 Off | Off |
| 34% 60C P2 93W / 300W | 35556MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 Off | 00000000:8B:00.0 Off | Off |
| 46% 74C P2 289W / 300W | 46250MiB / 49140MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
- The 13.3 GB / 12.4 GB state reflects the GPUs during training with the DeepSpeed + QLoRA approach (it works well, without any imbalance!). While training with DeepSpeed does not show memory imbalance, running my script in a Jupyter notebook does.
My training script is provided in the reproduction section above.
Yes, I mean how do you launch the training script exactly?
While training with DeepSpeed does not show memory imbalance, running my script in a Jupyter notebook does.
Thanks for clarifying. In that case, I don't think it's PEFT related. @muellerzr any idea why this could be? Is some setting not being passed correctly?
I'd need to see the entire notebook / a full reproducer / how you are launching it with the notebook_launcher. There could be some weird things with torch, perhaps; I can try to look into this a little.
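For reference, launching from a notebook would normally look roughly like this (a minimal sketch; training_function here is just a placeholder for the model/trainer setup plus trainer.train() from the reproduction above):

from accelerate import notebook_launcher

def training_function():
    # build the tokenizer, quantized model, PEFT adapters and SFTTrainer here,
    # one copy per spawned process, then call trainer.train()
    ...

# spawn one process per GPU from inside the notebook
notebook_launcher(training_function, args=(), num_processes=2)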
I am also encountering this behaviour whilst trying to fine-tune Llama3-8B using QLoRA. However, in my case I'm not using DeepSpeed (at least there's no deepspeed_config parameter in my accelerate config file). My script is launched with python3.
Here's the output from nvidia-smi during training:
Hi @Paul-Richmond, could you print model.hf_device_map? The imbalance is quite strange since it only uses the second and the third GPU. Could you also share a minimal reproducer? Thanks!
Hi @SunMarc, thanks for the quick reply! I'm running my script on an HPC cluster where I only request 2 GPUs from a node comprising 4 GPUs in total.
Here is a minimal reproducer script:
import os
from dotenv import load_dotenv
import wandb
import huggingface_hub
from datasets import load_dataset
from transformers import (AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          AutoModelForCausalLM,
                          BitsAndBytesConfig,
                          TrainingArguments,
                          Trainer,
                          )
from transformers.optimization import get_cosine_with_min_lr_schedule_with_warmup
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training


def create_labels(examples):
    examples["labels"] = examples["input_ids"].copy()
    return examples


def main():
    load_dotenv()
    HF_TOKEN = os.getenv("HUGGINGFACE_API_KEY")
    WANDB_TOKEN = os.getenv("WANDB_API_KEY")
    huggingface_hub.login(token=HF_TOKEN)
    wandb.login(key=WANDB_TOKEN)

    ds = load_dataset("yelp_review_full", split="train[:73047]")
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
    tokenizer.pad_token = tokenizer.eos_token
    tokenised_ds = ds.map(lambda examples: tokenizer(examples["text"],
                                                     padding="max_length",
                                                     max_length=720,
                                                     truncation=True),
                          batched=True,
                          remove_columns=ds.column_names)
    lm_dataset = tokenised_ds.map(create_labels, batched=True)
    train_dataset = lm_dataset
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    training_args = TrainingArguments(output_dir="hf",
                                      evaluation_strategy="no",
                                      per_device_train_batch_size=24,
                                      per_device_eval_batch_size=24,
                                      max_grad_norm=1.0,
                                      report_to="wandb",
                                      run_name="GPU_memory_imbalance",
                                      push_to_hub=False)

    quant_config = BitsAndBytesConfig(load_in_4bit=True,
                                      bnb_4bit_quant_type="nf4",
                                      bnb_4bit_quant_storage=None,
                                      bnb_4bit_compute_dtype="bfloat16",
                                      bnb_4bit_use_double_quant=True)

    lora_config = LoraConfig(r=8,
                             lora_alpha=32,
                             lora_dropout=0.05,
                             bias="none",
                             task_type="CAUSAL_LM",
                             target_modules=["up_proj",
                                             "down_proj",
                                             "gate_proj",
                                             "k_proj",
                                             "q_proj",
                                             "v_proj",
                                             "o_proj"])

    foundation_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B",
                                                            device_map="auto",
                                                            trust_remote_code=True,
                                                            attn_implementation="flash_attention_2",
                                                            quantization_config=quant_config
                                                            )
    print(f"foundation_model hf_device_map: {foundation_model.hf_device_map}")

    model = prepare_model_for_kbit_training(foundation_model)
    print(f"prepare_model_for_kbit_training hf_device_map: {model.hf_device_map}")
    model = get_peft_model(model, lora_config)
    print(f"get_peft_model hf_device_map: {model.hf_device_map}")

    optimizer = torch.optim.AdamW(model.parameters(),
                                  lr=0.0003,
                                  weight_decay=0.1,
                                  betas=(0.9, 0.95),
                                  eps=1.0e-05)
    lr_scheduler = get_cosine_with_min_lr_schedule_with_warmup(optimizer,
                                                               num_training_steps=9132,
                                                               num_warmup_steps=91,
                                                               num_cycles=0.5,
                                                               last_epoch=-1,
                                                               min_lr=0.1)

    trainer = Trainer(model=model,
                      args=training_args,
                      train_dataset=train_dataset,
                      eval_dataset=None,
                      data_collator=data_collator,
                      optimizers=(optimizer, lr_scheduler)
                      )
    print(f"trainer hf_device_map: {trainer.model.hf_device_map}")
    trainer.train()

    huggingface_hub.logout()
    wandb.finish()


if __name__ == "__main__":
    main()
The result from model.hf_device_map is as follows:
There does seem to be an imbalance, with 9 entries mapped to GPU 0 and 26 to GPU 1.
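One thing I may try, although I have not confirmed it helps, is nudging the automatic placement towards an even split (a rough sketch, reusing quant_config from the reproducer above; the max_memory values are only placeholders):

# Ask for a balanced split and cap how much of each card the weights may
# occupy, leaving headroom for activations and optimizer state.
foundation_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    device_map="balanced",
    max_memory={0: "40GiB", 1: "40GiB"},
    attn_implementation="flash_attention_2",
    quantization_config=quant_config,
)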
The nvidia-smi output is as before, only now GPUs 0 and 1 are being used:
Thanks for the reproducer @Paul-Richmond ! I'll keep you updated !
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.