Unexpected OOM Issue (7B GRPO QLoRA on H100 80GB)
Hi unsloth team, thanks for the amazing work!
I encountered an OOM error when running QLoRA GRPO on deepseek-coder-7b with a single H100 80GB.
Packages: unsloth==2025.11.3, trl==0.23.0, transformers==4.56.2, torch==2.8.0+cu128
Parameters: batch_size=1, num_generations=8, max_prompt_length=512, max_completion_length=1024
Also, I am using Standby mode:

```python
os.environ["UNSLOTH_VLLM_STANDBY"] = "1"
```
```python
def load_model_and_tokenizer(self):
    print(f"Loading model: {self.model_name}")
    self.model, self.tokenizer = FastLanguageModel.from_pretrained(
        model_name=self.model_name,
        max_seq_length=self.max_seq_length,
        load_in_4bit=self.load_in_4bit,
        fast_inference=True,
        gpu_memory_utilization=0.8,
        local_files_only=True,
    )
    self.model = FastLanguageModel.get_peft_model(
        self.model,
        r=64,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_alpha=64,
        use_gradient_checkpointing="unsloth",
        random_state=3407,
    )
```
Based on memory-efficient-rl#h100-experiments, I understand that a 14B model with seq_len=32,768 and num_generations=8 fits comfortably on an H100.
So I am confused why my setup hits OOM, since it's only a 7B model.
Any clues would be helpful. Thanks for the help!
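In case it helps triage, here is a rough back-of-envelope for the logits alone under my settings (assuming deepseek-coder's vocab size is about 32k and fp32 logits; this is just a sanity check, not a profiler reading):

```python
# Back-of-envelope: fp32 logits held for one forward pass over the completions.
# VOCAB_SIZE is an assumption for deepseek-coder; real memory use also includes
# the KV cache, activations, and optimizer state, which dominate in practice.
VOCAB_SIZE = 32_256            # assumed tokenizer vocab size
NUM_GENERATIONS = 8
MAX_COMPLETION_LENGTH = 1024
BYTES_FP32 = 4

logits_bytes = NUM_GENERATIONS * MAX_COMPLETION_LENGTH * VOCAB_SIZE * BYTES_FP32
print(f"{logits_bytes / 1024**3:.2f} GiB")  # ~0.98 GiB per forward
```

So the logits alone should be under 1 GiB per forward, which is why I don't see where the 80 GB goes.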
Hey @lindafei01, if you can share the stack trace and/or wandb of your OOM run, that would be of great help.
Also I want to understand, what value of gradient_accumulation_steps are you using?
log.txt

Thanks for your reply! The log file is attached here. I am using gradient_accumulation_steps=9.
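For reference, this is the number of completions accumulated per optimizer step with my settings (just the arithmetic, in case it helps):

```python
# Completions whose gradients accumulate before each optimizer step
# under my config: per-device batch * generations per prompt * accumulation.
batch_size = 1
num_generations = 8
grad_accum_steps = 9

completions_per_step = batch_size * num_generations * grad_accum_steps
print(completions_per_step)  # 72
```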
Hey @lindafei01, I suspect the issue was due to grad_acc_steps being high (9). There was a bug in our code, which we have fixed in https://github.com/unslothai/unsloth/pull/3390. Can you please try installing the latest version with

```shell
pip install git+https://github.com/unslothai/unsloth.git
```
If that still causes issues, we can look deeper into what is wrong.
Also, while you are running it, can you please set `os.environ['UNSLOTH_ENABLE_LOGGING'] = '1'` :)
Hi @Datta0, thanks! Reducing grad_acc_steps to 4 and using a relatively shorter sequence length resolves the OOM issue for the 7B model.
But training on meta-llama/CodeLlama-13b-Instruct-hf still hits OOM (which should not happen, since the benchmark shows a 14B model fits well on an H100).
The hyperparameters are as follows: BATCH_SIZE=1, GRAD_ACCUM_STEPS=4, NUM_GENERATIONS=4, MAX_PROMPT_LENGTH=512, MAX_COMPLETION_LENGTH=768, MAX_SEQ_LENGTH=1280
```python
self.model, self.tokenizer = FastLanguageModel.from_pretrained(
    model_name=self.model_name,
    max_seq_length=self.max_seq_length,
    load_in_4bit=self.load_in_4bit,  # True
    fast_inference=True,
    gpu_memory_utilization=0.9,
    local_files_only=True,
)
# Add LoRA adapters for efficient fine-tuning
self.model = FastLanguageModel.get_peft_model(
    self.model,
    r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=64,
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
```
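One thing I am not sure about: with gpu_memory_utilization=0.9, vLLM reserves most of the card, so if standby mode were not reclaiming that pool, the training side would only be left with roughly (a rough sketch assuming an 80 GB card; standby mode may change this):

```python
# Rough headroom left for training after vLLM's reservation.
# Assumes an 80 GB card and that standby mode does NOT reclaim the vLLM pool.
total_gb = 80
gpu_memory_utilization = 0.9

training_headroom_gb = total_gb * (1 - gpu_memory_utilization)
print(f"{training_headroom_gb:.1f} GB")  # 8.0 GB
```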
The log file is attached in this thread.