
Unexpected OOM Issue (7B GRPO QLora on H100 80GB)

Open lindafei01 opened this issue 5 months ago • 3 comments

Hi unsloth team, thanks for the amazing work!

I encountered an OOM error when running QLoRA GRPO on deepseek-coder-7b with a single H100 80GB.

Packages: unsloth==2025.11.3, trl==0.23.0, transformers==4.56.2, torch==2.8.0+cu128

Parameters: batch_size=1, num_generations=8, max_prompt_length=512, max_completion_length=1024

Also, I am using Standby mode.

os.environ["UNSLOTH_VLLM_STANDBY"] = "1"

def load_model_and_tokenizer(self):
    print(f"Loading model: {self.model_name}")

    self.model, self.tokenizer = FastLanguageModel.from_pretrained(
        model_name=self.model_name,
        max_seq_length=self.max_seq_length,
        load_in_4bit=self.load_in_4bit,
        fast_inference=True,
        gpu_memory_utilization=0.8,
        local_files_only=True,
    )

    self.model = FastLanguageModel.get_peft_model(
        self.model,
        r=64,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_alpha=64,
        use_gradient_checkpointing="unsloth",
        random_state=3407,
    )

Based on memory-efficient-rl#h100-experiments, I understand that a 14B model with seq_len=32,768 and num_generations=8 can fit well on an H100.

So I am confused why my setup hits OOM, since it's just a 7B model.

Any clues would be helpful. Thanks for the help!
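For context on why GRPO can OOM even at batch_size=1, here is a rough back-of-envelope sketch of the logit memory alone. The vocab size and dtype below are assumptions for illustration, not values taken from this run; the point is that a tensor of shape [num_generations, completion_len, vocab_size] is materialised per step, and if gradient accumulation buffers several of these, usage multiplies quickly.

```python
# Back-of-envelope GRPO logit memory (all numbers are assumptions).
num_generations = 8
completion_len = 1024
vocab_size = 32_256   # assumed vocab size for deepseek-coder
bytes_fp32 = 4        # assumed: logprobs computed from fp32 logits

logits_gb = num_generations * completion_len * vocab_size * bytes_fp32 / 1e9
print(f"~{logits_gb:.2f} GB per logits tensor")  # ~1.06 GB
```

A second copy is typically needed for the reference model's logprobs, and anything buffered across accumulation steps scales this further.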

lindafei01 avatar Nov 17 '25 04:11 lindafei01

Hey @lindafei01, if you can share the stack trace and/or the wandb logs of your OOM run, that would be a great help. Also, what value of gradient_accumulation_steps are you using?

Datta0 avatar Nov 17 '25 05:11 Datta0

Thanks for your reply!

The log file (log.txt) is attached here. I am using gradient_accumulation_steps=9.

lindafei01 avatar Nov 18 '25 02:11 lindafei01

Hey @lindafei01, I suspect the issue is due to grad_acc_steps being high (9). There was a bug in our code, which we have fixed in https://github.com/unslothai/unsloth/pull/3390. Can you please try installing the fix with pip install git+https://github.com/unslothai/unsloth.git? If that still causes issues, we can look deeper into what is wrong. Also, while you are running it, can you please set os.environ['UNSLOTH_ENABLE_LOGGING']='1' :)
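A minimal sketch of the suggested setup. The ordering here is an assumption (common to env-var flags): if unsloth reads these flags at import time, they must be set before the import line, or they may have no effect.

```python
import os

# Set BEFORE `from unsloth import ...` -- assumption: flags read at
# import time are ignored if set afterwards.
os.environ["UNSLOTH_VLLM_STANDBY"] = "1"
os.environ["UNSLOTH_ENABLE_LOGGING"] = "1"

# from unsloth import FastLanguageModel  # import only after the flags are set
```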

Datta0 avatar Nov 18 '25 06:11 Datta0

error_llama_13b.log

Hi @Datta0, thanks! Reducing grad_acc_steps to 4 and using a relatively shorter sequence length resolved the OOM issue for the 7B model.

But training on meta-llama/CodeLlama-13b-Instruct-hf still hits OOM, which should not happen, since the benchmark shows that a 14B model fits well on an H100.

The hyperparameters are as follows: BATCH_SIZE=1, GRAD_ACCUM_STEPS=4, NUM_GENERATIONS=4, MAX_PROMPT_LENGTH=512, MAX_COMPLETION_LENGTH=768, MAX_SEQ_LENGTH=1280

self.model, self.tokenizer = FastLanguageModel.from_pretrained(
    model_name=self.model_name,
    max_seq_length=self.max_seq_length,
    load_in_4bit=self.load_in_4bit,  # True
    fast_inference=True,
    gpu_memory_utilization=0.9,
    local_files_only=True,
)

# Add LoRA adapters for efficient fine-tuning
self.model = FastLanguageModel.get_peft_model(
    self.model,
    r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=64,
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
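One factor that may be worth ruling out (a sketch under the assumption that standby mode was not actually reclaiming vLLM's reservation in this run): with fast_inference=True, vLLM pre-allocates roughly gpu_memory_utilization of the card up front, which leaves little headroom for weights, gradients, optimizer state, and activations on the training side.

```python
# Headroom left for training if vLLM's reservation is not released
# (assumption: standby inactive; numbers are illustrative).
total_gb = 80                 # H100 VRAM
gpu_memory_utilization = 0.9  # value used in the 13B run

vllm_reserved_gb = total_gb * gpu_memory_utilization
training_headroom_gb = total_gb - vllm_reserved_gb
print(f"vLLM reserves ~{vllm_reserved_gb:.0f} GB, "
      f"leaving ~{training_headroom_gb:.0f} GB for training")
```

Dropping gpu_memory_utilization (the 7B run used 0.8) or confirming standby is active in the logs would distinguish this from a genuine training-side OOM.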

The log file is attached in this thread.

lindafei01 avatar Nov 20 '25 16:11 lindafei01