OOM when training Llama 3.2 1B with the Python API
I'm trying to train the llama-3.2-1B model with the Python API on a machine with 4 Tesla V100s (4 x 16 GB), but the process keeps failing due to OOM. Watching nvidia-smi, I see memory usage shoot up to 16 GB on each GPU and then the process dies. From my understanding, the 1B model should fit in much less VRAM, so maybe I'm doing something incorrectly. Here is my code:
import os

import lightning as L
import torch
from litgpt import LLM
from litgpt.data import Alpaca2k


class LitLLM(L.LightningModule):
    def __init__(self, tokenizer_dir=None, trainer_ckpt_path=None):
        super().__init__()
        # distribute=None so the Lightning Trainer handles device placement
        self.llm = LLM.load("meta-llama/Llama-3.2-1B", distribute=None, access_token=os.getenv("HF_TOKEN"))
        self.trainer_ckpt_path = trainer_ckpt_path

    def setup(self, stage):
        self.llm.trainer_setup(trainer_ckpt=self.trainer_ckpt_path)

    def training_step(self, batch):
        logits, loss = self.llm(input_ids=batch["input_ids"], target_ids=batch["labels"])
        self.log("train_loss", loss, prog_bar=True)
        return loss

    def validation_step(self, batch):
        logits, loss = self.llm(input_ids=batch["input_ids"], target_ids=batch["labels"])
        self.log("validation_loss", loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        warmup_steps = 10
        optimizer = torch.optim.AdamW(self.llm.model.parameters(), lr=0.0002, weight_decay=0.0, betas=(0.9, 0.95))
        scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: step / warmup_steps)
        return [optimizer], [scheduler]


batch_size = 2
accumulate_grad_batches = 1

lit_model = LitLLM()
data = Alpaca2k()
data.connect(lit_model.llm.tokenizer, batch_size=batch_size, max_seq_length=512)

trainer = L.Trainer(
    devices=4,
    accelerator="cuda",
    max_epochs=1,
    accumulate_grad_batches=accumulate_grad_batches,
    precision="bf16-true",
)
trainer.fit(lit_model, data)
The process dies before even completing the first training step. I also tried a few quantization approaches by passing quantize (and other params) to self.llm.distribute in the setup method, but none of them seem to work. Any ideas on what I might be doing wrong? Thanks.
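For illustration, one of those attempts looked roughly like the snippet below; the specific devices, precision, and quantize values are just examples of what I tried, not a single fixed configuration:

def setup(self, stage):
    # illustrative values; my actual attempts varied these arguments
    self.llm.distribute(devices=1, precision="bf16-true", quantize="bnb.nf4")
    self.llm.trainer_setup(trainer_ckpt=self.trainer_ckpt_path)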
Thanks for the feedback. It does work on 4 x L4s, which have 24 GB each. I can see that the usage is around 22-24 GB. Other than trying a smaller batch size or block size, or perhaps a different multi-GPU strategy, I am not sure how this can be improved.
@rasbt thanks for the quick reply. So is it taking 22 GB in total across the GPUs or on each GPU? I would think a sequential load strategy could help split the model across the GPUs, and 64 GB total should be enough for it, but when I use distribute it seems to conflict with the trainer. What would be the right way to distribute the model across the GPUs and then train it with the trainer? Also, any input on quantizing the model?
It was on each GPU. I think it uses substantially less than 22 GB x 4 in total, though; it might work just fine on a single GPU with 40 GB, but I haven't tried. You could also consider an FSDP strategy with cpu_offload=True to reduce GPU memory usage, but training will then take a bit longer. Alternatively, the first thing I'd try in your case is to set the batch_size to 1 and then increase the gradient accumulation steps.
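As a minimal sketch of what I mean, building on your snippet above (the accumulation value of 8 is just an example; pick whatever keeps the effective batch size where you want it):

from lightning.pytorch.strategies import FSDPStrategy

batch_size = 1
accumulate_grad_batches = 8  # example value; effective batch size = batch_size * accumulate_grad_batches

data.connect(lit_model.llm.tokenizer, batch_size=batch_size, max_seq_length=512)

trainer = L.Trainer(
    devices=4,
    accelerator="cuda",
    max_epochs=1,
    accumulate_grad_batches=accumulate_grad_batches,
    precision="bf16-true",
    # shards parameters, gradients, and optimizer state across GPUs and offloads to CPU
    strategy=FSDPStrategy(cpu_offload=True),
)
trainer.fit(lit_model, data)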
Interestingly, using the CLI tool I'm even able to finetune Llama 3.1 8B with no quantization across the 4 GPUs, although I suspect that's thanks to LoRA. I'll need to check whether it works with the Python API as well.
Ah yes, litgpt finetune ... uses LoRA by default. For full finetuning, it's litgpt finetune_full ...