Unreasonable VRAM Usage
I have been trying to evaluate a LLaMA 3 8B model (fine-tuned with PEFT) on the leaderboard task using lm-eval 0.4.8, with evaluation set to FP16.
Unfortunately, whenever I try, on either the vLLM or the HuggingFace backend, I get "torch.OutOfMemoryError: CUDA out of memory": it tried to allocate 7.83 GiB of VRAM on top of the 79.25 GiB already allocated. The error invariably appears right when the run starts processing loglikelihood requests.
For context, I am running on an Nvidia A100 with 80 GB of VRAM, inside a Miniconda virtual environment on a Vast.ai instance. I have hit similar errors on GPUs with less VRAM as well. In any case, it seems unreasonable for evaluating the leaderboard tasks on an 8B model to need anywhere close to 80 GiB of VRAM.
Is there some setting I'm missing to reduce VRAM usage? Or is there a memory leak somewhere?
Command used: lm-eval --model vllm --model_args pretrained=/root/merged_model,dtype=auto,gpu_memory_utilization=0.9 --tasks leaderboard --batch_size auto --device gpu --output_path /root/results/finetuned
Terminal Screenshot:
I would really appreciate your help.
Hi! This is a known issue with vLLM for loglikelihood-based tasks. You could try setting gpu_memory_utilization quite low (that budget covers both the model weights and the KV cache, so the right value is model- and GPU-dependent). You could also try setting the batch size manually instead of auto.
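For example, something along these lines (the 0.5 utilization, the fixed batch size of 8, and the max_model_len cap are just illustrative starting points, not recommended values, and will need tuning for your model and GPU):

lm-eval --model vllm --model_args pretrained=/root/merged_model,dtype=auto,gpu_memory_utilization=0.5,max_model_len=4096 --tasks leaderboard --batch_size 8 --output_path /root/results/finetuned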
Are you experiencing the issue with hf as well? Generally that should work if you set the batch_size appropriately. Maybe there's a bug in the auto batch-size computation when it switches over from loglikelihood to generation tasks?
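To rule that out, an hf run with an explicit batch size would look roughly like this (batch_size 4 and dtype=float16 are just examples; lower the batch size further if it still OOMs):

lm-eval --model hf --model_args pretrained=/root/merged_model,dtype=float16 --tasks leaderboard --batch_size 4 --device cuda:0 --output_path /root/results/finetuned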