
Low VRAM usage

Open simonegiuliani opened this issue 1 year ago • 5 comments

I am running the example for unsloth/Meta-Llama-3.1-405B-Instruct-bnb-4bit. I have 48 GB of VRAM. Calling nvidia-smi, I see that VRAM usage is never higher than 4 GB and is often around 200 MB. Is that normal?
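For reference, the example in question looks roughly like the standard AirLLM snippet below (a minimal sketch assuming the AutoModel API shown in the AirLLM README; the exact arguments are illustrative):

```python
from airllm import AutoModel

# Layer-by-layer loading is handled internally by AirLLM; only the layer
# currently being evaluated (plus a prefetched one) sits in VRAM.
model = AutoModel.from_pretrained("unsloth/Meta-Llama-3.1-405B-Instruct-bnb-4bit")

input_text = ["What is the capital of the United States?"]
input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=128,
)

output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)
print(model.tokenizer.decode(output.sequences[0]))
```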

simonegiuliani avatar Nov 20 '24 09:11 simonegiuliani

Perhaps these layers are too small in size.
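A rough back-of-envelope supports this: if the 405B model's weights (about half a byte per parameter at 4-bit) are spread over roughly 126 transformer layers, each layer is only on the order of 1-2 GB, which lines up with the nvidia-smi readings above. The figures below are assumptions, not measured values:

```python
# Back-of-envelope per-layer size for Llama-3.1-405B at 4-bit (assumed figures).
params = 405e9            # total parameters
bytes_per_param = 0.5     # 4-bit quantization
n_layers = 126            # transformer blocks in Llama 3.1 405B

per_layer_gb = params * bytes_per_param / n_layers / 1e9
print(f"~{per_layer_gb:.1f} GB per layer resident in VRAM at a time")
```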

ZhichengQian1 avatar Dec 04 '24 07:12 ZhichengQian1

I am not sure I understand. What can I do to use more VRAM?

simonegiuliani avatar Dec 04 '24 11:12 simonegiuliani

I have the same question. Would it be possible to add a parameter to control how much VRAM we intend to use, which might speed up inference?

Xingwei-Tan avatar Jan 08 '25 10:01 Xingwei-Tan

What I don't understand is: are the layers used sequentially or based on neuron/feature activation?

Currently it removes the previous layer and also prefetches the next one; could accounting for the free memory help utilize more of the available VRAM? (See the sketch below.)

No clue if it even makes sense; I was just curious about the problem (8 GB VRAM GPU) and read the code a bit.
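For the record, the pattern being described is roughly the loop below (a hypothetical simplification for illustration, not the actual AirLLM code):

```python
import torch

def layer_by_layer_forward(hidden, layer_files, device="cuda"):
    """Run a model one layer at a time: stream a layer's weights from disk,
    evaluate it, then free the VRAM before the next layer is loaded."""
    for path in layer_files:
        # Disk -> RAM -> VRAM (layers saved as full modules in this sketch)
        layer = torch.load(path, map_location=device, weights_only=False)
        with torch.no_grad():
            hidden = layer(hidden)       # evaluate this layer only
        del layer                        # evict the layer's weights
        torch.cuda.empty_cache()         # release the VRAM for the next layer
    return hidden
```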

xawos avatar May 28 '25 20:05 xawos

@simonegiuliani That's normal (seeing low VRAM usage even when you have a large amount of it). This repo lets you run LLMs without much memory by loading a single layer into VRAM just in time for inference. It's extremely slow because you're bottlenecked by Disk -> RAM -> VRAM bandwidth. There's no real fix other than not using this repo and instead offloading only as many model layers as fit within your VRAM limit (using, for example, ollama with layer offloading).
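If it helps, a minimal sketch of that alternative, assuming Ollama's HTTP generate API and its num_gpu option (the model tag and layer count below are placeholders, not recommendations):

```python
import requests

# Ask Ollama to keep a fixed number of layers in VRAM and run the rest on CPU.
# num_gpu = number of layers offloaded to the GPU; pick the largest value that
# still fits in your VRAM.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:70b",        # placeholder model tag
        "prompt": "What is the capital of the United States?",
        "options": {"num_gpu": 40},     # layers to place on the GPU (placeholder)
        "stream": False,
    },
)
print(resp.json()["response"])
```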

@xawos Layers are always activated sequentially; that's true for neural networks in general. There's no such thing as a "feature layer" (I mean, if you squint, you can maybe tag certain layers as encoding a particular feature, but that's rare and unlikely).

The fact that all of the layers are evaluated in sequence is exactly why the AirLLM trick works, but you're still bottlenecked by bandwidth along Disk -> RAM -> GPU VRAM.
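To put rough numbers on that bottleneck (assumptions, not measurements): when the weights don't fit in system RAM, every generated token has to stream essentially all of them from disk once, so disk read speed dominates.

```python
# Back-of-envelope token latency when weights are streamed from disk each step.
params = 405e9                  # Llama-3.1-405B parameters
bytes_per_param = 0.5           # 4-bit quantization
weight_bytes = params * bytes_per_param   # ~200 GB of weights per forward pass
nvme_read = 2e9                 # ~2 GB/s sustained NVMe read (assumed)

print(f"~{weight_bytes / nvme_read:.0f} s per token just to stream the weights")
```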

peteygao avatar Oct 08 '25 14:10 peteygao