
Low VRAM usage

Open simonegiuliani opened this issue 1 year ago • 5 comments

I am running the example for unsloth/Meta-Llama-3.1-405B-Instruct-bnb-4bit. I have 48 GB of VRAM. Calling nvidia-smi, I see that VRAM usage is never higher than 4 GB and is often around 200 MB. Is that normal?
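For reference, the example in question looks roughly like the standard AirLLM snippet below (a minimal sketch assuming the AutoModel API shown in the AirLLM README; the exact arguments are illustrative):

```python
from airllm import AutoModel

# Layer-by-layer loading is handled internally by AirLLM; only the layer
# currently being evaluated (plus a prefetched one) sits in VRAM.
model = AutoModel.from_pretrained("unsloth/Meta-Llama-3.1-405B-Instruct-bnb-4bit")

input_text = ["What is the capital of the United States?"]
input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=128,
)

output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)
print(model.tokenizer.decode(output.sequences[0]))
```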

simonegiuliani avatar Nov 20 '24 09:11 simonegiuliani

Perhaps these layers are too small in size.
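A rough back-of-envelope supports this: if the 405B model's weights (about half a byte per parameter at 4-bit) are spread over roughly 126 transformer layers, each layer is only on the order of 1-2 GB, which lines up with the nvidia-smi readings above. The figures below are assumptions, not measured values:

```python
# Back-of-envelope per-layer size for Llama-3.1-405B at 4-bit (assumed figures).
params = 405e9            # total parameters
bytes_per_param = 0.5     # 4-bit quantization
n_layers = 126            # transformer blocks in Llama 3.1 405B

per_layer_gb = params * bytes_per_param / n_layers / 1e9
print(f"~{per_layer_gb:.1f} GB per layer resident in VRAM at a time")
```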

ZhichengQian1 avatar Dec 04 '24 07:12 ZhichengQian1

I am not sure I understand. What can I do to use more VRAM?

simonegiuliani avatar Dec 04 '24 11:12 simonegiuliani

I have the same question. Would it be possible to add a parameter to control how much VRAM we intend to use, which might speed up inference?

Xingwei-Tan avatar Jan 08 '25 10:01 Xingwei-Tan

What I don't understand is: are the layers used sequentially or based on neuron/feature activation?

Currently it removes the previous layer and also prefetches the next one; could accounting for the free memory help utilize more of the available VRAM? (See the sketch below.)

No clue if it even makes sense; I was just curious about the problem (8 GB VRAM GPU) and read the code a bit.
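For the record, the pattern being described is roughly the loop below (a hypothetical simplification for illustration, not the actual AirLLM code):

```python
import torch

def layer_by_layer_forward(hidden, layer_files, device="cuda"):
    """Run a model one layer at a time: stream a layer's weights from disk,
    evaluate it, then free the VRAM before the next layer is loaded."""
    for path in layer_files:
        # Disk -> RAM -> VRAM (layers saved as full modules in this sketch)
        layer = torch.load(path, map_location=device, weights_only=False)
        with torch.no_grad():
            hidden = layer(hidden)       # evaluate this layer only
        del layer                        # evict the layer's weights
        torch.cuda.empty_cache()         # release the VRAM for the next layer
    return hidden
```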

xawos avatar May 28 '25 20:05 xawos

@simonegiuliani That's normal (seeing low VRAM usage even when you have a large amount of it). This repo lets you run LLMs without much memory by loading a single layer into VRAM just in time for inference. It's extremely slow because you're bottlenecked by Disk -> RAM -> VRAM bandwidth. There's no real fix other than not using this repo and instead offloading only as many model layers as fit within your VRAM limit (using, for example, ollama with layer offloading).
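If it helps, a minimal sketch of that alternative, assuming Ollama's HTTP generate API and its num_gpu option (the model tag and layer count below are placeholders, not recommendations):

```python
import requests

# Ask Ollama to keep a fixed number of layers in VRAM and run the rest on CPU.
# num_gpu = number of layers offloaded to the GPU; pick the largest value that
# still fits in your VRAM.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:70b",        # placeholder model tag
        "prompt": "What is the capital of the United States?",
        "options": {"num_gpu": 40},     # layers to place on the GPU (placeholder)
        "stream": False,
    },
)
print(resp.json()["response"])
```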

@xawos Layers are always activated sequentially; that's true for neural networks in general. There's no such thing as a "feature layer" (I mean, if you squint, you can maybe tag certain layers as encoding a particular feature, but that's rare and unlikely).

The fact that all of the layers are evaluated in sequence is exactly why the AirLLM trick works, but you're still bottlenecked by bandwidth along Disk -> RAM -> GPU VRAM.
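To put rough numbers on that bottleneck (assumptions, not measurements): when the weights don't fit in system RAM, every generated token has to stream essentially all of them from disk once, so disk read speed dominates.

```python
# Back-of-envelope token latency when weights are streamed from disk each step.
params = 405e9                  # Llama-3.1-405B parameters
bytes_per_param = 0.5           # 4-bit quantization
weight_bytes = params * bytes_per_param   # ~200 GB of weights per forward pass
nvme_read = 2e9                 # ~2 GB/s sustained NVMe read (assumed)

print(f"~{weight_bytes / nvme_read:.0f} s per token just to stream the weights")
```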

peteygao avatar Oct 08 '25 14:10 peteygao