
Question: Does GPU splitting take more ram than running on a single GPU?

Open nikshepsvn opened this issue 2 years ago • 4 comments

Is there any loss when splitting?

nikshepsvn avatar Jun 30 '23 22:06 nikshepsvn

Yes, I just tested it. Splitting a 33b model between two GPUs resulted in an additional 1.5GB of VRAM usage.

shouyiwang avatar Jul 01 '23 12:07 shouyiwang

The implementation needs some temporary buffers on each device, yes. Not sure exactly how much it works out to, but it might be around 1.5 GB for 33b.

turboderp avatar Jul 01 '23 13:07 turboderp

Is the key/value cache allocated in main RAM, in one GPU's VRAM, or split between the two GPUs' VRAM? Will one GPU take more load than the other?

taowen avatar Jul 02 '23 14:07 taowen

The K/V cache is split between GPUs. It's really a collection of caches, one for each layer of the model, so if you put 20 layers on one GPU and 40 layers on another, you'll have 1/3 of the cache on the first GPU and 2/3 on the other.
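To illustrate the proportional split described above, here is a minimal sketch (not exllama's actual code; the function name and per-layer size are assumptions for the example):

```python
# Illustrative sketch: a per-layer K/V cache splits across GPUs in
# proportion to how many layers each GPU is assigned.

def cache_split(layers_per_gpu, cache_bytes_per_layer):
    """Return the K/V cache bytes held by each GPU, given the number
    of transformer layers assigned to it."""
    return [n * cache_bytes_per_layer for n in layers_per_gpu]

# A 60-layer model with 20 layers on GPU 0 and 40 on GPU 1:
split = cache_split([20, 40], cache_bytes_per_layer=50 * 2**20)
fractions = [s / sum(split) for s in split]
# GPU 0 holds 1/3 of the cache, GPU 1 holds 2/3.
```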

If you need more precise control of where the layers go, you can manually change config.device_map.layers. Otherwise the auto mapping puts as many layers on each GPU as will fit in the space given by the --gpu_split argument. The cache is allocated separately from the model so gpu_split doesn't account for it.
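The auto-mapping idea can be sketched as a greedy fill: put layers on each GPU until its --gpu_split budget runs out, then move on. This is a simplified illustration, not exllama's implementation; the function name and sizes are assumptions (integer MB is used to keep the arithmetic exact):

```python
# Hedged sketch of auto layer mapping: greedily fill each GPU's
# --gpu_split budget with layers, then spill to the next GPU.
# (Illustrative only; not the actual exllama mapping code.)

def auto_map(num_layers, layer_mb, gpu_split_mb):
    """Assign each layer index to a device string, filling budgets in order.
    Assumes the layers fit within the combined budgets."""
    device_map = []
    gpu, used = 0, 0
    for _ in range(num_layers):
        if used + layer_mb > gpu_split_mb[gpu]:
            gpu, used = gpu + 1, 0  # budget exhausted: next GPU
        device_map.append(f"cuda:{gpu}")
        used += layer_mb

    return device_map

# e.g. 60 layers of ~550 MB with a split of 11 GB / 22 GB:
mapping = auto_map(60, 550, [11000, 22000])
# 20 layers land on cuda:0, 40 on cuda:1.
```

Note that this accounts only for the weights, which matches the behavior described above: the cache is allocated afterwards, on top of whatever each budget was filled with.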

turboderp avatar Jul 02 '23 14:07 turboderp