exllama
Question: Does GPU splitting take more RAM than running on a single GPU?
Is there any loss when splitting?
Yes, I just tested it. Splitting a 33b model between two GPUs resulted in an additional 1.5 GB of VRAM usage.
The implementation needs some temporary buffers on each device, yes. Not sure exactly how much it works out to, but it might be around 1.5 GB for 33b.
Is the key/value cache allocated in main RAM, in one GPU's VRAM, or split between the two GPUs' VRAM? Will one GPU take more load than the other?
The K/V cache is split between GPUs. It's really a collection of caches, one for each layer of the model, so if you put 20 layers on one GPU and 40 layers on another, you'll have 1/3 of the cache on the first GPU and 2/3 on the other.
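To make the proportionality concrete, here is a minimal illustrative sketch mirroring the 20/40 example above. The `layers` list and the `cuda:0`/`cuda:1` device strings are assumptions for illustration, not pulled from exllama's internals:

```python
from collections import Counter

# Hypothetical device map: 20 layers on GPU 0, 40 layers on GPU 1,
# one device string per decoder layer.
layers = ["cuda:0"] * 20 + ["cuda:1"] * 40

# Each layer owns its own K/V cache, so the cache divides in the same ratio
# as the layers themselves.
for device, n in sorted(Counter(layers).items()):
    print(f"{device}: {n}/{len(layers)} layers -> {n / len(layers):.0%} of the K/V cache")
# cuda:0: 20/60 layers -> 33% of the K/V cache
# cuda:1: 40/60 layers -> 67% of the K/V cache
```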
If you need more precise control over where the layers go, you can manually change config.device_map.layers. Otherwise, the auto mapping puts as many layers on each GPU as will fit in the space given by the --gpu_split argument. The cache is allocated separately from the model, so gpu_split doesn't account for it.
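As a rough sketch of what a manual override might look like: only config.device_map.layers is confirmed by this thread; the ExLlamaConfig/ExLlama setup follows exllama's example scripts, and the paths and the 20-layer split point are placeholders to check against your own checkout:

```python
from model import ExLlama, ExLlamaConfig  # exllama's model.py

# Placeholder paths to the model's config and weights.
config = ExLlamaConfig("/models/33b/config.json")
config.model_path = "/models/33b/model.safetensors"

# device_map.layers holds one device string per decoder layer; overwrite it
# before loading to pin the first 20 layers to GPU 0 and the rest to GPU 1.
n_layers = len(config.device_map.layers)
config.device_map.layers = ["cuda:0"] * 20 + ["cuda:1"] * (n_layers - 20)

model = ExLlama(config)  # weights load onto the devices named in the map
```

Since the per-layer caches follow the layers, this same map also decides where the K/V cache ends up, independently of whatever --gpu_split was given.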