KoboldAI performance when splitting layers between GPUs
I'm having some really weird performance issues when splitting KoboldAI models across multiple GPUs. I'm using the United version from https://github.com/henk717/KoboldAI. I'm on Linux and installed KoboldAI with play.sh.
For testing, I will just use PygmalionAI_pygmalion-350m, a very small model. I load the model using the old UI.
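For context, my understanding is that the layer split is conceptually similar to loading a Hugging Face model with an explicit device map, something like the standalone sketch below (plain transformers + accelerate; this is just my assumption, not KoboldAI's actual loading code):

```python
# Rough standalone sketch of a multi-GPU layer split (my assumption of what
# KoboldAI's split corresponds to; NOT KoboldAI's actual internals).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PygmalionAI/pygmalion-350m")
# device_map="auto" (requires the accelerate package) spreads the layers over
# all visible GPUs, analogous to assigning layer counts per device in the UI.
model = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-350m",
    device_map="auto",
    torch_dtype=torch.float16,
)
print(model.hf_device_map)  # shows which module landed on which GPU
```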
Here is the performance when loading all layers onto a single GPU:
INIT | Info | Final device configuration:
DEVICE ID | LAYERS | DEVICE NAME
(primary) 0 | 24 | NVIDIA RTX A5000
1 | 0 | NVIDIA RTX A5000
2 | 0 | NVIDIA RTX A5000
3 | 0 | NVIDIA RTX A5000
4 | 0 | NVIDIA RTX A5000
N/A | 0 | (Disk cache)
N/A | 0 | (CPU)
Loading model tensors: 100%|##########| 389/389 [00:01<00:00, 346.13it/s]
INFO | __main__:load_model:3260 - Pipeline created: PygmalionAI_pygmalion-350m
INIT | Starting | LUA bridge
INIT | OK | LUA bridge
INIT | Starting | LUA Scripts
INIT | OK | LUA Scripts
Setting Seed
INFO | __main__:do_connect:4165 - Client connected! UI_1
PROMPT @ 2023-02-25 17:50:48 | Hi
INFO | __main__:raw_generate:5763 - Generated 80 tokens in 2.45 seconds, for an average rate of 32.65 tokens per second.
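The logged rate checks out (80 / 2.45 ≈ 32.65 tokens per second). In case anyone wants to reproduce the timing outside KoboldAI, here is roughly how I'd measure it, continuing from the hypothetical transformers sketch above:

```python
import time
import torch

inputs = tokenizer("Hi", return_tensors="pt").to("cuda:0")
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=80)
torch.cuda.synchronize()  # include any outstanding GPU work in the timing
elapsed = time.perf_counter() - start
new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s = {new_tokens / elapsed:.2f} tok/s")
```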
The performance seems really good, but when I try to split the layers between GPUs, performance degrades dramatically (roughly 100x slower):
INIT | Info | Final device configuration:
DEVICE ID | LAYERS | DEVICE NAME
(primary) 0 | 7 | NVIDIA RTX A5000
1 | 7 | NVIDIA RTX A5000
2 | 4 | NVIDIA RTX A5000
3 | 2 | NVIDIA RTX A5000
4 | 4 | NVIDIA RTX A5000
N/A | 0 | (Disk cache)
N/A | 0 | (CPU)
Loading model tensors: 100%|##########| 389/389 [00:01<00:00, 336.48it/s]
INFO | __main__:load_model:3260 - Pipeline created: PygmalionAI_pygmalion-350m
INIT | Starting | LUA bridge
INIT | OK | LUA bridge
INIT | Starting | LUA Scripts
INIT | OK | LUA Scripts
Setting Seed
INFO | __main__:do_connect:4165 - Client connected! UI_1
INFO | __main__:do_connect:4165 - Client connected! UI_1
PROMPT @ 2023-02-25 17:51:44 | HiHi
INFO | __main__:raw_generate:5763 - Generated 6 tokens in 25.14 seconds, for an average rate of 0.24 tokens per second.
I'd expect the performance to be somewhat worse due to the overhead of communication between GPUs, but should it be this much worse 😓? Also, it seems no layers are stored in the disk cache, RAM, or CPU. Is this the expected behavior of KoboldAI, or a problem with my setup 🤔?
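In case it helps narrow this down, here is a rough PyTorch sketch I'd use to check whether raw GPU-to-GPU copies between two of the cards are the bottleneck (device indices are just an example):

```python
import time
import torch

# Is peer-to-peer access even enabled between GPU 0 and GPU 1?
print("P2P 0->1:", torch.cuda.can_device_access_peer(0, 1))

x = torch.randn(1024, 1024, device="cuda:0")  # ~4 MB fp32 tensor
torch.cuda.synchronize("cuda:0")
start = time.perf_counter()
for _ in range(100):
    y = x.to("cuda:1")
torch.cuda.synchronize("cuda:1")  # wait for all copies to finish
elapsed = time.perf_counter() - start
mb = x.numel() * x.element_size() / 1e6
print(f"effective cuda:0 -> cuda:1 bandwidth: {100 * mb / elapsed:.0f} MB/s")
```

If that reports speeds in the GB/s range, the slowdown is probably not raw transfer bandwidth but something about how the generation is scheduled across devices.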