
KoboldAI performance when splitting layers between GPUs

conanak99 opened this issue on Feb 25, 2023 · 2 comments

I'm running into some really weird performance issues when loading KoboldAI models across multiple GPUs. I'm using the United version from https://github.com/henk717/KoboldAI, on Linux, installed via play.sh.

For testing I'm using PygmalionAI_pygmalion-350m, a very small model, loaded through the old UI.

Here is the performance with all layers loaded onto a single GPU:

INIT       | Info       | Final device configuration:
       DEVICE ID  |  LAYERS  |  DEVICE NAME
   (primary)   0  |      24  |  NVIDIA RTX A5000
               1  |       0  |  NVIDIA RTX A5000
               2  |       0  |  NVIDIA RTX A5000
               3  |       0  |  NVIDIA RTX A5000
               4  |       0  |  NVIDIA RTX A5000
             N/A  |       0  |  (Disk cache)
             N/A  |       0  |  (CPU)

Loading model tensors: 100%|##########| 389/389 [00:01<00:00, 346.13it/s]
INFO       | __main__:load_model:3260 - Pipeline created: PygmalionAI_pygmalion-350m
INIT       | Starting   | LUA bridge
INIT       | OK         | LUA bridge
INIT       | Starting   | LUA Scripts
INIT       | OK         | LUA Scripts
Setting Seed
INFO       | __main__:do_connect:4165 - Client connected! UI_1
PROMPT     @ 2023-02-25 17:50:48 | Hi
INFO       | __main__:raw_generate:5763 - Generated 80 tokens in 2.45 seconds, for an average rate of 32.65 tokens per second.
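
For context, my understanding is that splitting layers means each GPU gets a contiguous block of transformer layers, so during generation the hidden states have to hop to the next device at every block boundary. Here's a rough sketch of the same idea using the Hugging Face transformers/accelerate `device_map` mechanism (just an analogy for illustration, not KoboldAI's actual breakmodel code; the memory caps are made up to force a split):

```python
# Sketch of per-device layer splitting via Hugging Face accelerate's
# device_map -- an analogy for what KoboldAI's splitter does, not its code.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Capping per-GPU memory forces accelerate to shard the decoder layers
# across devices, much like the 7/7/4/2/4 layout in the log below.
model = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-350m",
    device_map="auto",
    max_memory={i: "1GiB" for i in range(5)},  # made-up caps to force a split
)
print(model.hf_device_map)  # which module ended up on which GPU

tokenizer = AutoTokenizer.from_pretrained("PygmalionAI/pygmalion-350m")
inputs = tokenizer("Hi", return_tensors="pt").to(0)
# During generate(), the hidden states cross a GPU boundary every time
# execution moves from one device's block of layers to the next.
out = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```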

The performance looks really good, but when I try to split the layers between GPUs, it degrades badly (about 100× slower):

INIT       | Info       | Final device configuration:
       DEVICE ID  |  LAYERS  |  DEVICE NAME
   (primary)   0  |       7  |  NVIDIA RTX A5000
               1  |       7  |  NVIDIA RTX A5000
               2  |       4  |  NVIDIA RTX A5000
               3  |       2  |  NVIDIA RTX A5000
               4  |       4  |  NVIDIA RTX A5000
             N/A  |       0  |  (Disk cache)
             N/A  |       0  |  (CPU)

Loading model tensors: 100%|##########| 389/389 [00:01<00:00, 336.48it/s]
INFO       | __main__:load_model:3260 - Pipeline created: PygmalionAI_pygmalion-350m
INIT       | Starting   | LUA bridge
INIT       | OK         | LUA bridge
INIT       | Starting   | LUA Scripts
INIT       | OK         | LUA Scripts
Setting Seed
INFO       | __main__:do_connect:4165 - Client connected! UI_1
INFO       | __main__:do_connect:4165 - Client connected! UI_1
PROMPT     @ 2023-02-25 17:51:44 | HiHi
INFO       | __main__:raw_generate:5763 - Generated 6 tokens in 25.14 seconds, for an average rate of 0.24 tokens per second.
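
For reference, that's 32.65 / 0.24 ≈ 136× slower, so this is far beyond a small communication tax. To get a feel for what a single cross-GPU hop actually costs here, I'd time a hidden-state-sized tensor bouncing between two of the cards (my own rough micro-benchmark, not anything from KoboldAI; the 1×1024 shape is a guess at pygmalion-350m's hidden size):

```python
# Rough micro-benchmark of a single cross-GPU activation hop -- the cost a
# layer split pays at every device boundary, for every generated token.
import time
import torch

x = torch.randn(1, 1024, device="cuda:0")  # guessed hidden-state size
torch.cuda.synchronize()
start = time.perf_counter()
hops = 2000
for _ in range(hops // 2):
    x = x.to("cuda:1")  # one boundary crossing...
    x = x.to("cuda:0")  # ...and back
torch.cuda.synchronize()
per_hop = (time.perf_counter() - start) / hops
print(f"{per_hop * 1e6:.1f} us per hop")
```

Even a slow hop measured this way shouldn't add up to 25 seconds for 6 tokens across only 4 device boundaries, which makes me suspect the transfers are taking a much slower path than direct peer-to-peer.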

I expected the performance to be a bit worse due to the overhead of communication between the GPUs, but should it be that much worse 😓? And it seems no layers are stored in the disk cache or in CPU RAM. Is this the expected behavior of KoboldAI, or some problem with my setup 🤔?
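
One thing I plan to check (just a guess at a cause, nothing the logs actually show) is whether the cards can reach each other over peer-to-peer at all; if P2P is unavailable, every inter-GPU copy gets staged through host memory:

```python
# Check peer-to-peer reachability between every GPU pair; False means
# inter-GPU copies are staged through the host, which is much slower.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: {'P2P ok' if ok else 'no P2P'}")
```

`nvidia-smi topo -m` should show the link topology between the cards as well.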
