text-generation-webui
Multiprocessing, Multithreading and Parallelism
So I noticed that when CPU offloading, only one core is used. This seems like a bottleneck. On FlexGen everything gets used, and the generations rival the GPU itself.
I could be wrong, but this is due to the Python GIL: it allows only one core to be consumed by Python, and htop confirms this.
They are trying to remove the lock, and there is already a no-GIL build of Python. There are also some libraries and coding tricks to get around it, e.g. external modules that don't require the GIL.
https://github.com/colesbury/nogil
It starts in 3.10, which is what we are already using. Thoughts on this? It could speed up offloading a bit?
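To illustrate the GIL point (a minimal stdlib sketch, not code from this repo): a pure-Python CPU-bound task gains nothing from extra threads because only one can hold the GIL at a time, while separate processes each get their own interpreter and GIL:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def busy_count(n):
    # Pure-Python CPU-bound loop: it holds the GIL the whole time,
    # so extra threads cannot run it in parallel.
    total = 0
    for i in range(n):
        total += i
    return total

def timed(executor_cls, jobs=4, n=2_000_000):
    start = time.perf_counter()
    with executor_cls(max_workers=jobs) as ex:
        results = list(ex.map(busy_count, [n] * jobs))
    return time.perf_counter() - start, results

if __name__ == "__main__":
    t_threads, _ = timed(ThreadPoolExecutor)   # serialized by the GIL
    t_procs, _ = timed(ProcessPoolExecutor)    # one GIL per process
    print(f"threads: {t_threads:.2f}s, processes: {t_procs:.2f}s")
```

On a multi-core box the process version finishes several times faster; C extensions that release the GIL (NumPy, most of PyTorch's kernels) are the other common workaround.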
I think it might even affect multi-GPU setups, because stuff like this should not be happening: https://github.com/henk717/KoboldAI/issues/295
If only a single core is ever used, the transfers probably run much slower.
Thoughts on this? Realistic or not?
FlexGen doesn't do the calculations on the CPU; it just creates an efficient schedule for sending layers to the GPU while keeping the inactive layers in a RAM/disk cache (if I understand correctly).
I suppose the offloading strategy implemented in accelerate (which is the one used in this repository) works the same way.
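If that understanding is right, the offload loop looks roughly like this sketch (`DummyLayer` and the helper names are hypothetical stand-ins, not accelerate's real classes; in practice the moves are `tensor.to("cuda")` / eviction of the weight copy):

```python
class DummyLayer:
    """Stand-in for a transformer block whose weights live in CPU RAM."""
    def __init__(self, idx):
        self.idx = idx
        self.on_gpu = False

    def to_gpu(self):
        # In the real code this is a host-to-device weight copy --
        # the expensive transfer this thread is worried about.
        self.on_gpu = True

    def to_cpu(self):
        # Evict so the next layer's weights fit in VRAM.
        self.on_gpu = False

    def forward(self, x):
        assert self.on_gpu, "weights must be resident before compute"
        return x + 1  # placeholder for the actual matmuls

def offloaded_forward(layers, x):
    # One layer resident at a time: transfer, compute, evict, repeat.
    for layer in layers:
        layer.to_gpu()
        x = layer.forward(x)
        layer.to_cpu()
    return x
```

The compute itself happens on the GPU either way; the question in this thread is whether the host-side transfer/scheduling step is stuck on one core.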
If I set the percent policy to 100 0 0 100 0 100 for FlexGen, it pegs all of the cores at 100% when generating.
Perhaps accelerate is locked to one thread. They list a way to use multiple CPUs with MPI, but that might be bare-metal only.
I see that GPTQ and RWKV can use all cores, so this is mainly an accelerate problem.
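As a quick experiment (my assumption, not a confirmed fix for accelerate): the intra-op thread caps that OpenMP/MKL read at startup can silently limit CPU work to one thread, and PyTorch exposes the same knob directly. A sketch of raising them before launch:

```python
import os

# These must be set before torch (or any BLAS-backed library) is
# imported; setdefault is used so an existing launcher config wins.
n_cores = str(os.cpu_count() or 1)
os.environ.setdefault("OMP_NUM_THREADS", n_cores)  # OpenMP pool size
os.environ.setdefault("MKL_NUM_THREADS", n_cores)  # MKL BLAS pool size

# The same knob from the PyTorch side, after import:
# import torch
# torch.set_num_threads(os.cpu_count())
# print(torch.get_num_threads())
```

If htop still shows one pegged core after this, the serialization is happening in Python-level scheduling code rather than in the math kernels.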
Any update on this? This is currently a huge bottleneck for me.
Hard to say. When I used multiple cores for the operations that supported it, it was often slower. But people report CPU bottlenecks all over the place.
Looking forward to it. Why is this closed?