sglang
SRT performance is single-core CPU bound
Hi all,
After extensive testing, we found that the SRT backend is heavily CPU bound: it often cannot saturate a single GPU at reasonable batch sizes unless it is paired with a latest-generation consumer CPU with high single-core performance.
For instance, a 4090 paired with a 7402P performs significantly worse than a 3090 Ti paired with a 7600X (with equivalent Python, CUDA, Torch, etc. versions).
I presume this is due to the single-threaded nature of the event loop serving uvicorn, but it is nonetheless a significant performance impediment. Additional constraints such as regex also massively reduce serving performance, as the FSMs appear to run in this same hot path.
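To illustrate the kind of stall I suspect, here is a minimal, self-contained sketch using plain `asyncio`. It is not SGLang or uvicorn code. The names `busy_work`, `heartbeat`, and `measure_max_lag` are hypothetical, and `busy_work` is only an assumed stand-in for CPU-bound per-step work such as FSM or regex evaluation. It measures how late a heartbeat coroutine wakes up when such work runs inline on the event loop thread.

```python
import asyncio
import time


def busy_work(seconds: float) -> None:
    """Hypothetical stand-in for CPU-bound per-step work (e.g. FSM evaluation)."""
    end = time.perf_counter() + seconds
    while time.perf_counter() < end:
        pass  # spin: holds the GIL and never yields to the event loop


async def heartbeat(interval: float, lags: list, stop: asyncio.Event) -> None:
    """Record how late each wake-up is relative to the requested interval."""
    while not stop.is_set():
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lags.append(time.perf_counter() - start - interval)


async def measure_max_lag(block_seconds: float) -> float:
    """Return the worst heartbeat lag seen while optionally blocking the loop."""
    lags = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(0.01, lags, stop))
    await asyncio.sleep(0.05)  # let the heartbeat tick a few times
    if block_seconds > 0:
        busy_work(block_seconds)  # runs inline, on the event loop thread
    await asyncio.sleep(0.05)  # give the heartbeat a chance to wake up
    stop.set()
    await hb
    return max(lags)


def run_demo(block_seconds: float = 0.2):
    """Compare the worst lag on an idle loop against a loop blocked inline."""
    baseline = asyncio.run(measure_max_lag(0.0))
    blocked = asyncio.run(measure_max_lag(block_seconds))
    return baseline, blocked


if __name__ == "__main__":
    baseline, blocked = run_demo()
    print(f"max loop lag, idle:    {baseline * 1000:.1f} ms")
    print(f"max loop lag, blocked: {blocked * 1000:.1f} ms")
```

While `busy_work` runs, nothing else scheduled on that loop can make progress, so the blocked lag should be roughly the length of the busy section. If the server behaves the same way, that would explain why single-core speed matters so much here.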
Does anyone have any tips on mitigating this in production?
Thanks