
SRT performance single-core CPU bound

Open qeternity opened this issue 11 months ago • 3 comments

Hi all,

After extensive testing, we find that the SRT backend is heavily CPU bound: it often cannot saturate a single GPU at reasonable batch sizes unless paired with a latest-generation consumer CPU with high single-core performance.

For instance, a 4090 paired with a 7402P performs significantly worse than a 3090 Ti paired with a 7600X (equivalent Python/CUDA/Torch/etc. versions).

I presume this is due to the single-threaded nature of the event loop serving uvicorn, but it is nonetheless a significant performance impediment. Additional constraints such as regex massively reduce serving performance, as the FSMs also appear to run in this hot path.
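
To illustrate what I mean by the hot path, here is a toy sketch. It is not SGLang code: `fsm_step`, `ticker`, and `run` are made-up names, and the loop body is only a stand-in for per-token CPU work. It counts how many turns another coroutine gets while CPU work runs on the same event loop.

```python
import asyncio


def fsm_step(state: int, n: int = 50_000) -> int:
    """Hypothetical stand-in for per-token CPU work (e.g. advancing a regex FSM)."""
    for i in range(n):
        state = (state * 31 + i) % 1_000_003
    return state


async def ticker(counter: list, stop: asyncio.Event) -> None:
    """Stand-in for other requests: bumps a counter each time the loop schedules it."""
    while not stop.is_set():
        counter[0] += 1
        await asyncio.sleep(0)


async def run(yield_between_steps: bool, steps: int = 10) -> int:
    """Return how many turns the ticker got while the CPU work was running."""
    counter = [0]
    stop = asyncio.Event()
    task = asyncio.create_task(ticker(counter, stop))
    await asyncio.sleep(0)  # let the ticker start once
    before = counter[0]

    state = 0
    for _ in range(steps):
        state = fsm_step(state)  # runs on the event loop thread
        if yield_between_steps:
            await asyncio.sleep(0)  # hand control back to the loop

    turns = counter[0] - before
    stop.set()
    await task
    return turns


def demo() -> tuple:
    blocked = asyncio.run(run(yield_between_steps=False))
    cooperative = asyncio.run(run(yield_between_steps=True))
    return blocked, cooperative


if __name__ == "__main__":
    blocked, cooperative = demo()
    print(f"ticker turns while blocked: {blocked}, while yielding: {cooperative}")
```

Without yielding, the ticker gets no turns until all the CPU work is done. Yielding only restores fairness between coroutines; it adds no CPU throughput, so everything is still capped by the speed of a single core.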

Does anyone have any tips on mitigating this in production?

Thanks

qeternity, Feb 25 '24 10:02