
Effect of flushing the cache on throughput

Open · amirarsalan90 opened this issue 1 year ago · 5 comments

When running a model with --model-mode flashinfer (I have tested mistralai/Mistral-7B-Instruct-v0.2) on a large batch (e.g., 50,000 text inputs), I usually see high throughput for the first few minutes, after which it starts degrading.

Would calling flush_cache every few iterations (e.g., every 256 iterations) improve anything? Does it make sense to split the input into batches of 256 and hit http://0.0.0.0:80000/flush_cache when sending the requests for each batch?

amirarsalan90 · Mar 07 '24
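
For reference, a minimal sketch of the batching-plus-flush pattern being asked about, assuming an sglang server whose HTTP API exposes POST /generate and GET /flush_cache (the base URL, batch size, and sampling parameters below are illustrative, not values confirmed in this thread):

```python
import requests

# Illustrative settings; the port and batch size are assumptions,
# not values confirmed in this thread.
BASE_URL = "http://0.0.0.0:30000"
BATCH_SIZE = 256

def run_in_batches(prompts):
    """Send prompts in batches and flush the radix cache between batches."""
    outputs = []
    for i in range(0, len(prompts), BATCH_SIZE):
        for prompt in prompts[i:i + BATCH_SIZE]:
            resp = requests.post(
                f"{BASE_URL}/generate",
                json={"text": prompt,
                      "sampling_params": {"max_new_tokens": 128}},
            )
            outputs.append(resp.json())
        # Drop all cached prefixes before starting the next batch.
        requests.get(f"{BASE_URL}/flush_cache")
    return outputs
```

In practice you would send each batch concurrently (e.g., with a thread pool or asyncio) rather than one request at a time, so the server can actually batch them.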

That depends on your use case. If you only get high throughput for the first few requests, I'd guess your requests share few common prefixes, so you barely benefit from RadixAttention but still pay the cache eviction overhead. In that case, flushing the cache may help throughput.

comaniac · Mar 07 '24

Thanks! My requests all share a common prefix (the system prompt), but they are diverse after that prefix.

amirarsalan90 · Mar 07 '24

That should still benefit from RadixAttention, so you shouldn't see throughput drop after a while unless your system prompt is very short. In any case, you can first try flushing the cache every N requests and see if that helps.

comaniac · Mar 07 '24

@comaniac This thread is fascinating. My naive question is: how can cache evictions in RadixAttention be costly enough to cause throughput slowdowns? I assume the caches are just mapped blocks of GPU memory (non-contiguous), and on eviction one may need to delete them and/or perform some segmented memory merges? Doesn't CUDA offer async memory ops? Again, I ask out of curiosity as to how such ops could become a bottleneck.

Qubitium · Mar 08 '24
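
To make the question concrete, here is a simplified, hypothetical sketch of LRU eviction over a radix tree of cached prefixes (not sglang's actual implementation; all names here are made up). The key observation is that evicting a block need not copy GPU memory at all: the KV-cache pool can simply mark block indices as free, so the cost is CPU-side tree traversal and bookkeeping.

```python
import heapq

class Node:
    """One radix-tree node holding a token run and its KV-cache blocks."""
    def __init__(self):
        self.children = {}      # next token -> child Node
        self.block_ids = []     # indices into the GPU KV-cache block pool
        self.last_access = 0.0  # timestamp used for LRU ordering
        self.ref_count = 0      # > 0 while an in-flight request pins this node

def evict(leaves, free_list, blocks_needed):
    """Evict least-recently-used, unpinned leaves until enough blocks are free.

    "Freeing" a block here is just appending its index to a free list;
    no GPU memory is copied or unmapped.
    """
    heap = [(n.last_access, id(n), n) for n in leaves if n.ref_count == 0]
    heapq.heapify(heap)
    freed = 0
    while heap and freed < blocks_needed:
        _, _, node = heapq.heappop(heap)
        free_list.extend(node.block_ids)
        freed += len(node.block_ids)
        node.block_ids = []
    return freed  # caller can stall or flush if this falls short
```

If the tree is large and many requests contend on it, this bookkeeping plus any synchronization around it can add up, which is one plausible (unconfirmed) source of the slowdown.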

I agree with your point. Since I don't have any more details about this case, that was just my guess. The real bottleneck behind this throughput drop could be anywhere; we'd need detailed logs or a reproducible example to dig into it.

comaniac · Mar 08 '24

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

github-actions[bot] · Jul 25 '24