vidur
Questions on CPU Overheads & KV Cache Aware Routing
Hi, thanks for open-sourcing this project!
I have a couple of questions:
- Regarding CPU overheads (e.g. scheduling and tokenization): they are mentioned in the documentation, but from reading the code, it seems that a request can enter a batch immediately upon arrival. Does the simulator currently model these CPU-related delays? Also, the link in the docs appears to be broken.
- Are there any plans to continue developing new features in the simulator, such as KV cache-aware routing (like NVIDIA Dynamo's KV Cache Routing)?
Thanks!