Yang Yu
@StellaAthena Also, the NIC is 100G RDMA/RoCE.
@labteral Please take a look.
I actually faced the same issue last year when using `jq` as a scripting language in an internal system. I just patched jq and manually copied `map` and `slice` instead....
FYI, we actually call `gc.freeze()` after loading the inference model in our online system to reduce GC latency.
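A minimal illustration of that pattern (the model loader below is a hypothetical stand-in, not our actual system):

```python
import gc

def load_model():
    # Hypothetical stand-in for loading large inference model weights.
    return [{"layer": i, "weights": list(range(1000))} for i in range(32)]

model = load_model()

gc.collect()  # clean up loading-phase garbage first
gc.freeze()   # move all surviving objects to the permanent generation

# Frozen objects are excluded from future collections, so steady-state
# GC pauses during serving no longer scan the large, effectively
# immortal model objects.
print(gc.get_freeze_count())
```

`gc.freeze()` and `gc.get_freeze_count()` are available since Python 3.7; the key detail is calling `gc.collect()` first so only long-lived objects get frozen.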
The whole point of this issue is: it seems the CUDA kernels do not actually require the kv cache to be contiguous? ---- > because different layers do not necessarily...
> how about the performance degradation you observed when using multiple memcpy calls (one per layer) compared to a single memcpy for all layers?

Benchmark code and results are in this [gist](https://gist.github.com/reyoung/00c6f9c42f258d800144d1fd0b0bd5df). Even...
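The gist benchmarks device-side copies; as a rough host-side analogue of the same comparison (the sizes and iteration count below are illustrative, not the gist's actual parameters):

```python
import time

NUM_LAYERS = 32
LAYER_BYTES = 1 << 16  # 64 KiB per layer, illustrative only

src = bytearray(b"\xab" * (NUM_LAYERS * LAYER_BYTES))
dst = bytearray(NUM_LAYERS * LAYER_BYTES)

def single_copy():
    # one contiguous copy covering all layers at once
    dst[:] = src

def per_layer_copy():
    # one copy per layer, as with a non-contiguous kv cache
    for i in range(NUM_LAYERS):
        off = i * LAYER_BYTES
        dst[off:off + LAYER_BYTES] = src[off:off + LAYER_BYTES]

for name, fn in (("single", single_copy), ("per-layer", per_layer_copy)):
    t0 = time.perf_counter()
    for _ in range(20):
        fn()
    print(f"{name}: {time.perf_counter() - t0:.4f}s")
```

Both variants move the same bytes; the per-layer version just pays per-call overhead `NUM_LAYERS` times, which is the effect the gist measures on device.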
> I do think we can make kv-cache storage non-contiguous, and as you mentioned, we just need to change the strides accordingly

Maybe I can have our team submit serial...
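A minimal sketch of that point (shapes and names below are made up for illustration): the intra-layer strides used to index a paged kv cache do not depend on whether the layers themselves are contiguous with each other, so each layer could live in its own allocation.

```python
NUM_PAGES, PAGE_SIZE, HEAD_DIM = 8, 16, 64  # illustrative shapes

def kv_offset(page, slot, dim):
    # Element offset *within one layer's* kv buffer. A kernel only needs
    # this plus a per-layer base pointer, so layers need not sit in one
    # contiguous block; only the per-layer base addresses change.
    return (page * PAGE_SIZE + slot) * HEAD_DIM + dim

# Non-contiguous layout: one independent buffer per layer (2 bytes/elem, fp16).
layers = [bytearray(NUM_PAGES * PAGE_SIZE * HEAD_DIM * 2) for _ in range(4)]
```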
And btw, maybe some [pre-commit hooks](https://pre-commit.com/) or a contribution guide would help. FlashInfer uses clang-format for its C sources, but a pre-commit hook can check the `clang-format` version and format the code automatically.
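A sketch of what such a config could look like (the mirror repo and pinned `rev` below are assumptions; pin whatever clang-format version the project actually standardizes on):

```yaml
# .pre-commit-config.yaml (sketch)
repos:
  - repo: https://github.com/pre-commit/mirrors-clang-format
    rev: v17.0.6          # pins the clang-format version for everyone
    hooks:
      - id: clang-format
        types_or: [c++, cuda]
```

With this in place, `pre-commit install` wires the hook into git, and formatting runs automatically on every commit.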
However, when `indptr` is a CPU tensor, an unnecessary CPU -> device copy is invoked here: https://github.com/flashinfer-ai/flashinfer/blob/d30667b0a23c1cc9135f7557404409ca1a9b9f02/python/flashinfer/prefill.py#L990 It seems that the current API cannot avoid the cross-device copy.
I found that `kv_indptr` should always be on the CPU, while `qo_indptr` is used on both CPU and CUDA. Maybe it is better to only add a `cpu_qo_indptr` parameter.