Yang Yu
@StellaAthena Also, the NIC is 100G RDMA/RoCE.
@labteral Please take a look.
I actually faced the same issue last year when using `jq` as a scripting language in an internal system. I just patched jq and manually copied `map` and `slice` instead....
FYI, we actually call `gc.freeze()` after loading the inference model in our online system to reduce GC latency.
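A minimal illustration of that pattern (the model loader below is a hypothetical stand-in, not our actual system):

```python
import gc

def load_model():
    # Hypothetical stand-in for loading large inference model weights.
    return [{"layer": i, "weights": list(range(1000))} for i in range(32)]

model = load_model()

gc.collect()  # clean up loading-phase garbage first
gc.freeze()   # move all surviving objects to the permanent generation

# Frozen objects are excluded from future collections, so steady-state
# GC pauses during serving no longer scan the large, effectively
# immortal model objects.
print(gc.get_freeze_count())
```

`gc.freeze()` and `gc.get_freeze_count()` are available since Python 3.7; the key detail is calling `gc.collect()` first so only long-lived objects get frozen.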
The whole point of this issue is: it seems the CUDA kernels do not actually require the kv cache to be contiguous? ---- > because different layers do not necessarily...
> how about the performance degradation you observed when using multiple memcpy calls (one per layer) compared to a single memcpy for all layers?

Benchmark code and results are in this [gist](https://gist.github.com/reyoung/00c6f9c42f258d800144d1fd0b0bd5df). Even...
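The gist benchmarks device-side copies; as a rough host-side analogue of the same comparison (the sizes and iteration count below are illustrative, not the gist's actual parameters):

```python
import time

NUM_LAYERS = 32
LAYER_BYTES = 1 << 16  # 64 KiB per layer, illustrative only

src = bytearray(b"\xab" * (NUM_LAYERS * LAYER_BYTES))
dst = bytearray(NUM_LAYERS * LAYER_BYTES)

def single_copy():
    # one contiguous copy covering all layers at once
    dst[:] = src

def per_layer_copy():
    # one copy per layer, as with a non-contiguous kv cache
    for i in range(NUM_LAYERS):
        off = i * LAYER_BYTES
        dst[off:off + LAYER_BYTES] = src[off:off + LAYER_BYTES]

for name, fn in (("single", single_copy), ("per-layer", per_layer_copy)):
    t0 = time.perf_counter()
    for _ in range(20):
        fn()
    print(f"{name}: {time.perf_counter() - t0:.4f}s")
```

Both variants move the same bytes; the per-layer version just pays per-call overhead `NUM_LAYERS` times, which is the effect the gist measures on device.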
> I do think we can make kv-cache storage non-contiguous, and as you mentioned, we just need to change the strides accordingly

Maybe I can have our team submit serial...
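A minimal sketch of that point (shapes and names below are made up for illustration): the intra-layer strides used to index a paged kv cache do not depend on whether the layers themselves are contiguous with each other, so each layer could live in its own allocation.

```python
NUM_PAGES, PAGE_SIZE, HEAD_DIM = 8, 16, 64  # illustrative shapes

def kv_offset(page, slot, dim):
    # Element offset *within one layer's* kv buffer. A kernel only needs
    # this plus a per-layer base pointer, so layers need not sit in one
    # contiguous block; only the per-layer base addresses change.
    return (page * PAGE_SIZE + slot) * HEAD_DIM + dim

# Non-contiguous layout: one independent buffer per layer (2 bytes/elem, fp16).
layers = [bytearray(NUM_PAGES * PAGE_SIZE * HEAD_DIM * 2) for _ in range(4)]
```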
And btw, maybe some [pre-commit hooks](https://pre-commit.com/) or a contribution guide would help. FlashInfer uses clang-format for its C sources, but a pre-commit hook can check the `clang-format` version and format the code automatically.
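A sketch of what such a config could look like (the mirror repo and pinned `rev` below are assumptions; pin whatever clang-format version the project actually standardizes on):

```yaml
# .pre-commit-config.yaml (sketch)
repos:
  - repo: https://github.com/pre-commit/mirrors-clang-format
    rev: v17.0.6          # pins the clang-format version for everyone
    hooks:
      - id: clang-format
        types_or: [c++, cuda]
```

With this in place, `pre-commit install` wires the hook into git, and formatting runs automatically on every commit.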
However, when `indptr` is a CPU tensor, an unnecessary CPU -> device copy is invoked here: https://github.com/flashinfer-ai/flashinfer/blob/d30667b0a23c1cc9135f7557404409ca1a9b9f02/python/flashinfer/prefill.py#L990 It seems that the current API cannot avoid the cross-device copy.
I found that `kv_indptr` should always be on the CPU, while `qo_indptr` is used on both CPU and CUDA. Maybe it is better to only add a `cpu_qo_indptr` parameter.