Harry Mellor
Is that fix in the PR you linked?
I also support this, in the past I have spent many hours closing 100s of issues but it's not sustainable. I'm happy to be a reviewer on the PR to...
`actions/stale` has an argument to exempt draft PRs from being marked as stale (and therefore closed) https://github.com/actions/stale?tab=readme-ov-file#exempt-draft-pr There is also an argument to exempt specific labels like `keep-open` https://github.com/actions/stale?tab=readme-ov-file#exempt-pr-labels
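A minimal sketch of how those two `actions/stale` inputs might look in a workflow step (the label name `keep-open` is the one proposed here, not a default):

```yaml
# Hypothetical stale-bot step combining both exemptions discussed above.
- uses: actions/stale@v9
  with:
    # Never mark draft PRs as stale
    exempt-draft-pr: true
    # Never mark issues/PRs carrying this label as stale
    exempt-pr-labels: 'keep-open'
    exempt-issue-labels: 'keep-open'
```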
I agree that a single comment might make it a little too easy to keep an issue open forever. In the PR you have the workflow running on every...
Actually, how about instead of automatically adding `keep-open` on "unstale" we add an intermediate label such as `unstale` (which would be removed if the issue went stale again). Then people...
Instead of hard-coding S3 paths, what if we used an environment variable (`VLLM_CI` or something) which, if set, will prepend `s3://vllm-ci-model-weights/` to the `model` and set `load_format="runai_streamer"`?
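A rough sketch of what that helper could look like (the function name `resolve_model_args` and the exact bucket behaviour are illustrative assumptions, not existing vLLM code):

```python
import os

# Hypothetical prefix used by the CI bucket suggestion above.
S3_PREFIX = "s3://vllm-ci-model-weights/"


def resolve_model_args(model: str) -> dict:
    """Return model-loading kwargs, honoring the proposed VLLM_CI env var.

    If VLLM_CI is set, prepend the CI S3 bucket to the model path and
    switch the load format to the Run:ai streamer; otherwise pass the
    model through unchanged.
    """
    if os.environ.get("VLLM_CI"):
        return {"model": S3_PREFIX + model, "load_format": "runai_streamer"}
    return {"model": model}
```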
I believe you are correct that the scheduler lives in the `EngineCore`, which is run in background processes. You can access this background process using `llm.llm_engine.engine_core.proc_handle.proc` if that helps?
Assuming this is a custom chat window, are you constructing `messages` correctly?
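For reference, a minimal sketch of the OpenAI-style shape that `messages` is usually expected to take: a list of role/content dicts in conversation order (the exact content strings here are just placeholders):

```python
# Each turn is a dict with a "role" ("system", "user", or "assistant")
# and a "content" string; turns appear in chronological order.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "Summarize this document for me."},
]
```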
It is expected that the outputs could be different if a particular prompt is included in a batch as the floating point arithmetic is different. If this answer isn't satisfactory...
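The root cause is that floating point addition is not associative, so a different reduction order (which batching can introduce) gives a slightly different sum. A tiny self-contained illustration:

```python
# Classic demonstration that float addition order matters:
# summing the same three values in a different grouping
# produces results that differ by ~1 ulp.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)  # → False
print(abs(left - right))  # a tiny nonzero difference
```

In a model, these tiny per-operation differences can compound across layers and flip a sampled token, so batched and unbatched runs of the same prompt need not match exactly.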
Please make sure the failing check is resolved too.