Simon Mo

313 comments by Simon Mo

Can you open an RFC for this for design discussion?

Please add this to https://github.com/vllm-project/vllm/blob/main/docs/source/serving/integrations.rst?plain=1 so it's included in the docs.

@Alexei-V-Ivanov-AMD please ping me on Slack when this is ready to go. Thank you!

When you initialize the async engine, I think it expects to be in a running event loop; I'm not sure why, though. If you change the code to ```diff diff --git...

@mmoskal @noamgat @br3no curious about your feedback on this!

One question: can this be implemented using safetensors' partial read? A safetensors file keeps all of its metadata in the header, so you can read individual tensors, or parts of them, without loading the whole file.
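To make the partial-read idea concrete, here is a minimal sketch using only the Python standard library. It assumes the published safetensors layout (an 8-byte little-endian header length, a JSON header with `dtype`, `shape`, and `data_offsets` per tensor, then the raw byte buffer). The helper names `write_safetensors`, `read_header`, and `read_f32_slice` are hypothetical and exist only for this illustration; they are not part of the safetensors library or of vLLM, and the sketch handles only 1-D float32 tensors.

```python
import json
import os
import struct
import tempfile


def write_safetensors(path, tensors):
    """Write {name: list of floats} as 1-D F32 tensors (illustrative helper)."""
    header, payload = {}, b""
    for name, values in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)
        header[name] = {
            "dtype": "F32",
            "shape": [len(values)],
            # Offsets are relative to the start of the data buffer.
            "data_offsets": [len(payload), len(payload) + len(raw)],
        }
        payload += raw
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header length
        f.write(header_bytes)
        f.write(payload)


def read_header(path):
    """Read only the JSON header; return it with the data buffer's file offset."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return header, 8 + header_len


def read_f32_slice(path, name, start, stop):
    """Read elements [start, stop) of one 1-D F32 tensor via seek, not a full load."""
    header, data_start = read_header(path)
    entry = header[name]
    if entry["dtype"] != "F32" or len(entry["shape"]) != 1:
        raise ValueError("this sketch only handles 1-D F32 tensors")
    if not 0 <= start <= stop <= entry["shape"][0]:
        raise IndexError("slice out of range")
    begin, _end = entry["data_offsets"]
    with open(path, "rb") as f:
        f.seek(data_start + begin + 4 * start)  # 4 bytes per float32
        raw = f.read(4 * (stop - start))
    return list(struct.unpack(f"<{stop - start}f", raw))


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.safetensors")
    write_safetensors(path, {"a": [0.0, 1.0], "b": [10.0, 20.0, 30.0, 40.0]})
    print(read_f32_slice(path, "b", 1, 3))  # prints [20.0, 30.0]
```

In practice the safetensors library exposes its own lazy-loading interface, which would be the natural choice over hand-parsing the header as done here.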

@DarkLight1337 can you help take another look and let me know whether this is mergeable?

You can use AsyncLLMEngine to call it asynchronously.

I believe this is similar to #1879. While T4 can run a 7B model, the throughput will be very very low and vLLM will likely perform a lot of eviction...