Simon Mo
Can you open an RFC for this for design discussion?
Please add this to https://github.com/vllm-project/vllm/blob/main/docs/source/serving/integrations.rst?plain=1 so it's included in the docs.
@Alexei-V-Ivanov-AMD please ping me on Slack when this is ready to go. Thank you!
When you initialize the async engine, I think it expects to be in a running event loop; I'm not sure why, though. If you change the code to ```diff diff --git...
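For illustration only, here is a minimal sketch of how an object can end up requiring a running event loop at construction time. `LoopBoundEngine` is a hypothetical stand-in, not vLLM's async engine, and this is only one possible reason for the behavior described above. It relies on `asyncio.get_running_loop()`, which raises `RuntimeError` when no loop is running.

```python
import asyncio


class LoopBoundEngine:
    """Hypothetical stand-in (NOT vLLM's engine) that binds to the running loop."""

    def __init__(self):
        # get_running_loop() raises RuntimeError if called outside a running loop.
        self.loop = asyncio.get_running_loop()


def build_outside_loop():
    """Try to construct the object from plain synchronous code."""
    try:
        LoopBoundEngine()
    except RuntimeError:
        return False  # no running loop, so construction fails
    return True


async def build_inside_loop():
    """Construct the object from inside a coroutine, where a loop is running."""
    engine = LoopBoundEngine()
    return engine.loop is asyncio.get_running_loop()


if __name__ == "__main__":
    print("outside loop:", build_outside_loop())
    print("inside loop:", asyncio.run(build_inside_loop()))
```

If the real engine behaves like this, constructing it inside the coroutine that later drives it would avoid the error.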
@mmoskal @noamgat @br3no curious about your feedback on this!
One question I have: can this be implemented using safetensors' partial read? safetensors keeps all the metadata in the header, so you can access tensors partially.
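To make the partial-read idea concrete, here is a rough stdlib-only sketch. It assumes the documented safetensors layout: an 8-byte little-endian header length, a JSON header giving each tensor's `dtype`, `shape`, and `data_offsets`, then the raw data, with offsets relative to the end of the header. The helper names (`write_minimal_safetensors`, `read_header`, `read_f32_slice`) are made up for this example and are not part of the safetensors library; only `F32` is handled.

```python
import json
import os
import struct
import tempfile


def write_minimal_safetensors(path, tensors):
    """Write a tiny safetensors-style file. tensors: name -> (shape, list of floats)."""
    header, blobs, offset = {}, [], 0
    for name, (shape, values) in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)
        header[name] = {
            "dtype": "F32",
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        offset += len(raw)
        blobs.append(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header length
        f.write(header_bytes)
        f.write(b"".join(blobs))


def read_header(path):
    """Read only the JSON header; no tensor data is touched."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return header_len, header


def read_f32_slice(path, name, start, count):
    """Read `count` float32 elements of tensor `name`, starting at element `start`."""
    header_len, header = read_header(path)
    begin, end = header[name]["data_offsets"]
    byte_begin = begin + 4 * start
    if byte_begin + 4 * count > end:
        raise ValueError("slice runs past the end of the tensor")
    with open(path, "rb") as f:
        # Data offsets are relative to the first byte after the header.
        f.seek(8 + header_len + byte_begin)
        raw = f.read(4 * count)
    return list(struct.unpack(f"<{count}f", raw))


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.safetensors")
    write_minimal_safetensors(
        path, {"a": ([4], [1.0, 2.0, 3.0, 4.0]), "b": ([2], [5.0, 6.0])}
    )
    print(read_f32_slice(path, "a", 1, 2))
```

Because the header alone says where every tensor lives, a reader can seek straight to the bytes it needs instead of loading the whole file.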
@DarkLight1337 can you help take another look and let me know whether this is mergeable?
You can use AsyncLLMEngine to call it asynchronously.
I believe this is similar to #1879. While T4 can run a 7B model, the throughput will be very very low and vLLM will likely perform a lot of eviction...