Simon Mo
@khluu maybe we can set RUN_WHEEL_CHECK to false by default and turn it on in CI only.
^ @DarkLight1337 this might be related to the refactoring?
/gemini review
For more ephemeral conversations, please join the vLLM Slack and the #sig-extensible-hardware channel for discussion!
You should save the model to disk in Hugging Face format, and vLLM can load it from disk.
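A minimal sketch of that flow, assuming a small public model (`facebook/opt-125m` is just a placeholder; any Hugging Face causal LM works) and a hypothetical local directory. The vLLM load step needs a GPU, so it is shown as a comment:

```python
# Sketch: save a Hugging Face model to disk, then point vLLM at the directory.
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "facebook/opt-125m"      # placeholder source model
out_dir = "./opt-125m-local"   # hypothetical local directory vLLM will read

# save_pretrained writes config.json, weights, and tokenizer files to out_dir
AutoModelForCausalLM.from_pretrained(src).save_pretrained(out_dir)
AutoTokenizer.from_pretrained(src).save_pretrained(out_dir)

# Then load from disk by passing the directory instead of the hub name:
#   from vllm import LLM
#   llm = LLM(model=out_dir)
# or from the CLI:
#   vllm serve ./opt-125m-local
```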
Q3 Roadmap has been published #20336
@july8023 It should work on 4090; generally the model takes about 600GB of memory, then you want about 100-300GB for KV cache, so feel free to plan around that. @fsaudm A100s...
The model currently does not support --dtype bfloat16 because it is natively trained in fp8. Can you point me to the bf16 version?
vLLM does support this bf16 model on A100. It looks like the config.json properly removed `quantization_config`, so it should already work.
Hmm, can you please update the documentation with a GPU example? That's the primary reader's use case.