
Results: 87 comments by Lu Fang

Is it the 40GB A100 or the 80GB version? Also, the A100 doesn't support fp8 (FP8 tensor cores require Hopper-class hardware such as the H100, or newer). Could you confirm you are using meta-llama/Llama-4-Scout-17B-16E-Instruct? Could you download the model locally with huggingface-cli first and try...
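
For the last point, a minimal sketch of the pre-download step using huggingface_hub (the Python equivalent of `huggingface-cli download`); it assumes you have already accepted the gated-repo terms and are logged in:

```python
# Pre-download the checkpoint so serving doesn't stall on network I/O.
# Requires `pip install huggingface_hub` and an HF token with access
# to the gated Llama 4 repo.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="meta-llama/Llama-4-Scout-17B-16E-Instruct",
)
print(f"Model downloaded to: {local_path}")
```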

1. 40GB A100s should require 8 cards to serve bf16 with 16 experts (Llama 4 Scout).
2. Yes, vLLM supports fp8, but the A100 doesn't.
3. Good.
4. Thanks for confirming that.
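
For reference on point 1, a minimal sketch of the corresponding vLLM offline setup; the 8-way tensor-parallel split follows the estimate above and is an assumption, not a verified memory fit:

```python
# Sketch: serving Llama 4 Scout in bf16 across 8x A100-40GB via tensor
# parallelism. Assumes a single node with 8 GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=8,   # spread weights + 16 experts over 8 cards
    dtype="bfloat16",         # fp8 is not an option on A100 (pre-Hopper)
)
outputs = llm.generate(["Hello, Llama 4!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```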

Llama 4 requires vllm >= 0.8.3 and transformers >= 4.51.0.
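
A quick sanity check for those version floors (a sketch; assumes the `packaging` package is installed, which it usually is alongside pip):

```python
# Verify the environment satisfies the Llama 4 version floor.
from packaging.version import Version

import transformers
import vllm

assert Version(vllm.__version__) >= Version("0.8.3"), vllm.__version__
assert Version(transformers.__version__) >= Version("4.51.0"), transformers.__version__
```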

Do the reported numbers match the numbers in their repo?

Wondering if we can try more of the shapes provided on their side. Also curious about the Grouped GEMM comparison.
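
To make the shape-sweep idea concrete, a hedged sketch of a naive per-expert matmul baseline timed over a few (experts, M, N, K) shapes; this is the loop a Grouped GEMM kernel would be benchmarked against. The shapes are placeholders, not the ones from their repo:

```python
# Sketch: benchmark MoE-style shapes with CUDA-event timing. The looped
# torch.matmul is the naive baseline; a grouped-GEMM kernel would replace it.
import torch

shapes = [(16, 128, 4096, 1024), (16, 256, 4096, 1024)]  # (experts, M, N, K)

for num_experts, m, n, k in shapes:
    a = [torch.randn(m, k, device="cuda", dtype=torch.bfloat16) for _ in range(num_experts)]
    b = [torch.randn(k, n, device="cuda", dtype=torch.bfloat16) for _ in range(num_experts)]

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(3):                       # warmup
        _ = [x @ w for x, w in zip(a, b)]
    torch.cuda.synchronize()

    start.record()
    for _ in range(10):
        _ = [x @ w for x, w in zip(a, b)]
    end.record()
    torch.cuda.synchronize()
    print(f"E={num_experts} M={m} N={n} K={k}: {start.elapsed_time(end) / 10:.3f} ms/iter")
```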

Btw, shall we land these benchmark scripts? We could reuse them to expand to other kernel libraries.

Also, we should create an e2e example for this optimization in the RL setting; this can be done in a follow-up PR.

@22quinn an e2e example would be helpful here. :-)

This PR also does more work on env var management.
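
Without the PR diff at hand, here is only a generic sketch of the centralized env var pattern (a single envs module with typed, lazily-read variables instead of scattered os.getenv calls); all variable names below are hypothetical:

```python
# Sketch of centralized env var management: one module owns every variable,
# each with a parser and a default. Variable names are hypothetical.
import os
from typing import Any, Callable, Dict

_ENV_VARS: Dict[str, Callable[[], Any]] = {
    "MYLIB_USE_GROUPED_GEMM": lambda: os.getenv("MYLIB_USE_GROUPED_GEMM", "0") == "1",
    "MYLIB_NUM_WORKERS": lambda: int(os.getenv("MYLIB_NUM_WORKERS", "8")),
}

def __getattr__(name: str) -> Any:
    # Lazily evaluate on attribute access (PEP 562), so tests can
    # monkeypatch os.environ before any value is read.
    if name in _ENV_VARS:
        return _ENV_VARS[name]()
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```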