
Results: 87 comments by Lu Fang

Is it the 40GB A100 or the 80GB version? Also, the A100 doesn't support fp8 (FP8 tensor cores require Hopper-class hardware such as the H100, or newer). Could you confirm you are using meta-llama/Llama-4-Scout-17B-16E-Instruct? Could you download the model locally with huggingface-cli first and try...
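
For the last point, a minimal sketch of the pre-download step using huggingface_hub (the Python equivalent of `huggingface-cli download`); it assumes you have already accepted the gated-repo terms and are logged in:

```python
# Pre-download the checkpoint so serving doesn't stall on network I/O.
# Requires `pip install huggingface_hub` and an HF token with access
# to the gated Llama 4 repo.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="meta-llama/Llama-4-Scout-17B-16E-Instruct",
)
print(f"Model downloaded to: {local_path}")
```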

1. 40GB A100s should require 8 cards to serve bf16 with 16 experts (Llama 4 Scout).
2. Yes, vLLM supports fp8, but the A100 doesn't.
3. Good.
4. Thanks for confirming that.
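
For reference on point 1, a minimal sketch of the corresponding vLLM offline setup; the 8-way tensor-parallel split follows the estimate above and is an assumption, not a verified memory fit:

```python
# Sketch: serving Llama 4 Scout in bf16 across 8x A100-40GB via tensor
# parallelism. Assumes a single node with 8 GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=8,   # spread weights + 16 experts over 8 cards
    dtype="bfloat16",         # fp8 is not an option on A100 (pre-Hopper)
)
outputs = llm.generate(["Hello, Llama 4!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```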

Llama 4 requires vllm >= 0.8.3 and transformers >= 4.51.0.
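
A quick sanity check for those version floors (a sketch; assumes the `packaging` package is installed, which it usually is alongside pip):

```python
# Verify the environment satisfies the Llama 4 version floor.
from packaging.version import Version

import transformers
import vllm

assert Version(vllm.__version__) >= Version("0.8.3"), vllm.__version__
assert Version(transformers.__version__) >= Version("4.51.0"), transformers.__version__
```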

Do the reported numbers match the numbers in their repo?

Wondering if we can try more of the shapes provided on their side. Also curious about the Grouped GEMM comparison.
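
To make the shape-sweep idea concrete, a hedged sketch of a naive per-expert matmul baseline timed over a few (experts, M, N, K) shapes; this is the loop a Grouped GEMM kernel would be benchmarked against. The shapes are placeholders, not the ones from their repo:

```python
# Sketch: benchmark MoE-style shapes with CUDA-event timing. The looped
# torch.matmul is the naive baseline; a grouped-GEMM kernel would replace it.
import torch

shapes = [(16, 128, 4096, 1024), (16, 256, 4096, 1024)]  # (experts, M, N, K)

for num_experts, m, n, k in shapes:
    a = [torch.randn(m, k, device="cuda", dtype=torch.bfloat16) for _ in range(num_experts)]
    b = [torch.randn(k, n, device="cuda", dtype=torch.bfloat16) for _ in range(num_experts)]

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(3):                       # warmup
        _ = [x @ w for x, w in zip(a, b)]
    torch.cuda.synchronize()

    start.record()
    for _ in range(10):
        _ = [x @ w for x, w in zip(a, b)]
    end.record()
    torch.cuda.synchronize()
    print(f"E={num_experts} M={m} N={n} K={k}: {start.elapsed_time(end) / 10:.3f} ms/iter")
```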

Btw, shall we land these benchmark scripts? We could reuse them to expand to other kernel libraries.

Also, we should create an e2e example for this optimization in the RL setting; this can be done in a follow-up PR.

@22quinn an e2e example would be helpful here. :-)

This PR also does more work on env var management.
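
Without the PR diff at hand, here is only a generic sketch of the centralized env var pattern (a single envs module with typed, lazily-read variables instead of scattered os.getenv calls); all variable names below are hypothetical:

```python
# Sketch of centralized env var management: one module owns every variable,
# each with a parser and a default. Variable names are hypothetical.
import os
from typing import Any, Callable, Dict

_ENV_VARS: Dict[str, Callable[[], Any]] = {
    "MYLIB_USE_GROUPED_GEMM": lambda: os.getenv("MYLIB_USE_GROUPED_GEMM", "0") == "1",
    "MYLIB_NUM_WORKERS": lambda: int(os.getenv("MYLIB_NUM_WORKERS", "8")),
}

def __getattr__(name: str) -> Any:
    # Lazily evaluate on attribute access (PEP 562), so tests can
    # monkeypatch os.environ before any value is read.
    if name in _ENV_VARS:
        return _ENV_VARS[name]()
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```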