Change FLASHINFER_WORKSPACE_BUFFER_SIZE to be configurable by envvar
Summary: We started to see "not enough allocated workspace buffer" errors from FlashInfer when running long-context workloads with large batch sizes. Add an option to tune the buffer size via an environment variable.
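The change itself is small: the previously fixed workspace size becomes overridable through the environment. A minimal sketch of the pattern, assuming the constant lives in the FlashInfer attention backend (the 256 MiB default and helper names below are illustrative, not the exact code):

import os

# Previously hard-coded workspace size; the 256 MiB value here is illustrative.
_DEFAULT_WORKSPACE_BUFFER_SIZE = 256 * 1024 * 1024

# Allow overriding via FLASHINFER_WORKSPACE_BUFFER_SIZE (in bytes);
# fall back to the built-in default when the variable is unset.
FLASHINFER_WORKSPACE_BUFFER_SIZE = int(
    os.environ.get("FLASHINFER_WORKSPACE_BUFFER_SIZE", _DEFAULT_WORKSPACE_BUFFER_SIZE)
)

With this in place, setting FLASHINFER_WORKSPACE_BUFFER_SIZE=$((1024*1024*1024)) as in the test plan below bumps the workspace to 1 GiB without a code change.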
Test Plan:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
VLLM_DISABLE_COMPILE_CACHE=1 \
VLLM_GPU_MEMORY_UTILIZATION=0.90 \
VLLM_USE_V1=1 \
FLASHINFER_WORKSPACE_BUFFER_SIZE=$((1024*1024*1024)) \
buck2 run @//mode/opt \
-m ovr_config//triton:trunk \
-c fbcode.enable_vllm=true \
-c fbcode.enable_gpu_sections=true \
-c fbcode.platform010_cuda_version=12.8 \
-c fbcode.nvcc_arch=h100a \
//smart/inference_platform_sp/llm_predictor_gpu:service -- \
--local_cache_dir "/data/local/models/Qwen3-VL-235B-A22B-Thinking" \
--try_local_cache \
--thrift_server_port 12345 \
--max_seq_len=65536 \
--max_num_batched_tokens 65536 \
--max_concurrent_requests_multiplier=2 \
--max_batch_size=64 \
--enable_warmup=true \
--model_mf_bucket=llm_inference \
--model_mf_path=tree/oss/Qwen3-VL-235B-A22B-Thinking \
--force_llm_format=true \
--allow_custom_stop_tokens \
--model_parallel_size 8 \
--enable-expert-parallel \
--vllm_engine \
2>&1 | tee "/tmp/$USER/server_vllm.log"
Differential Revision: D83693814
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @zzh142857.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork