
Change FLASHINFER_WORKSPACE_BUFFER_SIZE to be configurable by envvar

Status: Open — zzh142857 opened this issue 1 month ago • 2 comments

Summary: We started seeing "not enough allocated workspace buffer" errors from FlashInfer when running long contexts at large batch sizes. This change adds an option to tune the workspace buffer size via an environment variable.
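For context, here is a minimal sketch of the kind of plumbing such a change involves: reading FLASHINFER_WORKSPACE_BUFFER_SIZE from the environment with a fallback default. The function name and the 256 MiB fallback are illustrative assumptions, not the actual vLLM diff.

import os

# Hypothetical fallback; the real default lives in vLLM's FlashInfer backend.
_DEFAULT_WORKSPACE_BUFFER_SIZE = 256 * 1024 * 1024  # 256 MiB

def get_flashinfer_workspace_buffer_size() -> int:
    """Return the FlashInfer workspace buffer size in bytes.

    Honors FLASHINFER_WORKSPACE_BUFFER_SIZE when set; otherwise falls
    back to the default above.
    """
    raw = os.environ.get("FLASHINFER_WORKSPACE_BUFFER_SIZE")
    if raw is None:
        return _DEFAULT_WORKSPACE_BUFFER_SIZE
    size = int(raw)
    if size <= 0:
        raise ValueError(
            f"FLASHINFER_WORKSPACE_BUFFER_SIZE must be positive, got {size}")
    return size

# The buffer would then be allocated once per device and handed to the
# FlashInfer wrappers, e.g. (assuming torch is available):
#   workspace = torch.empty(get_flashinfer_workspace_buffer_size(),
#                           dtype=torch.uint8, device="cuda")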

Test Plan:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
VLLM_DISABLE_COMPILE_CACHE=1 \
VLLM_GPU_MEMORY_UTILIZATION=0.90 \
VLLM_USE_V1=1 \
FLASHINFER_WORKSPACE_BUFFER_SIZE=$((1024*1024*1024)) \
buck2 run @//mode/opt \
  -m ovr_config//triton:trunk \
  -c fbcode.enable_vllm=true \
  -c fbcode.enable_gpu_sections=true \
  -c fbcode.platform010_cuda_version=12.8 \
  -c fbcode.nvcc_arch=h100a \
  //smart/inference_platform_sp/llm_predictor_gpu:service -- \
  --local_cache_dir "/data/local/models/Qwen3-VL-235B-A22B-Thinking" \
  --try_local_cache \
  --thrift_server_port 12345 \
  --max_seq_len=65536 \
  --max_num_batched_tokens 65536 \
  --max_concurrent_requests_multiplier=2 \
  --max_batch_size=64 \
  --enable_warmup=true \
  --model_mf_bucket=llm_inference \
  --model_mf_path=tree/oss/Qwen3-VL-235B-A22B-Thinking \
  --force_llm_format=true \
  --allow_custom_stop_tokens \
  --model_parallel_size 8 \
  --enable-expert-parallel \
  --vllm_engine \
  2>&1 | tee "/tmp/$USER/server_vllm.log"
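Note: the run above sets FLASHINFER_WORKSPACE_BUFFER_SIZE to $((1024*1024*1024)) = 1073741824 bytes (1 GiB), overriding the built-in default to accommodate the long-context, large-batch configuration.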

Differential Revision: D83693814

— zzh142857, Oct 01 '25 21:10

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @zzh142857.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

— mergify[bot], Oct 07 '25 22:10

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @zzh142857.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

— mergify[bot], Nov 14 '25 19:11