Unable to set `dict` environment variables due to missing JSON parsing
I'm trying to set the `rope_scaling` engine argument via its environment variable. The documentation mentions this as possible, but the assignment in engine_args.py passes the environment variable through as a plain string (which environment variables always are) with no JSON parsing. The same problem affects every other engine argument that expects a `dict`.
Because a handler is used by default to proxy requests to the serverless vLLM instances, simply overriding the command with `vllm serve` is not sufficient in this case.
# Example for Qwen/Qwen3-30B-A3B with extended context window
vllm serve --rope-scaling '{"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768}'
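When deploying through this worker, that setting has to travel through an environment variable instead, and the value arrives as a plain string. Below is a minimal sketch of the mismatch; the `ROPE_SCALING` variable name is the one used later in this thread, and the snippet is only illustrative, not the worker's actual code:

```python
import json
import os

# The endpoint's environment variable holds the dict as a JSON string, e.g.:
# ROPE_SCALING='{"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768}'
raw = os.getenv(
    "ROPE_SCALING",
    '{"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768}',
)

print(type(raw))              # <class 'str'>  - what currently gets passed through
print(type(json.loads(raw)))  # <class 'dict'> - what vLLM expects for rope_scaling
```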
Having the same issue with Menlo/Jan-nano (unsurprisingly, since it's just a fine-tuned Qwen3). Passing the rope-scaling argument as a JSON string breaks it.
vllm serve Menlo/Jan-nano-128k \
--host 0.0.0.0 \
--port 1234 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--rope-scaling '{"rope_type":"yarn","factor":3.2,"original_max_position_embeddings":40960}' --max-model-len 131072
Log output from my RunPod serverless instance:
cjl5qbnee80y97[info]rope_scaling\n
cjl5qbnee80y97[info]engine.py :170 2025-07-01 20:43:19,240 Error initializing vLLM engine: 1 validation error for ModelConfig\n
cjl5qbnee80y97[info]engine.py :27 2025-07-01 20:43:19,233 Engine args: AsyncEngineArgs(model='Menlo/Jan-nano-128k', served_model_name=None, tokenizer=None, hf_config_path=None, task='auto', skip_tokenizer_init=False, enable_prompt_embeds=False, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path='', download_dir='/runpod-volume/.hf-cache', load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', seed=0, max_model_len=131072, cuda_graph_sizes=[512], distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, data_parallel_size_local=None, data_parallel_address=None, data_parallel_rpc_port=None, data_parallel_backend='mp', enable_expert_parallel=False, max_parallel_loading_workers=None, block_size=16, enable_prefix_caching=False, prefix_caching_hash_algo='builtin', disable_sliding_window=False, disable_cascade_attn=False, use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None, rope_scaling='{"rope_type":"yarn","factor":3.2,"original_max_position_embeddings":40960}', rope_theta=None, hf_token=None, hf_overrides={}, tokenizer_revision=None, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, fully_sharded_loras=False, max_cpu_loras=None, lora_dtype='auto', lora_extra_vocab_size=256, long_lora_scaling_factors=None, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config={}, ignore_patterns=None, preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, disable_chunked_mm_input=False, disable_hybrid_kv_cache_manager=False, guided_decoding_backend='outlines', guided_decoding_disable_fallback=False, guided_decoding_disable_any_whitespace=False, guided_decoding_disable_additional_properties=False, logits_processor_pattern=None, speculative_config=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config={}, override_pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":null,"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":null,"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":null,"local_cache_dir":null}, worker_cls='auto', worker_extension_cls='', kv_transfer_config=None, kv_events_config=None, generation_config='auto', enable_sleep_mode=False, override_generation_config={}, model_impl='auto', calculate_kv_scales=False, additional_config={}, enable_reasoning=None, reasoning_parser='', use_tqdm_on_load=True, 
pt_load_map_location='cpu', enable_multimodal_encoder_data_parallel=False, disable_log_requests=False)\n
I have no clue about Python or vLLM, but wouldn't `"rope_scaling": json.loads(os.getenv("ROPE_SCALING", "null")),` be enough? It looks like the vLLM engine always expects a dict.
Just added this to my fork; I'll test tomorrow whether I can convince the Docker image to build.
@Code42Cate I guess both of our solutions will work just fine; mine just adds a bit more predictability. Let's see which way they'll go. Thanks!
Ah yeah, I was too lazy to add the other ones; yours is better :D
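For context, one way the more general variant discussed above might look: try to JSON-decode every environment value that maps to a dict-typed engine argument, and fall back to the raw string otherwise. This is purely a sketch; the helper name and the argument list are assumptions, not the actual fork code:

```python
import json
import os

# Engine arguments that vLLM expects as dicts rather than strings (illustrative list,
# taken from the dict-valued fields visible in the logs above).
DICT_ARGS = {"rope_scaling", "hf_overrides", "mm_processor_kwargs", "override_generation_config"}

def parse_env_value(arg_name: str, raw: str):
    """Decode JSON for dict-typed arguments; leave everything else as the raw string."""
    if arg_name in DICT_ARGS:
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Fall back to the raw value so non-JSON input behaves as before.
            return raw
    return raw

# Example: ROPE_SCALING='{"rope_type":"yarn","factor":3.2,"original_max_position_embeddings":40960}'
value = os.getenv("ROPE_SCALING")
if value is not None:
    print(parse_env_value("rope_scaling", value))  # prints a dict, not a str
```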
Guys, could you help me a bit? I'm having trouble with this. I see that you fixed the code, but which Docker image are you using to make everything work? By everything, I mean the rope scaling setup for Qwen3.
Fork the repository with the fixed code (mine, or preferably @FuxMak's) and deploy from your fork instead of from their provided Docker image.
Thank you! I did that, and even fixed things up a little, but it still does not work...
It keeps the Qwen maximum context at 40960...
What's in your logs regarding the rope_scaling parameter when you start the pod? I remember I had to fiddle around quite a bit before I got this fix and the pod up and running.
Thank you guys for helping me out, look:
2025-08-12 00:24:47.232 | info | 2p2cm2rp7hmrkx | INFO 08-12 03:24:47 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0\n 2025-08-12 00:24:46.703 | info | 2p2cm2rp7hmrkx | INFO 08-12 03:24:46 [cuda.py:327] Using Flash Attention backend.\n 2025-08-12 00:24:46.703 | info | 2p2cm2rp7hmrkx | INFO 08-12 03:24:46 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.1) with config: model='/runpod-volume/models/summary', speculative_config=None, tokenizer='/runpod-volume/models/summary', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='outlines', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/runpod-volume/models/summary, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":256,"local_cache_dir":null}, use_cached_outputs=False, \n 2025-08-12 00:24:46.703 | info | 2p2cm2rp7hmrkx | INFO 08-12 03:24:46 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=2048.\n 2025-08-12 00:24:46.703 | warning | 2p2cm2rp7hmrkx | WARNING 08-12 03:24:46 [arg_utils.py:1479] Chunked prefill is enabled by default for models with max_model_len > 32K. Chunked prefill might not work with some features or models. If you encounter any issues, please disable by launching with --enable-chunked-prefill=False.\n 2025-08-12 00:24:46.702 | warning | 2p2cm2rp7hmrkx | WARNING 08-12 03:24:46 [arg_utils.py:1642] --guided-decoding-backend=outlines is not supported by the V1 Engine. Falling back to V0. \n 2025-08-12 00:24:46.186 | info | 2p2cm2rp7hmrkx | INFO 08-12 03:24:46 [config.py:823] This model supports multiple tasks: {'classify', 'embed', 'generate', 'reward', 'score'}. 
Defaulting to 'generate'.\n 2025-08-12 00:24:46.186 | info | 2p2cm2rp7hmrkx | INFO 08-12 03:24:38 [config.py:533] Overriding HF config with {'rope_scaling': {'rope_type': 'yarn', 'factor': 2, 'original_max_position_embeddings': 32768}}\n 2025-08-12 00:24:46.186 | info | 2p2cm2rp7hmrkx | engine.py :27 2025-08-12 03:24:38,331 Engine args: AsyncEngineArgs(model='/runpod-volume/models/summary', served_model_name=None, tokenizer=None, hf_config_path=None, task='auto', skip_tokenizer_init=False, enable_prompt_embeds=False, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path='', download_dir=None, load_format='auto', config_format='auto', dtype='bfloat16', kv_cache_dtype='auto', seed=0, max_model_len=65536, cuda_graph_sizes=[512], distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, data_parallel_size_local=None, data_parallel_address=None, data_parallel_rpc_port=None, data_parallel_backend='mp', enable_expert_parallel=False, max_parallel_loading_workers=None, block_size=16, enable_prefix_caching=False, prefix_caching_hash_algo='builtin', disable_sliding_window=False, disable_cascade_attn=False, use_v2_block_manager='1', swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None, rope_scaling={'rope_type': 'yarn', 'factor': 2, 'original_max_position_embeddings': 32768}, rope_theta=None, hf_token=None, hf_overrides={}, tokenizer_revision=None, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, fully_sharded_loras=False, max_cpu_loras=None, lora_dtype='auto', lora_extra_vocab_size=256, long_lora_scaling_factors=None, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config={}, ignore_patterns=None, preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, disable_chunked_mm_input=False, disable_hybrid_kv_cache_manager=False, guided_decoding_backend='outlines', guided_decoding_disable_fallback=False, guided_decoding_disable_any_whitespace=False, guided_decoding_disable_additional_properties=False, logits_processor_pattern=None, speculative_config=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config={}, override_pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":null,"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":null,"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":null,"local_cache_dir":null}, worker_cls='auto', worker_extension_cls='', 
kv_transfer_config=None, kv_events_config=None, generation_config='auto', enable_sleep_mode=False, override_generation_config={}, model_impl='auto', calculate_kv_scales=False, additional_config={}, enable_reasoning=None, reasoning_parser='', use_tqdm_on_load=True, pt_load_map_location='cpu', enable_multimodal_encoder_data_parallel=False, disable_log_requests=False)\n
I do start with the model's rope scaling; it applies, and the log even says it is overriding the current settings (from config.json). Then it initializes a V0 vLLM engine, yet the engine config still shows max_seq_len=40960.
it]\n 2025-08-12 00:25:04.108 | info | 2p2cm2rp7hmrkx | \rLoading safetensors checkpoint shards: 50% Completed | 2/4 [00:16<00:16, 8.30s/it]\n 2025-08-12 00:24:55.590 | info | 2p2cm2rp7hmrkx | \rLoading safetensors checkpoint shards: 25% Completed | 1/4 [00:07<00:23, 7.98s/it]\n 2025-08-12 00:24:47.611 | info | 2p2cm2rp7hmrkx | \rLoading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]\n 2025-08-12 00:24:47.611 | info | 2p2cm2rp7hmrkx | INFO 08-12 03:24:47 [model_runner.py:1171] Starting to load model /runpod-volume/models/summary...\n
Guys! Thank you so much, I managed to make it work. The thing was: I installed the new version (which unfortunately does not include FuxMak's fix), ignored the ROPE_SCALING env variable entirely, set rope_scaling directly in the model's config.json, and set MAX_SEQ_LEN or MODEL_MAX_LEN to the rope scaling factor multiplied by the original max position embeddings.
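For anyone following the same route, here is a sketch of that workaround using the numbers from the logs above (factor 2 on an original 32768-token window); the config path is a placeholder:

```python
import json

# Add the rope_scaling block to the model's config.json and derive the matching
# context length: factor * original_max_position_embeddings = 2 * 32768 = 65536.
config_path = "/runpod-volume/models/summary/config.json"  # placeholder path

with open(config_path) as f:
    config = json.load(f)

config["rope_scaling"] = {
    "rope_type": "yarn",
    "factor": 2,
    "original_max_position_embeddings": 32768,
}

max_len = config["rope_scaling"]["factor"] * config["rope_scaling"]["original_max_position_embeddings"]
print(max_len)  # 65536 -> the value to use for the worker's max length variable

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```

Whichever variable name the worker expects for the maximum length (MAX_SEQ_LEN or MODEL_MAX_LEN, as mentioned above), the value has to match that product.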