[Bug] Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
### Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
### Describe the bug
I am using Qwen2.5-72B, which supports positional extrapolation via YaRN by adding this config (copied from https://huggingface.co/Qwen/Qwen2.5-72B-Instruct):
```json
{
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```
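For context, with this config the YaRN-extended context length should be `original_max_position_embeddings × factor`. A minimal sketch of that arithmetic (variable names are mine, not sglang's):

```python
# Arithmetic only: what the YaRN-extended context length should be.
rope_scaling = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}
derived = int(rope_scaling["original_max_position_embeddings"] * rope_scaling["factor"])
print(derived)  # 131072, which comfortably covers the requested 128000
```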
However, this does not seem to be supported by sglang; when I specify context_length=128000, I get:
```
ValueError: User-specified context_length (128000) is greater than the derived context_length (32768). This may lead to incorrect model outputs or CUDA errors. Note that the derived context_length may differ from max_position_embeddings in the model's config. To allow overriding this maximum, set the env var SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
```
I am not sure whether setting SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 would make YaRN work as it normally should. Besides, I also get:
```
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
```
I need some help; this config works well in vLLM.
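For what it's worth, here is how I would apply the override from the error message (a sketch; `--model-path` and `--context-length` are standard sglang launcher flags, and whether the override actually yields correct YaRN behavior is exactly my question):

```python
# Sketch: launch the sglang server with the override env var from the error.
import os
import subprocess

env = dict(os.environ, SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN="1")
subprocess.run(
    [
        "python", "-m", "sglang.launch_server",
        "--model-path", "Qwen/Qwen2.5-72B-Instruct",
        "--context-length", "128000",
    ],
    env=env,
    check=True,
)
```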
### Reproduction
.
### Environment
```
Python: 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA A100-SXM4-80GB
GPU 0,1,2,3 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 470.103.01
PyTorch: 2.5.1+cu121
sglang: 0.4.1.post5
flashinfer: 0.2.0.post1+cu121torch2.4
triton: 3.1.0
transformers: 4.47.1
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.11.9
fastapi: 0.115.6
hf_transfer: Module Not Found
huggingface_hub: 0.26.3
interegular: 0.3.3
modelscope: Module Not Found
orjson: 3.10.12
packaging: 24.0
psutil: 6.1.0
pydantic: 2.10.3
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.32.1
uvloop: 0.21.0
vllm: 0.6.4.post1
xgrammar: Module Not Found
openai: 1.56.2
anthropic: Module Not Found
litellm: Module Not Found
decord: Module Not Found

NVIDIA Topology:
        GPU0   GPU1   GPU2   GPU3   mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7 CPU Affinity   NUMA Affinity
GPU0     X     NV12   NV12   NV12   SYS    SYS    SYS    SYS    PXB    PXB    NODE   NODE   48-95,144-191  1
GPU1    NV12    X     NV12   NV12   SYS    SYS    SYS    SYS    PXB    PXB    NODE   NODE   48-95,144-191  1
GPU2    NV12   NV12    X     NV12   SYS    SYS    SYS    SYS    NODE   NODE   PXB    PXB    48-95,144-191  1
GPU3    NV12   NV12   NV12    X     SYS    SYS    SYS    SYS    NODE   NODE   PXB    PXB    48-95,144-191  1
mlx5_0  SYS    SYS    SYS    SYS     X     PIX    NODE   NODE   SYS    SYS    SYS    SYS
mlx5_1  SYS    SYS    SYS    SYS    PIX     X     NODE   NODE   SYS    SYS    SYS    SYS
mlx5_2  SYS    SYS    SYS    SYS    NODE   NODE    X     PIX    SYS    SYS    SYS    SYS
mlx5_3  SYS    SYS    SYS    SYS    NODE   NODE   PIX     X     SYS    SYS    SYS    SYS
mlx5_4  PXB    PXB    NODE   NODE   SYS    SYS    SYS    SYS     X     PIX    NODE   NODE
mlx5_5  PXB    PXB    NODE   NODE   SYS    SYS    SYS    SYS    PIX     X     NODE   NODE
mlx5_6  NODE   NODE   PXB    PXB    SYS    SYS    SYS    SYS    NODE   NODE    X     PIX
mlx5_7  NODE   NODE   PXB    PXB    SYS    SYS    SYS    SYS    NODE   NODE   PIX     X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1048576
```
@zhyncs Who is on the long context part?
Seems related to this PR https://github.com/sgl-project/sglang/pull/757
scaling_factor is set to 1, which leads to a wrong context_length.
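To illustrate: if the derivation forces the scaling factor to 1 whenever `original_max_position_embeddings` is present, the derived context length collapses back to the base window. A simplified sketch of the suspected logic (not sglang's actual code):

```python
# Simplified sketch of the suspected derivation bug; not sglang's actual code.
def derive_context_length(max_position_embeddings: int, rope_scaling: dict) -> int:
    rope_scaling_factor = rope_scaling.get("factor", 1.0)
    # Suspect branch: presence of original_max_position_embeddings resets the
    # factor to 1, discarding the YaRN extension entirely.
    if "original_max_position_embeddings" in rope_scaling:
        rope_scaling_factor = 1
    return int(max_position_embeddings * rope_scaling_factor)

cfg = {"factor": 4.0, "original_max_position_embeddings": 32768, "type": "yarn"}
print(derive_context_length(32768, cfg))  # 32768 -- the wrong value from the error
print(int(32768 * cfg["factor"]))         # 131072 -- what YaRN should give
```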
BTW, I also saw "Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}" in vLLM. Seems unrelated.
Thanks. We will check and see.
Has anyone found a fix for this? I am seeing this too.
I've looked into this. It seems removing these lines can fix it:
if "original_max_position_embeddings" in rope_scaling:
rope_scaling_factor = 1
@rangehow Could you try this?
if "original_max_position_embeddings" in rope_scaling:
rope_scaling_factor = 1
Also, I have asked the Qwen team for help.
```
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
```
I've only encountered this warning, and no errors are reported. So, can this warning be ignored?
I will ask the Qwen team about this.
Changing the key from `type` to `rope_type` matches what transformers expects, and it removes the warning. But does this confirm that YaRN is implemented and working?
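For anyone wanting to try that rename on a local snapshot, a small sketch (the config path is a placeholder):

```python
# Sketch: rename "type" -> "rope_type" in rope_scaling so it matches what
# newer transformers versions expect. The path below is a placeholder.
import json

config_path = "/path/to/Qwen2.5-72B-Instruct/config.json"  # placeholder
with open(config_path) as f:
    config = json.load(f)

rope_scaling = config.get("rope_scaling") or {}
if "type" in rope_scaling and "rope_type" not in rope_scaling:
    rope_scaling["rope_type"] = rope_scaling.pop("type")
    config["rope_scaling"] = rope_scaling

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```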
It seems to be fixed.
Previously, transformers would raise warnings about unrecognized keys despite this being a valid configuration parameter. This was fixed in https://github.com/huggingface/transformers/pull/36877.
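A quick way to check how your installed transformers parses this config (downloads the config from the Hub):

```python
# Quick check: inspect rope_scaling as parsed by the installed transformers.
# On versions predating the fix linked above, loading a YaRN config containing
# original_max_position_embeddings emits the "Unrecognized keys" warning;
# on patched versions it should not.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-72B-Instruct")
print(cfg.rope_scaling)
```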