
[Feature] Will the turbomind backend support guided_decoding?

Open shell-nlp opened this issue 1 year ago • 13 comments

Motivation

Will the turbomind backend support guided_decoding?

Related resources

No response

Additional context

No response

shell-nlp avatar Nov 19 '24 08:11 shell-nlp

Same question here.

sph116 avatar Nov 20 '24 06:11 sph116

We will first evaluate the overall workload and then get back to everyone.

lvhan028 avatar Nov 21 '24 02:11 lvhan028

The team is currently prioritizing some internal requirements, so support for this feature will be delayed.

lvhan028 avatar Nov 26 '24 03:11 lvhan028

When will this feature be supported?

shell-nlp avatar Dec 26 '24 11:12 shell-nlp

Not until after the Spring Festival at the earliest...

lvhan028 avatar Dec 26 '24 13:12 lvhan028

The holiday is over; when will it be supported?

shell-nlp avatar Feb 08 '25 07:02 shell-nlp

Sorry, things have changed quickly and the team currently has no one available to work on this request.

lvhan028 avatar Feb 08 '25 09:02 lvhan028

guided_decoding is a very useful feature; I suggest prioritizing its support.

shell-nlp avatar Feb 28 '25 08:02 shell-nlp

Sorry, the team's highest priority right now is internal requirements, so guided decoding cannot be scheduled for the time being.

lvhan028 avatar Mar 03 '25 10:03 lvhan028

@CUHKSZzxy, you may put it on your work list.

lvhan028 avatar Mar 03 '25 10:03 lvhan028

@lvhan028 Many agent frameworks (such as langchain) rely on guided_decoding as part of building agents, so I once again suggest supporting it as soon as possible; otherwise, when using agent frameworks, vllm is the only inference backend that works.

shell-nlp avatar Mar 25 '25 13:03 shell-nlp

The pytorch engine supports it; could you use that for now? On the turbomind side, this feature has not been scheduled yet due to limited manpower.

lvhan028 avatar Mar 25 '25 14:03 lvhan028
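
For readers looking for the suggested workaround, here is a minimal sketch of guided decoding with the PyTorch engine via the pipeline API. It assumes GenerationConfig accepts a response_format dict shaped like the OpenAI-style payload shown later in this thread; the exact field name and shape may differ between LMDeploy versions, and the model path is only an example.

# Hedged sketch: guided decoding with the PyTorch engine (not turbomind).
# Assumption: GenerationConfig exposes a response_format field shaped like the
# OpenAI-style payload used later in this thread; verify against your version.
from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline

guided_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "skills": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "skills"],
}

pipe = pipeline("Qwen/Qwen3-0.6B", backend_config=PytorchEngineConfig())
gen_config = GenerationConfig(
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "user_profile", "schema": guided_schema},
    },
)
print(pipe(["Make a self introduction please."], gen_config=gen_config))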

Are there any plans recently? This feature seems fairly easy to implement, and it is an important one.

shell-nlp avatar May 12 '25 14:05 shell-nlp

@lvhan028 Has this feature been scheduled yet? In my view, performance gains matter less than certain features do for business use cases, and this feature seems easy to implement.

shell-nlp avatar Jun 04 '25 04:06 shell-nlp

We have no bandwidth for community-requested features at the moment; PRs from the community adding this to lmdeploy are very welcome.

lvhan028 avatar Jun 04 '25 09:06 lvhan028

Please support this!

shell-nlp avatar Jul 16 '25 13:07 shell-nlp

@lvhan028 Has this feature been scheduled yet? In my view, performance gains matter less than certain features do for business use cases, and this feature seems easy to implement.

@shell-nlp Sorry, the team has been fully saturated with work and simply cannot spare anyone. Excellent open-source libraries such as vllm and sglang already have this capability and can be used. If it is easy to implement, you could try implementing it yourself. If you can contribute this feature to lmdeploy via a PR, we would be very welcoming and deeply honored.

lvhan028 avatar Jul 16 '25 15:07 lvhan028

Hi, I would like to try adding guided_decoding support to the turbomind backend. Could you provide some guidance? From my initial look, it seems hard to support this feature through pure Python code changes.

xiaoajie738 avatar Jul 22 '25 15:07 xiaoajie738

😁 Nudging again.

shell-nlp avatar Sep 10 '25 09:09 shell-nlp

I have agreed with @irexyc: this time it is really on the schedule, pinky promise.

lvhan028 avatar Sep 10 '25 11:09 lvhan028

Any work in progress, guys?

tuilakhanh avatar Sep 27 '25 17:09 tuilakhanh

Any work in progress, guys?

You can try PR #3965 and follow the progress there. Thanks.

windreamer avatar Sep 28 '25 01:09 windreamer

You can try PR #3965 and follow the progress there. Thanks.

Thank you for your work. Do you plan to implement response_format for the OpenAI API Server afterwards?

tuilakhanh avatar Sep 29 '25 02:09 tuilakhanh

You can try PR #3965 and follow the progress there. Thanks.

Thank you for your work. Do you plan to implement response_format for the OpenAI API Server afterwards?

Thank you for your support and patience. #3965 is just the beginning of better guided decoding support in LMDeploy. We will see whether it is of sufficient quality to be merged, and after that I think there is still some work to do to optimize performance. As for OpenAI-style guided decoding, frankly speaking, we have not discussed it yet.

If you are interested in implementing it, I believe we will all be happy to accept it!

windreamer avatar Sep 29 '25 06:09 windreamer

You can try PR #3965 and follow the progress there. Thanks.

Thank you for your work. Do you plan to implement response_format for the OpenAI API Server afterwards?

After some digging, I think I have enabled response_format for the OpenAI API Server in the latest commit of the PR. Maybe you can give it a try?

windreamer avatar Sep 29 '25 10:09 windreamer

After some digging, I think I have enabled response_format for the OpenAI API Server in the latest commit of the PR. Maybe you can give it a try?

2025-09-30 15:27:32,716 - lmdeploy - ERROR - async_engine.py:663 - [safe_run] exception caught: KeyError 'schema'
2025-09-30 15:27:32,716 - lmdeploy - ERROR - async_engine.py:648 - [model_inst] exception caught: 'schema'
INFO:     10.69.1.103:56928 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/opt/py3/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/opt/py3/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/fastapi/applications.py", line 1133, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/opt/py3/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/py3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/py3/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/routing.py", line 716, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/routing.py", line 736, in app
    await route.handle(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/routing.py", line 290, in handle
    await self.app(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/fastapi/routing.py", line 123, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/py3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/py3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/py3/lib/python3.10/site-packages/fastapi/routing.py", line 109, in app
    response = await f(request)
  File "/opt/py3/lib/python3.10/site-packages/fastapi/routing.py", line 387, in app
    raw_response = await run_endpoint_function(
  File "/opt/py3/lib/python3.10/site-packages/fastapi/routing.py", line 288, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/serve/openai/api_server.py", line 586, in chat_completions_v1
    if final_res.finish_reason == 'stop' and len(message.tool_calls) > 0:
AttributeError: 'NoneType' object has no attribute 'finish_reason'

curl works, but when trying with the Python openai library, lmdeploy raises a 500 error. The logs are above.

tuilakhanh avatar Sep 30 '25 08:09 tuilakhanh

schema

Can I have your test code? I need some more info to debug. Thank you.

windreamer avatar Sep 30 '25 10:09 windreamer

After some digging, I think I have enabled response_format for the OpenAI API Server in the latest commit of the PR. Maybe you can give it a try?

2025-09-30 15:27:32,716 - lmdeploy - ERROR - async_engine.py:663 - [safe_run] exception caught: KeyError 'schema'
2025-09-30 15:27:32,716 - lmdeploy - ERROR - async_engine.py:648 - [model_inst] exception caught: 'schema'
INFO:     10.69.1.103:56928 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/serve/openai/api_server.py", line 586, in chat_completions_v1
    if final_res.finish_reason == 'stop' and len(message.tool_calls) > 0:
AttributeError: 'NoneType' object has no attribute 'finish_reason'

curl works, but when trying with the Python openai library, lmdeploy raises a 500 error. The logs are above.

I have tested successfully using the following script:

from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="http://0.0.0.0:23333/v1")

model_name = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Make a self introduction please."},
    ],
    temperature=0.8,
    top_p=0.8,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user_profile",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "skills": {
                        "type": "array",
                        "items": {"type": "string", "maxLength": 10},
                        "minItems": 3,
                        "maxItems": 10,
                    },
                    "work history": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "company": {"type": "string"},
                                "duration": {"type": "string"},
                            },
                            "required": ["company"],
                        },
                    },
                },
                "required": ["name", "skills", "work history"],
            },
        },
    },
)

print(response)

I got the following response:

ChatCompletion(id='8', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{"name": "Alice", "skills": ["HTML", "CSS", "JavaScript", "Python", "SQL", "Git", "Docker", "AWS", "Linux", "ReactJS"], "work history": [{"company": "Company A", "duration": "2020-2023"}, {"company": "Company B", "duration": "2023-2024"}] }', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, gen_tokens=None, reasoning_content=None))], created=1759227863, model='/home/windreamer/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=89, prompt_tokens=25, total_tokens=114, completion_tokens_details=None, prompt_tokens_details=None))

windreamer avatar Sep 30 '25 10:09 windreamer

import json
from typing import List

from openai import OpenAI
from pydantic import BaseModel


class StoryOutput(BaseModel):
    title: str
    characters: List[str]
    moral: str


client = OpenAI(
    base_url="",
    api_key="EMPTY",
)

schema = StoryOutput.model_json_schema()

prompt = (
    "Kể một câu chuyện ngắn vui nhộn về một con mèo và một con robot trong công viên."
)

resp = client.chat.completions.create(
    model="",
    messages=[
        {
            "role": "system",
            "content": "Return ONLY valid JSON that matches the JSON Schema.",
        },
        {"role": "user", "content": prompt},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "StoryOutput",
            "schema": schema,
            "strict": True,
        },
    },
    temperature=0.7,
)

content = resp.choices[0].message.content
data = json.loads(content)
story = StoryOutput.model_validate(data)

Here is my code.

tuilakhanh avatar Sep 30 '25 12:09 tuilakhanh

Here is my code.

Thank you for the code to reproduce the issue. I have identified the bug and fixed it in the latest commit; you can verify whether it has been fixed completely.

This is due to the Pydantic model used by LMDeploy: schema is reserved for Pydantic BaseModel, so we had to rename the field to json_schema to avoid a name conflict and set an alias of schema so that the incoming JSON is still deserialized successfully. However, we also serialize the model to JSON internally, and the result used json_schema instead of schema as the field name, which is the root cause of the bug.

So, in the latest commit of the PR, I set the model to use the alias during serialization as well. I believe this resolves the current issue.

windreamer avatar Oct 09 '25 04:10 windreamer
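
As an illustration of the alias fix described above, here is a minimal Pydantic v2 sketch; the JsonSchemaFormat class below is hypothetical and is not the actual model from PR #3965. The incoming JSON uses "schema", the Python field is named json_schema, and serialization must emit the alias again so downstream code that looks up "schema" keeps working.

# Hypothetical sketch of the alias pattern described above (not LMDeploy's code).
from pydantic import BaseModel, ConfigDict, Field


class JsonSchemaFormat(BaseModel):
    # Accept "schema" from the request body but store it as "json_schema",
    # since a field literally named "schema" shadows a BaseModel attribute.
    model_config = ConfigDict(populate_by_name=True)
    name: str
    json_schema: dict = Field(alias="schema")


payload = {"name": "StoryOutput", "schema": {"type": "object"}}
fmt = JsonSchemaFormat.model_validate(payload)

# Dumping without by_alias=True would emit "json_schema", which is what led to
# the KeyError: 'schema' further down the stack; dumping by alias restores it.
print(fmt.model_dump(by_alias=True))  # {'name': 'StoryOutput', 'schema': {'type': 'object'}}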