outlines
Enum type with non-ASCII characters not working properly
Describe the issue as clearly as possible:
When using JSON-structured generation, enum types with non-ASCII characters do not work properly. Non-ASCII characters (such as Chinese) are forcibly encoded into ASCII escape sequences. This behavior leads to much slower generation and much worse output quality.
The example code in the section below won't raise an error; it's just an example for debugging. It's clearer to inspect the direct output of the LLM, for example at line #225 in api.py.
In this example, the expected output is '开心', which is 2 token_ids; the actual output is "\u5f00\u5fc3", which is 14 token_ids. The expected regex_str is '\{[ ]?"心情"[ ]?:[ ]?("开心"|"难过"|"普通")[ ]?\}'. The actual regex_str is '\{[ ]?"心情"[ ]?:[ ]?("\\u5f00\\u5fc3"|"\\u96be\\u8fc7"|"\\u666e\\u901a")[ ]?\}'.
Although after format_sequence the output appears to become the correct characters again, this behavior is definitely not correct, for two reasons:
- The token length increases from 2 to 14, so generation takes longer.
- LLM models are not trained to sample strings like "\\u5f00\\u5fc3"; such output is guaranteed to be out of distribution, resulting in more hallucinations.
Quick fix: https://github.com/outlines-dev/outlines/blob/5e8f7709e3cecd02943120ed01420f00159cedbc/outlines/fsm/json_schema.py#L275 — replacing this line with choices.append(re.escape(json.dumps(choice, ensure_ascii=False))) fixes the problem, but I don't know whether it causes any other problems.
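The effect of the proposed `ensure_ascii=False` change can be checked in isolation. This is a minimal sketch of the escaping behavior, not the actual Outlines code path:

```python
import json
import re

choices = ["开心", "难过", "普通"]

# Default behavior: json.dumps escapes non-ASCII characters to \uXXXX sequences,
# so the regex alternatives only match the escaped spelling
default_alts = [re.escape(json.dumps(c)) for c in choices]
print(default_alts[0])  # contains literal \u5f00\u5fc3 escapes

# With ensure_ascii=False the original characters are preserved, so the
# regex alternatives match the raw UTF-8 strings the model would emit
fixed_alts = [re.escape(json.dumps(c, ensure_ascii=False)) for c in choices]
print(fixed_alts[0])  # "开心"
```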
Steps/code to reproduce the bug:
import json
from enum import Enum

import outlines
from pydantic import BaseModel

class Emotion(str, Enum):
    happy = "开心"
    upset = "难过"
    normal = "普通"

class PersonInfo(BaseModel):
    心情: Emotion

model = outlines.models.transformers("/root/Phi-3-mini-128k-instruct", device="cuda:0")
query = "How do you feel today?"
print(query)

generator = outlines.generate.json(model, json.dumps(PersonInfo.model_json_schema(), ensure_ascii=False))
ret = generator(query)
print(ret)
Expected result:
'\\{[ ]?"心情"[ ]?:[ ]?("开心"|"难过"|"普通")[ ]?\\}'
['{ "心情" :"普通" }']
Error message:
No response
Outlines/Python version information:
Version information
Context for the issue:
Reduces inference speed and LLM output quality.
Thanks for the well documented issue!
This appears to be an issue with our enum handling in json_schema.py, specifically the call to json.dumps:
https://github.com/outlines-dev/outlines/blob/60e89f5706e3d0f9837e271e04a39fb6e81d92df/outlines/fsm/json_schema.py#L275
>>> PersonInfo.model_json_schema()
{'$defs': {'Emotion': {'enum': ['开心', '难过', '普通'], 'title': 'Emotion', 'type': 'string'}}, 'properties': {'心情': {'$ref': '#/$defs/Emotion'}}, 'required': ['心情'], 'title': 'PersonInfo', 'type': 'object'}
>>> s = '\{[ ]?"心情"[ ]?:[ ]?("开心"|"难过"|"普通")[ ]?\}'
>>> json.dumps(s)
'"\\\\{[ ]?\\"\\u5fc3\\u60c5\\"[ ]?:[ ]?(\\"\\u5f00\\u5fc3\\"|\\"\\u96be\\u8fc7\\"|\\"\\u666e\\u901a\\")[ ]?\\\\}"'
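The character-count inflation alone shows why this hurts: the default-escaped form of a two-character enum value is seven times longer, a rough proxy for the token inflation reported above:

```python
import json

choice = "开心"
ascii_form = json.dumps(choice)                    # '"\u5f00\u5fc3"' with literal backslashes
raw_form = json.dumps(choice, ensure_ascii=False)  # '"开心"'

print(len(ascii_form), len(raw_form))  # 14 4
```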
Thank you very much for replying so quickly. I was wondering if you have any plans to address it soon?
Hi @duming! This has been fixed in Outlines v1. Here's your example with the new syntax and the output I get:
from enum import Enum
from pydantic import BaseModel
from outlines import from_transformers
from outlines.types.dsl import JsonSchema
import transformers
class Emotion(str, Enum):
    happy = "开心"
    upset = "难过"
    normal = "普通"

class PersonInfo(BaseModel):
    心情: Emotion

model = from_transformers(
    transformers.AutoModelForCausalLM.from_pretrained("facebook/opt-125m"),
    transformers.AutoTokenizer.from_pretrained("facebook/opt-125m"),
)

# Just to inspect the output schema, not needed to generate the output
output_schema = JsonSchema(PersonInfo).schema
print(output_schema)  # {"$defs": {"Emotion": {"enum": ["\u5f00\u5fc3", "\u96be\u8fc7", "\u666e\u901a"], "title": "Emotion", "type": "string"}}, "properties": {"\u5fc3\u60c5": {"$ref": "#/$defs/Emotion"}}, "required": ["\u5fc3\u60c5"], "title": "PersonInfo", "type": "object"}

ret = model("How do you feel today?", PersonInfo, max_new_tokens=100)
print(ret)  # {"心情" :"开心" }
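Note that the \uXXXX escapes in the printed schema are just JSON's ASCII spelling of the same values; they decode to the original characters, which is consistent with the raw-character output above. A quick check:

```python
import json

# A \uXXXX escape and the raw character denote the same JSON string value
assert json.loads('"\\u5f00\\u5fc3"') == json.loads('"开心"') == "开心"
print(json.loads('"\\u5f00\\u5fc3"'))  # 开心
```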
Feel free to reopen the issue if you still run into problems