
Enum type with Non-ASCII character not working properly

Open duming opened this issue 1 year ago • 2 comments

Describe the issue as clearly as possible:

When using JSON-structured generation, enum types with non-ASCII characters do not work properly. Non-ASCII characters (such as Chinese) are forcibly escaped to ASCII \uXXXX sequences. This behavior leads to much slower generation and much worse output quality.

The example code in the section below won't cause an error; it's just an example for debugging. It's clearer to check the direct output of the LLM, for example at line #225 in api.py.

In this example, the expected output is '开心', which corresponds to 2 token ids, but the actual output is "\u5f00\u5fc3", which corresponds to 14 token ids. The expected regex_str is '\{[ ]?"心情"[ ]?:[ ]?("开心"|"难过"|"普通")[ ]?\}', but the actual regex_str is '\{[ ]?"心情"[ ]?:[ ]?("\\u5f00\\u5fc3"|"\\u96be\\u8fc7"|"\\u666e\\u901a")[ ]?\}'.

Although after format_sequence the output appears to become the correct characters again, this behavior is still not correct, for two reasons:

  1. The token length increases from 2 to 14, so generation takes longer.
  2. LLM models are not trained to sample strings like "\\u5f00\\u5fc3"; such output is guaranteed to be out of distribution, resulting in more hallucinations.

Quick fix: replacing this line https://github.com/outlines-dev/outlines/blob/5e8f7709e3cecd02943120ed01420f00159cedbc/outlines/fsm/json_schema.py#L275 with choices.append(re.escape(json.dumps(choice, ensure_ascii=False))) fixes the problem, but I don't know whether it causes any other issues.
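A minimal standalone sketch of what that one-line change does (choice is an illustrative variable mirroring the loop variable in json_schema.py; requires Python 3.7+, where re.escape only escapes regex-special characters):

```python
import json
import re

choice = "开心"

# Current behavior: json.dumps escapes non-ASCII characters to \uXXXX,
# so the regex forces the model to generate the escape sequences.
print(re.escape(json.dumps(choice)))
# → "\\u5f00\\u5fc3"

# Proposed fix: ensure_ascii=False keeps the original characters,
# so the regex matches the two-token string directly.
print(re.escape(json.dumps(choice, ensure_ascii=False)))
# → "开心"
```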

Steps/code to reproduce the bug:

class Emotion(str, Enum):
    happy = "开心"
    upset = "难过"
    normal = "普通"


class PersonInfo(BaseModel):
    心情: Emotion

model = outlines.models.transformers("/root/Phi-3-mini-128k-instruct",device='cuda:0')
query = "How do you feel today?"
print(query)

generator = outlines.generate.json(model, json.dumps(PersonInfo.model_json_schema(), ensure_ascii=False))
ret = generator(query)
print(ret)

Expected result:

'\\{[ ]?"心情"[ ]?:[ ]?("开心"|"难过"|"普通")[ ]?\\}'
['{ "心情" :"普通" }']

Error message:

No response

Outlines/Python version information:

Version information

``` 0.0.46 Python 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] aiohappyeyeballs==2.3.5 aiohttp==3.10.2 aiosignal==1.3.1 annotated-types==0.7.0 anyio==4.4.0 async-timeout==4.0.3 attrs==24.2.0 certifi==2024.7.4 charset-normalizer==3.3.2 click==8.1.7 cloudpickle==3.0.0 cmake==3.30.2 datasets==2.20.0 dill==0.3.8 diskcache==5.6.3 distro==1.9.0 exceptiongroup==1.2.2 fastapi==0.112.0 filelock==3.15.4 frozenlist==1.4.1 fsspec==2024.5.0 h11==0.14.0 httpcore==1.0.5 httptools==0.6.1 httpx==0.27.0 huggingface-hub==0.24.5 idna==3.7 interegular==0.3.3 Jinja2==3.1.4 jiter==0.5.0 jsonschema==4.23.0 jsonschema-specifications==2023.12.1 lark==1.1.9 llvmlite==0.43.0 lm-format-enforcer==0.10.3 MarkupSafe==2.1.5 mpmath==1.3.0 msgpack==1.0.8 multidict==6.0.5 multiprocess==0.70.16 nest-asyncio==1.6.0 networkx==3.3 ninja==1.11.1.1 numba==0.60.0 numpy==1.26.4 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==9.1.0.70 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-ml-py==12.555.43 nvidia-nccl-cu12==2.20.5 nvidia-nvjitlink-cu12==12.6.20 nvidia-nvtx-cu12==12.1.105 openai==1.40.2 outlines==0.0.46 packaging==24.1 pandas==2.2.2 pillow==10.4.0 prometheus-fastapi-instrumentator==7.0.0 prometheus_client==0.20.0 protobuf==5.27.3 psutil==6.0.0 py-cpuinfo==9.0.0 pyairports==2.1.1 pyarrow==17.0.0 pyarrow-hotfix==0.6 pycountry==24.6.1 pydantic==2.8.2 pydantic_core==2.20.1 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 pytorch-fast-transformers==0.4.0 pytz==2024.1 PyYAML==6.0.2 pyzmq==26.1.0 ray==2.34.0 referencing==0.35.1 regex==2024.7.24 requests==2.32.3 rpds-py==0.20.0 safetensors==0.4.4 sentencepiece==0.2.0 six==1.16.0 sniffio==1.3.1 starlette==0.37.2 sympy==1.13.1 tiktoken==0.7.0 tokenizers==0.19.1 torch==2.4.0 torchvision==0.19.0 tqdm==4.66.5 transformers==4.44.0 triton==3.0.0 typing_extensions==4.12.2 
tzdata==2024.1 urllib3==2.2.2 uvicorn==0.30.5 uvloop==0.19.0 vllm==0.5.4 vllm-flash-attn==2.6.1 watchfiles==0.23.0 websockets==12.0 xformers==0.0.27.post2 xxhash==3.4.1 yarl==1.9.4 ```

Context for the issue:

Reduced inference speed and reduced LLM output quality.

duming avatar Aug 12 '24 05:08 duming

Thanks for the well documented issue!

This appears to be an issue with our enum handling in json_schema.py, specifically the call to json.dumps:

https://github.com/outlines-dev/outlines/blob/60e89f5706e3d0f9837e271e04a39fb6e81d92df/outlines/fsm/json_schema.py#L275

>>> PersonInfo.model_json_schema()
{'$defs': {'Emotion': {'enum': ['开心', '难过', '普通'], 'title': 'Emotion', 'type': 'string'}}, 'properties': {'心情': {'$ref': '#/$defs/Emotion'}}, 'required': ['心情'], 'title': 'PersonInfo', 'type': 'object'}
>>> s = '\{[ ]?"心情"[ ]?:[ ]?("开心"|"难过"|"普通")[ ]?\}'
>>> json.dumps(s)
'"\\\\{[ ]?\\"\\u5fc3\\u60c5\\"[ ]?:[ ]?(\\"\\u5f00\\u5fc3\\"|\\"\\u96be\\u8fc7\\"|\\"\\u666e\\u901a\\")[ ]?\\\\}"'
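For comparison, the same call with ensure_ascii=False keeps the non-ASCII characters intact (a quick check of the proposed fix; only backslashes and quotes get JSON-escaped):

```python
import json

s = '\\{[ ]?"心情"[ ]?:[ ]?("开心"|"难过"|"普通")[ ]?\\}'
print(json.dumps(s, ensure_ascii=False))
# → "\\{[ ]?\"心情\"[ ]?:[ ]?(\"开心\"|\"难过\"|\"普通\")[ ]?\\}"
```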

lapp0 avatar Aug 12 '24 19:08 lapp0

Thank you very much for replying so quickly. I was wondering if you have any plans to address it soon?

duming avatar Aug 13 '24 11:08 duming

Hi @duming! This has been fixed in Outlines v1. Here's your example with the new syntax and the output I get:

from enum import Enum
from pydantic import BaseModel

from outlines import from_transformers
from outlines.types.dsl import JsonSchema
import transformers


class Emotion(str, Enum):
    happy = "开心"
    upset = "难过"
    normal = "普通"

class PersonInfo(BaseModel):
    心情: Emotion


model = from_transformers(
    transformers.AutoModelForCausalLM.from_pretrained("facebook/opt-125m"),
    transformers.AutoTokenizer.from_pretrained("facebook/opt-125m")
)

# Just to inspect the output schema, not needed to generate the output
output_schema = JsonSchema(PersonInfo).schema
print(output_schema) # {"$defs": {"Emotion": {"enum": ["\u5f00\u5fc3", "\u96be\u8fc7", "\u666e\u901a"], "title": "Emotion", "type": "string"}}, "properties": {"\u5fc3\u60c5": {"$ref": "#/$defs/Emotion"}}, "required": ["\u5fc3\u60c5"], "title": "PersonInfo", "type": "object"}

ret = model("How do you feel today?", PersonInfo, max_new_tokens=100)
print(ret) # {"心情" :"开心" }

Feel free to reopen the issue if you still run into problems.

RobinPicard avatar Jun 23 '25 15:06 RobinPicard