outlines
Enum type with non-ASCII characters not working properly
Describe the issue as clearly as possible:
When using JSON-structured generation, enum types with non-ASCII characters do not work properly. Non-ASCII characters (such as Chinese) are forcibly encoded into ASCII escape sequences. This behavior leads to much slower generation and much worse output quality.
The example code in the section below won't raise an error; it's just an example for debugging. It's clearer to inspect the direct output of the LLM, for example at line #225 in api.py.
In this example, the expected output is '开心', which is 2 token_ids; the actual output is "\u5f00\u5fc3", which is 14 token_ids. The expected regex_str is '\{[ ]?"心情"[ ]?:[ ]?("开心"|"难过"|"普通")[ ]?\}'. The actual regex_str is '\{[ ]?"心情"[ ]?:[ ]?("\\u5f00\\u5fc3"|"\\u96be\\u8fc7"|"\\u666e\\u901a")[ ]?\}'.
Although after format_sequence the output appears to become the correct characters again, this behavior is definitely not correct, for two reasons:
- The token length increases from 2 to 14, so generation takes longer.
- LLM models are not trained to sample strings like "\\u5f00\\u5fc3"; such output is guaranteed to be out of distribution, resulting in more hallucinations.
Quick fix: https://github.com/outlines-dev/outlines/blob/5e8f7709e3cecd02943120ed01420f00159cedbc/outlines/fsm/json_schema.py#L275 — replacing this line with choices.append(re.escape(json.dumps(choice, ensure_ascii=False))) fixes the problem, but I don't know whether it causes any other problems.
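The effect of the proposed `ensure_ascii=False` change can be checked in isolation. This is a minimal sketch of the escaping behavior, not the actual Outlines code path:

```python
import json
import re

choices = ["开心", "难过", "普通"]

# Default behavior: json.dumps escapes non-ASCII characters to \uXXXX sequences,
# so the regex alternatives only match the escaped spelling
default_alts = [re.escape(json.dumps(c)) for c in choices]
print(default_alts[0])  # contains literal \u5f00\u5fc3 escapes

# With ensure_ascii=False the original characters are preserved, so the
# regex alternatives match the raw UTF-8 strings the model would emit
fixed_alts = [re.escape(json.dumps(c, ensure_ascii=False)) for c in choices]
print(fixed_alts[0])  # "开心"
```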
Steps/code to reproduce the bug:
import json
from enum import Enum

import outlines
from pydantic import BaseModel

class Emotion(str, Enum):
    happy = "开心"
    upset = "难过"
    normal = "普通"

class PersonInfo(BaseModel):
    心情: Emotion

model = outlines.models.transformers("/root/Phi-3-mini-128k-instruct", device="cuda:0")
query = "How do you feel today?"
print(query)

generator = outlines.generate.json(model, json.dumps(PersonInfo.model_json_schema(), ensure_ascii=False))
ret = generator(query)
print(ret)
Expected result:
'\\{[ ]?"心情"[ ]?:[ ]?("开心"|"难过"|"普通")[ ]?\\}'
['{ "心情" :"普通" }']
Error message:
No response
Outlines/Python version information:
Version information
Context for the issue:
Reduces inference speed and LLM output quality.
Thanks for the well documented issue!
This appears to be an issue with our enum handling in json_schema.py, specifically the call to json.dumps:
https://github.com/outlines-dev/outlines/blob/60e89f5706e3d0f9837e271e04a39fb6e81d92df/outlines/fsm/json_schema.py#L275
>>> PersonInfo.model_json_schema()
{'$defs': {'Emotion': {'enum': ['开心', '难过', '普通'], 'title': 'Emotion', 'type': 'string'}}, 'properties': {'心情': {'$ref': '#/$defs/Emotion'}}, 'required': ['心情'], 'title': 'PersonInfo', 'type': 'object'}
>>> s = '\{[ ]?"心情"[ ]?:[ ]?("开心"|"难过"|"普通")[ ]?\}'
>>> json.dumps(s)
'"\\\\{[ ]?\\"\\u5fc3\\u60c5\\"[ ]?:[ ]?(\\"\\u5f00\\u5fc3\\"|\\"\\u96be\\u8fc7\\"|\\"\\u666e\\u901a\\")[ ]?\\\\}"'
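The character-count inflation alone shows why this hurts: the default-escaped form of a two-character enum value is seven times longer, a rough proxy for the token inflation reported above:

```python
import json

choice = "开心"
ascii_form = json.dumps(choice)                    # '"\u5f00\u5fc3"' with literal backslashes
raw_form = json.dumps(choice, ensure_ascii=False)  # '"开心"'

print(len(ascii_form), len(raw_form))  # 14 4
```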
Thank you very much for replying so quickly. I was wondering if you have any plans to address it soon?
Hi @duming! This has been fixed in Outlines v1. Here's your example with the new syntax and the output I get:
from enum import Enum
from pydantic import BaseModel
from outlines import from_transformers
from outlines.types.dsl import JsonSchema
import transformers
class Emotion(str, Enum):
    happy = "开心"
    upset = "难过"
    normal = "普通"

class PersonInfo(BaseModel):
    心情: Emotion

model = from_transformers(
    transformers.AutoModelForCausalLM.from_pretrained("facebook/opt-125m"),
    transformers.AutoTokenizer.from_pretrained("facebook/opt-125m"),
)

# Just to inspect the output schema, not needed to generate the output
output_schema = JsonSchema(PersonInfo).schema
print(output_schema)  # {"$defs": {"Emotion": {"enum": ["\u5f00\u5fc3", "\u96be\u8fc7", "\u666e\u901a"], "title": "Emotion", "type": "string"}}, "properties": {"\u5fc3\u60c5": {"$ref": "#/$defs/Emotion"}}, "required": ["\u5fc3\u60c5"], "title": "PersonInfo", "type": "object"}

ret = model("How do you feel today?", PersonInfo, max_new_tokens=100)
print(ret)  # {"心情" :"开心" }
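Note that the \uXXXX escapes in the printed schema are just JSON's ASCII spelling of the same values; they decode to the original characters, which is consistent with the raw-character output above. A quick check:

```python
import json

# A \uXXXX escape and the raw character denote the same JSON string value
assert json.loads('"\\u5f00\\u5fc3"') == json.loads('"开心"') == "开心"
print(json.loads('"\\u5f00\\u5fc3"'))  # 开心
```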
Feel free to reopen the issue if you still run into problems