guidance
Error in program: No valid option generated in #select!
The bug
If a #select option is in Chinese, this error is raised. The problem is probably not limited to Chinese: any non-English text has a good chance of triggering it.
To Reproduce
import guidance
guidance.llm = guidance.llms.OpenAI("text-davinci-003")
# Prompt: "Which fruit is sweeter?" with the options "apple" (苹果) and "orange" (橘子)
chinese_program = guidance.Program("""哪个水果更甜一点?\n{{#select 'fruit'}}苹果{{or}}橘子{{/select}}""")
chinese_program()
System info (please complete the following information):
- OS (e.g. Ubuntu, Windows 11, Mac OS, etc.): Mac OS
- Guidance Version (guidance.__version__): 0.0.57
I found a way to fix the bug. The original code does not handle the case where a token is raw bytes rather than text. In that case, OpenAI adds a "bytes:" prefix to the token key, like this:
{
    "text_offset": [
        17
    ],
    "token_logprobs": [
        -0.054483417
    ],
    "tokens": [
        " "
    ],
    "top_logprobs": [
        {
            " ": -0.054483417,
            "bytes: \\xe6": -2.9369752
        }
    ]
}
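A likely reason a byte-level key shows up here: Chinese characters take three bytes each in UTF-8, so a tokenizer that works on bytes can emit a token that is only part of a character, and such a fragment cannot be shown as text. The sketch below uses only the standard library and assumes nothing about guidance itself; `is_valid_utf8` is a hypothetical helper added for illustration. It shows that `\xe6` is the first byte of the option 橘子, which is consistent with the `"bytes: \\xe6"` candidate in the response above.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Inspect the two options from the reproduction.
for option in ["苹果", "橘子"]:
    raw = option.encode("utf-8")
    # Print the option, its byte length, and its bytes as hex.
    print(option, len(raw), raw.hex(" "))

# The first byte of "橘子" on its own is an incomplete character,
# so it can only be reported in an escaped byte form.
first_byte_of_orange = "橘子".encode("utf-8")[:1]
print(first_byte_of_orange, is_valid_utf8(first_byte_of_orange))
```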
I fixed the bug with the following code:
# in _select.py
if "logprobs" in gen_obj:
    logprobs_result = gen_obj["logprobs"]
    # convert the logprobs keys from string back to token ids
    top_logprobs = {}
    for k,v in logprobs_result["top_logprobs"][0].items():
        if k.startswith('bytes:'):
            # byte-level token: strip the prefix and the literal "\x" escapes,
            # leaving only hex digits (plus an optional leading space)
            k = k.replace('bytes:', '')
            k = k.replace('\\x', '')
            k_bytes = bytes.fromhex(k)
            # a space after "bytes:" is treated as part of the token itself
            if k.startswith(' '):
                k_bytes = b' ' + k_bytes
            id = parser.program.llm._tokenizer.encode_single_token(k_bytes)
        else:
            # ordinary text token
            id = parser.program.llm.token_to_id(k)
        top_logprobs[id] = v
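The key-parsing step of the patch above can be pulled out and checked on its own, without calling the API. This is a minimal sketch, not guidance code: `decode_logprob_key` is a hypothetical helper name, and it follows the patch's assumption that a space right after `bytes:` belongs to the token. The final lookup of a token id from the bytes is left out because it depends on the tokenizer.

```python
def decode_logprob_key(key: str) -> bytes:
    """Turn a top_logprobs key into the raw bytes of its token.

    Hypothetical helper. Keys such as 'bytes: \\xe6' hold literal
    backslash-x escapes; any other key is treated as plain text.
    """
    prefix = "bytes:"
    if not key.startswith(prefix):
        return key.encode("utf-8")

    rest = key[len(prefix):]
    # Assumption taken from the patch: a leading space is part of the token.
    has_leading_space = rest.startswith(" ")
    # Drop the literal "\x" markers so only hex digits remain.
    hex_digits = rest.strip().replace("\\x", "")
    raw = bytes.fromhex(hex_digits)
    return b" " + raw if has_leading_space else raw


# The key from the response shown earlier in this issue.
print(decode_logprob_key("bytes: \\xe6"))
# A multi-byte key decodes back to a whole character.
print(decode_logprob_key("bytes:\\xe6\\xa9\\x98").decode("utf-8"))
```

Keeping the parsing in a pure function like this would also make it easy to cover with unit tests that include non-English strings.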
If possible, I would be happy to contribute a PR for this.
Thanks! Clearly we need some non-English unit tests. A PR would be much appreciated! (And if you do send in a PR, can you also make sure the patch works with the Transformers backend models?)
OK, I'll check the relevant subclasses of _llm.LLM and add non-English unit tests.