Tokenizer Decode Integrity Testing
What behavior of the library made you think about the improvement?
For structured generation, creating a "state to legal token set" index involves determining the string each token_id will decode to.
This is handled in convert_token_to_string, which calls tokenizer.convert_tokens_to_string and then strips SPIECE_UNDERLINE. This approach works in most cases, but it doesn't handle every edge case, which can lead to errors such as:
- https://github.com/dottxt-ai/outlines/issues/1178 (likely related)
- https://github.com/dottxt-ai/outlines/issues/1032 (potentially related)
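A minimal sketch of the failure mode. The SPIECE_UNDERLINE constant is the real SentencePiece word-boundary marker, but the conversion function below is a simplified stand-in, not the actual outlines implementation:

```python
SPIECE_UNDERLINE = "\u2581"  # SentencePiece word-boundary marker '▁'

def naive_convert_token_to_string(token: str) -> str:
    # Simplified stand-in for a per-token conversion:
    # map the word-boundary marker to a leading space.
    return token.replace(SPIECE_UNDERLINE, " ")

# Works for ordinary word-piece tokens:
assert naive_convert_token_to_string("\u2581hello") == " hello"

# But byte-fallback tokens such as "<0x0A>" decode to "\n" in the real
# tokenizer, while the naive per-token conversion leaves them untouched:
assert naive_convert_token_to_string("<0x0A>") == "<0x0A>"  # not "\n"
```

Byte-fallback tokens are one example; leading-space handling at the start of a sequence is another place where per-token conversion and full-sequence decode can diverge.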
How would you like it to behave?
Create tests to ensure OutlinesTokenizer.decode(token_id_seq) produces results consistent with original_tokenizer.decode(token_id_seq).
Note: The OutlinesTokenizer will be introduced as part of the tokenizer unification effort, tracked in https://github.com/dottxt-ai/outlines/issues/936
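One way such a consistency test could be shaped, as a hedged sketch: the helper name and signature below are assumptions, and the real test would pass OutlinesTokenizer.decode and the original tokenizer's decode instead of the toy decoders shown here.

```python
import random

def find_decode_mismatches(decode_a, decode_b, vocab_size,
                           n_trials=200, max_len=8, seed=0):
    """Sample random token-id sequences and collect every sequence on
    which the two decode functions disagree (hypothetical helper)."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(n_trials):
        seq = [rng.randrange(vocab_size)
               for _ in range(rng.randint(1, max_len))]
        a, b = decode_a(seq), decode_b(seq)
        if a != b:
            mismatches.append((seq, a, b))
    return mismatches

# Toy usage: two decoders over a 3-token vocabulary that disagree on
# how token id 0 (a newline byte token) is rendered.
vocab = {0: "\n", 1: "a", 2: "b"}
good = lambda seq: "".join(vocab[i] for i in seq)
buggy = lambda seq: "".join("<0x0A>" if i == 0 else vocab[i] for i in seq)

assert find_decode_mismatches(good, good, vocab_size=3) == []
assert find_decode_mismatches(good, buggy, vocab_size=3) != []
```

Random sampling is only a smoke test; a fuller version would also iterate over every single-token sequence in the vocabulary, since that is where per-token conversion bugs surface directly.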