Tokenizer Decode Integrity Testing
What behavior of the library made you think about the improvement?
For structured generation, creating a "state to legal token set" index involves determining the string each token_id will decode to.
This is handled in convert_token_to_string, which calls tokenizer.convert_tokens_to_string and then strips SPIECE_UNDERLINE. This approach works in most cases, but it doesn't handle every edge case, which can lead to errors such as:
- https://github.com/dottxt-ai/outlines/issues/1178 (likely related)
- https://github.com/dottxt-ai/outlines/issues/1032 (potentially related)
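A minimal sketch of the failure mode. The SPIECE_UNDERLINE constant is the real SentencePiece word-boundary marker, but the conversion function below is a simplified stand-in, not the actual outlines implementation:

```python
SPIECE_UNDERLINE = "\u2581"  # SentencePiece word-boundary marker '▁'

def naive_convert_token_to_string(token: str) -> str:
    # Simplified stand-in for a per-token conversion:
    # map the word-boundary marker to a leading space.
    return token.replace(SPIECE_UNDERLINE, " ")

# Works for ordinary word-piece tokens:
assert naive_convert_token_to_string("\u2581hello") == " hello"

# But byte-fallback tokens such as "<0x0A>" decode to "\n" in the real
# tokenizer, while the naive per-token conversion leaves them untouched:
assert naive_convert_token_to_string("<0x0A>") == "<0x0A>"  # not "\n"
```

Byte-fallback tokens are one example; leading-space handling at the start of a sequence is another place where per-token conversion and full-sequence decode can diverge.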
How would you like it to behave?
Create tests to ensure OutlinesTokenizer.decode(token_id_seq) produces results consistent with original_tokenizer.decode(token_id_seq).
Note: The OutlinesTokenizer will be introduced as part of the tokenizer unification effort, tracked in https://github.com/dottxt-ai/outlines/issues/936
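One way such a consistency test could be shaped, as a hedged sketch: the helper name and signature below are assumptions, and the real test would pass OutlinesTokenizer.decode and the original tokenizer's decode instead of the toy decoders shown here.

```python
import random

def find_decode_mismatches(decode_a, decode_b, vocab_size,
                           n_trials=200, max_len=8, seed=0):
    """Sample random token-id sequences and collect every sequence on
    which the two decode functions disagree (hypothetical helper)."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(n_trials):
        seq = [rng.randrange(vocab_size)
               for _ in range(rng.randint(1, max_len))]
        a, b = decode_a(seq), decode_b(seq)
        if a != b:
            mismatches.append((seq, a, b))
    return mismatches

# Toy usage: two decoders over a 3-token vocabulary that disagree on
# how token id 0 (a newline byte token) is rendered.
vocab = {0: "\n", 1: "a", 2: "b"}
good = lambda seq: "".join(vocab[i] for i in seq)
buggy = lambda seq: "".join("<0x0A>" if i == 0 else vocab[i] for i in seq)

assert find_decode_mismatches(good, good, vocab_size=3) == []
assert find_decode_mismatches(good, buggy, vocab_size=3) != []
```

Random sampling is only a smoke test; a fuller version would also iterate over every single-token sequence in the vocabulary, since that is where per-token conversion bugs surface directly.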