outlines icon indicating copy to clipboard operation
outlines copied to clipboard

Tokenizer Decode Integrity Testing

Open lapp0 opened this issue 1 year ago • 0 comments

What behavior of the library made you think about the improvement?

For structured generation, creating a "state to legal token set" index involves determining the string each token_id will decode to.

This is handled in convert_token_to_string where tokenizer.convert_tokens_to_string is called, and then SPIECE_UNDERLINE is stripped.

While this approach works in most cases, it doesn't handle all edge cases, which can lead to errors such as

  • https://github.com/dottxt-ai/outlines/issues/1178 (likely related)
  • https://github.com/dottxt-ai/outlines/issues/1032 (potentially related)

How would you like it to behave?

Create tests to ensure OutlinesTokenizer.decode(token_id_seq) produces results consistent with original_tokenizer.decode(token_id_seq).

Note: The OutlinesTokenizer will be introduced as part of the tokenizer unification effort, tracked in https://github.com/dottxt-ai/outlines/issues/936

lapp0 avatar Sep 28 '24 00:09 lapp0