
Unify Tokenizer Behavior and Ensure Sane Interfaces

Open lapp0 opened this issue 1 year ago • 0 comments

What behavior of the library made you think about the improvement?

@brandonwillard and I were looking into the LlamaCppTokenizer and noticed a number of issues:

  • It's not made obvious that __getstate__ is used to serialize for hashing.
  • LlamaCppTokenizer and TransformerTokenizer are subclasses of outlines' Tokenizer, but the vLLM tokenizer is not.
  • LlamaCppTokenizer.__init__ doesn't load special_tokens
  • vLLM and transformers tokenizers use adapt_tokenizer, but llamacpp doesn't.
  • Tokenizers are intended to be immutable, but that isn't programmatically guaranteed.
  • __hash__ and _stablehash(serialized) are recomputed on every call rather than having their hash values cached.
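
The first and last points above can be illustrated with a minimal sketch (the class and its attributes here are hypothetical simplifications, not the actual outlines implementation): __getstate__ doubles, non-obviously, as the serialization hook for hashing, and the hash is recomputed on every call.

```python
import hashlib
import pickle


class SketchTokenizer:
    """Simplified stand-in for a tokenizer whose __getstate__ output
    is serialized for hashing, as described in the issues above."""

    def __init__(self, vocabulary, special_tokens):
        self.vocabulary = vocabulary
        self.special_tokens = set(special_tokens)

    def __getstate__(self):
        # Used both for pickling and, non-obviously, for hashing.
        # Sets are sorted so serialization is deterministic.
        return {
            "vocabulary": self.vocabulary,
            "special_tokens": sorted(self.special_tokens),
        }

    def __hash__(self):
        # Recomputed from scratch on every call -- the caching
        # issue listed above.
        serialized = pickle.dumps(self.__getstate__())
        return int(hashlib.sha256(serialized).hexdigest(), 16) % (2**61)
```

Two tokenizers built from the same vocabulary and special tokens hash identically, but each `hash()` call pays the full serialization cost again.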

How would you like it to behave?

A lot of minor changes here. Please let me know if I've missed or accidentally excluded anything.

  • __getstate__ is a fallback for outlines.caching, and by default we implement _stablehash
  • vLLM becomes an outlines Tokenizer and uses the standard interfaces.
  • Good parameterized tests for all three tokenizers
  • outlines Tokenizer mutation is disabled
  • adapt_tokenizer is removed. All models pass themselves to their respective Tokenizer to be constructed.
  • _stablehash and __hash__ are only calculated once.
  • The llamacpp tokenizer should have identical "batch decoding" behavior to the other tokenizers. link

Some of the work to fix this can be resurrected from https://github.com/outlines-dev/outlines/pull/676

Status

On hold until ExLlamaV2 integration is complete (https://github.com/outlines-dev/outlines/issues/807)

lapp0 avatar Jun 01 '24 00:06 lapp0