Unify Tokenizer Behavior and Ensure Sane Interfaces
What behavior of the library made you think about the improvement?
@brandonwillard and I were looking into the LlamaCppTokenizer and noticed a number of issues:
- It's not made obvious that `__getstate__` is used to serialize the tokenizer for hashing.
- `LlamaCppTokenizer` and `TransformerTokenizer` are subclasses of outlines' `Tokenizer`, but the vLLM tokenizer is not.
- `LlamaCppTokenizer.__init__` doesn't load `special_tokens`.
- The vLLM and transformers tokenizers use `adapt_tokenizer`, but llamacpp doesn't.
- Tokenizers are intended to be immutable, but that isn't programmatically guaranteed.
- The `__hash__` and `_stablehash(serialized)` values are recalculated on every call rather than cached.
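As a rough illustration of the last point, the hash value can be computed lazily and stored on the instance so repeated calls are cheap. This is only a sketch; the class and attribute names are hypothetical and do not reflect outlines' actual implementation:

```python
class CachedHashTokenizer:
    """Illustrative sketch: compute the hash once and reuse it."""

    def __init__(self, vocabulary: dict):
        self._vocabulary = vocabulary
        self._hash = None  # computed lazily on first __hash__ call

    def __getstate__(self):
        # Serialized state used as the basis for a stable hash.
        return {"vocabulary": sorted(self._vocabulary.items())}

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(tuple(self.__getstate__()["vocabulary"]))
        return self._hash
```

With this pattern, `hash(tokenizer)` serializes and hashes the state only on the first call; subsequent calls return the cached value.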
How would you like it to behave?
A lot of minor changes here. Please let me know if I'm missing something or if I've accidentally excluded something.
- `__getstate__` is a fallback for `outlines.caching`, and by default we implement `_stablehash`.
- vLLM becomes an outlines `Tokenizer` and uses the standard interfaces.
- Good parameterized tests for all three tokenizers.
- outlines `Tokenizer` mutation is disabled.
- `adapt_tokenizer` is removed. All models pass themselves to their respective `Tokenizer` to be constructed.
- `_stablehash` and `__hash__` are only calculated once.
- The llamacpp tokenizer should have identical "batch decoding" behavior to the other tokenizers. link
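One way to disable mutation, sketched below, is to override `__setattr__` after construction. This is a hypothetical approach for illustration only, not outlines' actual `Tokenizer` code:

```python
class ImmutableTokenizer:
    """Illustrative sketch: reject attribute mutation after __init__."""

    def __init__(self, eos_token_id: int):
        # Bypass the guard while constructing the instance.
        object.__setattr__(self, "eos_token_id", eos_token_id)
        object.__setattr__(self, "_frozen", True)

    def __setattr__(self, name, value):
        if getattr(self, "_frozen", False):
            raise AttributeError(f"{type(self).__name__} is immutable")
        object.__setattr__(self, name, value)
```

Any assignment after `__init__` completes (e.g. `tokenizer.eos_token_id = 3`) raises `AttributeError`, which also helps keep the cached hash trustworthy.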
Some of the work to fix this can be resurrected from https://github.com/outlines-dev/outlines/pull/676
Status
On hold until ExLlamaV2 integration is complete (https://github.com/outlines-dev/outlines/issues/807)