
Unify Tokenizer Behavior and Ensure Sane Interfaces

Open lapp0 opened this issue 1 year ago • 0 comments

What behavior of the library made you think about the improvement?

@brandonwillard and I were looking into the LlamaCppTokenizer and noticed a number of issues:

  • It's not made obvious that __getstate__ is used to serialize for hashing.
  • LlamaCppTokenizer and TransformerTokenizer are subclasses of outlines' Tokenizer, but the vLLM tokenizer is not.
  • LlamaCppTokenizer.__init__ doesn't load special_tokens
  • vLLM and transformers tokenizers use adapt_tokenizer, but llamacpp doesn't.
  • Tokenizers are intended to be immutable, but that isn't programmatically guaranteed.
  • __hash__ and _stablehash(serialized) are recomputed on every call rather than having their hash values cached.
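
The first and last points above can be illustrated with a minimal sketch (the class and its attributes here are hypothetical simplifications, not the actual outlines implementation): __getstate__ doubles, non-obviously, as the serialization hook for hashing, and the hash is recomputed on every call.

```python
import hashlib
import pickle


class SketchTokenizer:
    """Simplified stand-in for a tokenizer whose __getstate__ output
    is serialized for hashing, as described in the issues above."""

    def __init__(self, vocabulary, special_tokens):
        self.vocabulary = vocabulary
        self.special_tokens = set(special_tokens)

    def __getstate__(self):
        # Used both for pickling and, non-obviously, for hashing.
        # Sets are sorted so serialization is deterministic.
        return {
            "vocabulary": self.vocabulary,
            "special_tokens": sorted(self.special_tokens),
        }

    def __hash__(self):
        # Recomputed from scratch on every call -- the caching
        # issue listed above.
        serialized = pickle.dumps(self.__getstate__())
        return int(hashlib.sha256(serialized).hexdigest(), 16) % (2**61)
```

Two tokenizers built from the same vocabulary and special tokens hash identically, but each `hash()` call pays the full serialization cost again.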

How would you like it to behave?

A lot of minor changes here. Please let me know if I've missed or accidentally excluded anything.

  • __getstate__ is a fallback for outlines.caching, and by default we implement _stablehash
  • vLLM becomes an outlines Tokenizer and uses the standard interfaces.
  • Good parameterized tests for all three tokenizers
  • outlines Tokenizer mutation is disabled
  • adapt_tokenizer is removed. All models pass themselves to their respective Tokenizer to be constructed.
  • _stablehash and __hash__ are only calculated once.
  • The llamacpp tokenizer should have identical "batch decoding" behavior to the other tokenizers. link

Some of the work to fix this can be resurrected from https://github.com/outlines-dev/outlines/pull/676

Status

On hold until ExLlamaV2 integration is complete (https://github.com/outlines-dev/outlines/issues/807)

lapp0 avatar Jun 01 '24 00:06 lapp0