
Update create_states_mapping function to include tokenizer parameter

br3no opened this pull request 1 year ago • 7 comments

Fixes #872

br3no avatar May 07 '24 11:05 br3no

The class inherits `Hashable`: https://github.com/outlines-dev/outlines/blob/4f8433d8d6633b0780c3a6c27981f9adffbe49f5/outlines/models/tokenizer.py#L7

But yeah, a test would be good. Is there documentation for contributors somewhere that could help me set up the dev env?

br3no avatar May 07 '24 18:05 br3no

> The class inherits `Hashable`: https://github.com/outlines-dev/outlines/blob/4f8433d8d6633b0780c3a6c27981f9adffbe49f5/outlines/models/tokenizer.py#L7
>
> But yeah, a test would be good.

Yeah, we have some tests for in-session hash consistency (e.g. here), but I recall some issues with cross-session/serialization consistency. That might not be the case anymore, though.

> Is there documentation for contributors somewhere that could help me set up the dev env?

https://outlines-dev.github.io/outlines/community/contribute/

brandonwillard avatar May 07 '24 18:05 brandonwillard
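For context, the linked class is a `typing.Protocol` whose `Hashable` base requires implementations to define `__hash__`. A rough, abridged sketch of its shape (the member names below are approximate, not a verbatim copy of the file):

```python
from abc import abstractmethod
from typing import Dict, Hashable, List, Protocol, Set


class Tokenizer(Hashable, Protocol):
    """Abridged sketch of the linked interface; member names are
    approximate, but the Hashable base class is what matters here."""

    eos_token: str
    eos_token_id: int
    pad_token_id: int
    vocabulary: Dict[str, int]
    special_tokens: Set[str]

    @abstractmethod
    def encode(self, prompt: str):
        ...

    @abstractmethod
    def decode(self, token_ids: List[int]) -> List[str]:
        ...
```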

Passing the tokenizer directly as the cache key will not work, since each instance of the same tokenizer object hashes to a different value, so there will always be a cache miss. I didn't see this PR earlier, so I raised a PR with my local fix in #876, which caches based on the tokenizer name or path string.

ekagra-ranjan avatar May 07 '24 23:05 ekagra-ranjan
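A minimal sketch of that failure mode, using a hypothetical stand-in class (no outlines imports):

```python
class StandInTokenizer:
    """Stand-in for a tokenizer that does not override __hash__;
    Python then falls back to the id()-based default."""

    def __init__(self, name_or_path: str):
        self.name_or_path = name_or_path


a = StandInTokenizer("gpt2")
b = StandInTokenizer("gpt2")  # the "same" tokenizer, but a new instance

print(hash(a) == hash(b))  # False: identity-based hashes differ per instance
print(hash(a.name_or_path) == hash(b.name_or_path))  # True: #876 keys on the string instead
```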

I've changed the PR to explicitly set the cache key computation using the `key_function` parameter of the `@cache()` decorator. This is in line with the discussion in #876.

@brandonwillard what do you think?

br3no avatar May 10 '24 10:05 br3no
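To illustrate the approach: the `cache` below is a toy in-memory stand-in for outlines' persistent `@cache()` decorator, and `tokenizer_key` is a hypothetical choice of key fields, not the PR's exact code.

```python
import functools


def cache(key_function=None):
    """Toy in-memory stand-in for outlines' @cache() decorator
    (the real one persists results to disk)."""

    def decorator(fn):
        store = {}

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Derive the cache key from the arguments, via key_function if given.
            if key_function is not None:
                key = key_function(*args, **kwargs)
            else:
                key = (args, tuple(sorted(kwargs.items())))
            if key not in store:
                store[key] = fn(*args, **kwargs)
            return store[key]

        return wrapper

    return decorator


def tokenizer_key(regex_string, tokenizer):
    """Hypothetical key: stable, content-bearing tokenizer fields
    instead of the instance itself."""
    return (
        regex_string,
        tokenizer.eos_token_id,
        tuple(sorted(tokenizer.vocabulary.items())),
    )


@cache(key_function=tokenizer_key)
def create_states_mapping(regex_string, tokenizer):
    ...  # build the regex-guided states mapping for this tokenizer
```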

> I've changed the PR to explicitly set the cache key computation using the `key_function` parameter of the `@cache()` decorator. This is in line with the discussion in #876.
>
> @brandonwillard what do you think?

We need tests that confirm that the hashing function works as intended.

brandonwillard avatar May 10 '24 12:05 brandonwillard

@brandonwillard I have added unit tests for the cache decorator when used in conjunction with the `key_function`. Let me know if this is sufficient for you.

br3no avatar May 10 '24 15:05 br3no
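For illustration, a test along these lines (reusing the toy `cache` from the sketch above; the PR's actual tests may differ) confirms that the `key_function` alone determines cache hits:

```python
def test_key_function_controls_cache_hits():
    calls = []

    # Collapse case into a single key so "Foo" and "FOO" share one entry.
    @cache(key_function=lambda s: s.lower())
    def expensive(s):
        calls.append(s)
        return len(s)

    assert expensive("Foo") == 3
    assert expensive("FOO") == 3  # different argument, same key -> cache hit
    assert calls == ["Foo"]       # the wrapped function ran exactly once
```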

> We need to confirm that this choice of key will work between sessions and across different tokenizer types (at the very least the transformers types). Since you're using all the members/fields of the generic `Tokenizer` interface, it's probably fine;

The `key_function` returns the fields that, together, "define" a particular tokenizer (these fields are all used in later code, so we can safely assume it is okay to use them for this purpose). Assuming the hash is computed correctly, I believe we can also safely assume that the caching will work, since we verified in a unit test that the `key_function` is used the way it should be.

> if not, we need to update the interface and/or force subclasses to implement this key function themselves.

I agree this would be the best solution, but it would require a large refactor. The different integrations work with tokenizers in different ways. The only implementation of the `Tokenizer` protocol class is https://github.com/outlines-dev/outlines/blob/97ec37d9038750101152582e5df3d7315b2759b5/outlines/models/transformers.py#L57. For vLLM, the tokenizer is patched with the right methods and members; in that case we would also need to patch the hashing function into the tokenizer, which I don't think we should do.

> Speaking of which, we should first consider/try updating the `__hash__` and `__eq__` methods of `Tokenizer` to follow this key.

The situation described above is also the reason why this is, unfortunately, not a solution at the moment.

I haven't looked into the matter in depth, but I believe there are refactoring opportunities to consolidate the usage of tokenizers, in line with what is done in the transformers integration. If all integrations provided implementations of the `Tokenizer` protocol, we could make the hash computation much simpler and easier to test.
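For concreteness, a content-based `__hash__`/`__eq__` along those lines might look like this. This is a sketch only (the field choice is illustrative), and as noted above, the hard part is getting every integration to implement it:

```python
import hashlib
import json


class HashableTokenizerMixin:
    """Sketch: derive identity from the same fields the cache key uses,
    so equal-content tokenizers compare and hash identically."""

    def _key(self):
        # Assumes the host class provides these Tokenizer-protocol fields.
        return (
            self.eos_token_id,
            tuple(sorted(self.vocabulary.items())),
        )

    def __eq__(self, other):
        if not isinstance(other, HashableTokenizerMixin):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Built-in hash() of strings is salted per process, so a
        # *persistent* cache needs a stable digest instead.
        digest = hashlib.sha256(json.dumps(self._key()).encode()).hexdigest()
        return int(digest[:16], 16)
```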


I opened the original issue because I realized that the cache uses only the regex as the key, which leads to errors because different tokenizers end up being served the same cached state machine. Since the cache is persistent, I believe many users will run into this issue. The symptoms are hard to diagnose: the LLM just generates gibberish.

I think the right thing to do now would be to merge this PR to make sure people don't run into this problem. This is not a new feature; it's a bug fix. We can then open a new issue to address the refactoring needs described above and perhaps clean up the change introduced in this PR.

br3no avatar May 11 '24 08:05 br3no

@brandonwillard great work on #911! If I read it correctly, there is no longer a `key_function` argument in the cache decorator, right? So the tests here won't make sense once #911 is merged.

I don't mind closing this PR; I'd just like to have the behavior fixed. #911 seems like a better increment than this PR.

br3no avatar May 23 '24 08:05 br3no