
Offline tokenization produces empty token_strings

Open yifanmai opened this issue 1 year ago • 3 comments

When I run the example script from this doc page: https://docs.cohere.com/reference/tokenize

response = co.tokenize(text="tokenize me! :D", model="command")

I get:

tokens=[10002, 2261, 2012, 8, 2792, 43] token_strings=[] meta=None

where token_strings is an empty array, even though the docs suggest that it should be non-empty. However, if I run:

response = co.tokenize(text="tokenize me! :D", model="command", offline=False)

I get the token_strings as expected:

tokens=[10002, 2261, 2012, 8, 2792, 43] token_strings=['token', 'ize', ' me', '!', ' :', 'D'] meta=ApiMeta(api_version=ApiMetaApiVersion(version='1', is_deprecated=None, is_experimental=None), billed_units=None, tokens=None, warnings=None)

It would be nice if token_strings could be supported for offline tokenization, so that the online and offline behavior is identical. I'll attach a pull request for how this could be done.

yifanmai avatar May 07 '24 22:05 yifanmai

Hi, thanks for catching the discrepancy in the documentation. We have updated https://docs.cohere.com/docs/tokens-and-tokenizers#tokenization-in-python-sdk and the release note https://docs.cohere.com/changelog/python-sdk-v520-release.

Do you use the token_strings? I wonder if it would be acceptable to remove them from the network call to achieve identical behaviour.

elaineg avatar May 08 '24 14:05 elaineg

Yes, removing token_strings from the network call would also make things more uniform.

I have a use case that uses token_strings. However, I can work around this issue by getting the token strings from the Hugging Face tokenizers library directly, using the downloaded tokenizer.json files.
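For anyone else who needs this, here is a minimal sketch of that workaround. It assumes you already have the model's tokenizer.json on disk (the path argument is a placeholder) and that decoding each token ID on its own reproduces the strings the API returns; I have not verified either assumption for every model. The function names `token_strings_from_ids` and `tokenize_offline` are mine, not part of cohere-python.

```python
from typing import Callable, List, Sequence, Tuple


def token_strings_from_ids(
    ids: Sequence[int], decode: Callable[[List[int]], str]
) -> List[str]:
    """Decode each token ID separately so every ID maps to exactly one string."""
    return [decode([token_id]) for token_id in ids]


def tokenize_offline(text: str, tokenizer_path: str) -> Tuple[List[int], List[str]]:
    """Return (tokens, token_strings) using a local tokenizer.json file.

    tokenizer_path is a placeholder for wherever the file was downloaded.
    """
    # Imported lazily so the helper above can be used without the dependency.
    from tokenizers import Tokenizer

    tokenizer = Tokenizer.from_file(tokenizer_path)
    # Assumption: special tokens should not be added, to mirror the IDs above.
    ids = tokenizer.encode(text, add_special_tokens=False).ids
    strings = token_strings_from_ids(
        ids, lambda chunk: tokenizer.decode(chunk, skip_special_tokens=False)
    )
    return ids, strings
```

Decoding one ID at a time (rather than reading the raw vocabulary entries) is meant to give human-readable pieces such as ' me' instead of the byte-level form stored in the vocabulary, though tokens that split a multi-byte character may not decode cleanly on their own.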

yifanmai avatar May 08 '24 21:05 yifanmai

Another alternative would be to add a parameter that controls whether token_strings are returned (in both the library and the server API).
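To make the idea concrete, here is a rough sketch of what such a switch could look like. Everything in it is hypothetical: the `return_token_strings` parameter, the `TokenizeResult` class, and the injected `encode`/`decode` callables are illustrations, not the current cohere-python API or server behavior.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TokenizeResult:
    """Hypothetical response shape with optional token strings."""
    tokens: List[int]
    token_strings: List[str] = field(default_factory=list)


def tokenize(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    return_token_strings: bool = False,  # hypothetical flag
) -> TokenizeResult:
    """Tokenize text, filling token_strings only when the caller asks for them."""
    tokens = encode(text)
    if not return_token_strings:
        # Same result regardless of backend: IDs only, empty token_strings.
        return TokenizeResult(tokens=tokens)
    # Decode each ID separately so the strings line up one-to-one with the IDs.
    return TokenizeResult(
        tokens=tokens,
        token_strings=[decode([token_id]) for token_id in tokens],
    )
```

With a flag like this, the online and offline paths could share one default (no strings) and one opt-in behavior (strings present), so results would not depend on which path was used.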

yifanmai avatar May 08 '24 21:05 yifanmai