Offline tokenization produces empty token_strings
When I run the script on this doc: https://docs.cohere.com/reference/tokenize
response = co.tokenize(text="tokenize me! :D", model="command")
I get:
tokens=[10002, 2261, 2012, 8, 2792, 43] token_strings=[] meta=None
where token_strings is an empty array, even though the docs suggest it should be non-empty. However, if I run:
response = co.tokenize(text="tokenize me! :D", model="command", offline=False)
I get the token_strings as expected:
tokens=[10002, 2261, 2012, 8, 2792, 43] token_strings=['token', 'ize', ' me', '!', ' :', 'D'] meta=ApiMeta(api_version=ApiMetaApiVersion(version='1', is_deprecated=None, is_experimental=None), billed_units=None, tokens=None, warnings=None)
It would be nice if token_strings could be supported for offline tokenization, so that the online and offline behavior is identical. I'll attach a pull request for how this could be done.
Hi, thanks for catching this discrepancy in the documentation. We've updated https://docs.cohere.com/docs/tokens-and-tokenizers#tokenization-in-python-sdk and the release note https://docs.cohere.com/changelog/python-sdk-v520-release.
Do you use the token_strings? I wonder if it would be acceptable to remove them from the network call to achieve identical behaviour.
Yes, removing token_strings from the network call would also make the behavior more uniform.
I have a use case that uses token_strings, but I can work around this issue: I can recover the token strings by using the Hugging Face tokenizers library directly with the downloaded tokenizer.json files.
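For illustration, a minimal sketch of that workaround: mapping token ids back to their surface strings with the Hugging Face tokenizers library. In practice you would load the downloaded file with Tokenizer.from_file("tokenizer.json") (the path is an assumption); here a tiny BPE tokenizer is built in-memory so the sketch is self-contained, and the vocab below is invented for the example, not Cohere's actual vocabulary.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Stand-in for the tokenizer you'd load from the model's
# tokenizer.json; built in-memory here so the sketch runs as-is.
tokenizer = Tokenizer(
    BPE(vocab={"token": 0, "ize": 1, " me": 2, "!": 3}, merges=[])
)

def token_strings_for(ids):
    # Map each token id back to its surface string, mirroring the
    # token_strings field of the online tokenize response.
    return [tokenizer.id_to_token(i) for i in ids]

print(token_strings_for([0, 1, 2, 3]))  # ['token', 'ize', ' me', '!']
```

The same id_to_token loop could in principle populate token_strings for offline tokenization inside the SDK, since the local tokenizer already holds the id-to-string mapping.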
Another alternative would be to add a parameter that controls whether token_strings are returned (in both the library and the server API).