
Option for use_fast tokenizer

Open psinger opened this issue 2 years ago • 7 comments

Feature request

Add an option to specify the use_fast flag for AutoTokenizer.

Motivation

Some models show slightly different behavior, or have buggy versions, of their slow or fast tokenizers.

It would be useful to be able to specify this flag when spinning up an endpoint, which would also make it possible to match how the model was trained.
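
For context, the flag already exists on the transformers side; what follows is a minimal sketch of what the requested option would pass through (the model id is a placeholder):

```python
from transformers import AutoTokenizer

model_id = "your-org/your-model"  # placeholder: any Hub model id

# use_fast=True returns the Rust-backed tokenizer when one is available;
# use_fast=False forces the slow (pure Python / SentencePiece) implementation.
slow_tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
fast_tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

print(type(slow_tokenizer).__name__, type(fast_tokenizer).__name__)
```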

psinger avatar Jun 19 '23 07:06 psinger

Do you have examples of such models?

OlivierDehaene avatar Jun 19 '23 07:06 OlivierDehaene

This one for instance: https://huggingface.co/openlm-research/open_llama_13b

Please note that it is advised to avoid using the Hugging Face fast tokenizer for now, as we’ve observed that the auto-converted fast tokenizer sometimes gives incorrect tokenizations.

In general it is good practice to match inference parameters with training parameters, if only to avoid any unexpected behavior.
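
A quick way to check whether the two variants actually diverge for a given model is to encode a few sample strings with each and compare the ids (a sketch; the sample strings are arbitrary):

```python
from transformers import AutoTokenizer

model_id = "openlm-research/open_llama_13b"
slow = AutoTokenizer.from_pretrained(model_id, use_fast=False)
fast = AutoTokenizer.from_pretrained(model_id, use_fast=True)

samples = ["Hello world", "naïve café, UTF-8 edge cases", "  leading whitespace"]
for text in samples:
    slow_ids = slow.encode(text)
    fast_ids = fast.encode(text)
    if slow_ids != fast_ids:
        print(f"mismatch on {text!r}:\n  slow: {slow_ids}\n  fast: {fast_ids}")
```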

psinger avatar Jun 19 '23 07:06 psinger

Please note that it is advised to avoid using the Hugging Face fast tokenizer for now, as we’ve observed that the auto-converted fast tokenizer sometimes gives incorrect tokenizations.

Very actionable issue. I was the one checking that everything worked properly, and I can tell you it's been tested. It would be nice to know of actual differences so we can fix them.

Having the Fast tokenizer here really helps since we can use it in the Rust router part to get better heuristics for the scheduler.

Narsil avatar Jun 19 '23 08:06 Narsil

@Narsil what exactly has been tested?

I believe you will still depend on the tokenization implementation of custom models, which can have issues with respect to fast vs. slow implementations of the tokenizer.

Is there any downside in giving the end-user the option of choosing a slow tokenizer?

psinger avatar Jun 19 '23 08:06 psinger

Is there any downside in giving the end-user the option of choosing a slow tokenizer?

Yes: the router cannot count the number of tokens in queries, which disables a good part of the scheduling logic (the part responsible for making sure you're not going OOM, etc.). We do fall back to running without it, but it's a pretty sizeable downside (just like giving up flash attention).
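
The router does this counting in Rust; as a rough Python analogue using the tokenizers library (a sketch only, not the actual router code), the point is that a fast tokenizer makes the input length cheap to compute server-side:

```python
from tokenizers import Tokenizer

# A fast (Rust-backed) tokenizer can be loaded directly; the model id here is a
# placeholder for illustration.
tokenizer = Tokenizer.from_pretrained("gpt2")

def prompt_length(prompt: str) -> int:
    # Knowing this number up front lets the scheduler budget memory per request
    # instead of discovering the length (and a possible OOM) mid-batch.
    return len(tokenizer.encode(prompt).ids)

print(prompt_length("How many tokens is this prompt?"))
```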

Narsil avatar Jun 19 '23 09:06 Narsil

This would still happen today if a model does not have a fast tokenizer implemented, though.
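
For what it's worth, whether a given model ends up on the fast path can be checked like this (sketch, placeholder model id):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-org/your-model")  # placeholder id
# is_fast is False when only a slow tokenizer is available, which puts the
# router in the same fallback situation a use_fast=False option would.
print(tok.is_fast)
```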

psinger avatar Jun 19 '23 09:06 psinger

@Narsil what exactly has been tested?

I ran with other datasets too; XNLI is the hardest for tokenization because of UTF-8 issues.

There are some differences, mostly when you use many options like adding/removing tokens on the fly. Nothing that should trigger for inference, though. (And the issue is more that transformers has a huge API surface, and keeping internal state management correct across all of it is hard.)
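
The kind of on-the-fly mutation referred to here looks like this (sketch; the added token and model id are arbitrary):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model for illustration
# Adding tokens at runtime mutates the tokenizer's internal state; this is
# where slow and fast implementations are most likely to drift apart.
tok.add_tokens(["<custom_tag>"])
print(tok.encode("before <custom_tag> after"))
```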

Narsil avatar Jun 19 '23 09:06 Narsil

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jul 19 '24 01:07 github-actions[bot]