
Option for use_fast tokenizer

Open psinger opened this issue 2 years ago • 7 comments

Feature request

Add an option to specify the use_fast flag for AutoTokenizer.

Motivation

Some models show slightly different behavior, or have buggy versions, of their slow or fast tokenizers.

It would be useful to be able to specify this flag when spinning up an endpoint, which would also make it possible to match how the model was trained.
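
For context, the flag already exists on the transformers side; what follows is a minimal sketch of what the requested option would pass through (the model id is a placeholder):

```python
from transformers import AutoTokenizer

model_id = "your-org/your-model"  # placeholder: any Hub model id

# use_fast=True returns the Rust-backed tokenizer when one is available;
# use_fast=False forces the slow (pure Python / SentencePiece) implementation.
slow_tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
fast_tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

print(type(slow_tokenizer).__name__, type(fast_tokenizer).__name__)
```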

psinger avatar Jun 19 '23 07:06 psinger

Do you have examples of such models?

OlivierDehaene avatar Jun 19 '23 07:06 OlivierDehaene

This one for instance: https://huggingface.co/openlm-research/open_llama_13b

Please note that it is advised to avoid using the Hugging Face fast tokenizer for now, as we’ve observed that the auto-converted fast tokenizer sometimes gives incorrect tokenizations.

In general it is good practice to match inference parameters with training parameters, if only to avoid any unexpected behavior.
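
A quick way to check whether the two variants actually diverge for a given model is to encode a few sample strings with each and compare the ids (a sketch; the sample strings are arbitrary):

```python
from transformers import AutoTokenizer

model_id = "openlm-research/open_llama_13b"
slow = AutoTokenizer.from_pretrained(model_id, use_fast=False)
fast = AutoTokenizer.from_pretrained(model_id, use_fast=True)

samples = ["Hello world", "naïve café, UTF-8 edge cases", "  leading whitespace"]
for text in samples:
    slow_ids = slow.encode(text)
    fast_ids = fast.encode(text)
    if slow_ids != fast_ids:
        print(f"mismatch on {text!r}:\n  slow: {slow_ids}\n  fast: {fast_ids}")
```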

psinger avatar Jun 19 '23 07:06 psinger

Please note that it is advised to avoid using the Hugging Face fast tokenizer for now, as we’ve observed that the auto-converted fast tokenizer sometimes gives incorrect tokenizations.

Very actionable issue. I was the one checking that everything worked properly, and I can tell you it's been tested. It would be nice to know of actual differences so we can fix them.

Having the Fast tokenizer here really helps since we can use it in the Rust router part to get better heuristics for the scheduler.

Narsil avatar Jun 19 '23 08:06 Narsil

@Narsil what exactly has been tested?

I believe you will still depend on the tokenization implementation of custom models, which can have issues with respect to fast vs. slow implementations of the tokenizer.

Is there any downside in giving the end-user the option of choosing a slow tokenizer?

psinger avatar Jun 19 '23 08:06 psinger

Is there any downside in giving the end-user the option of choosing a slow tokenizer?

Yes: the router cannot count the number of tokens in queries, which disables a good part of the scheduling logic (the part responsible for making sure you're not going OOM, etc.). We do fall back to running without it, but it's a pretty sizeable downside (just like giving up flash attention).
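
The router does this counting in Rust; as a rough Python analogue using the tokenizers library (a sketch only, not the actual router code), the point is that a fast tokenizer makes the input length cheap to compute server-side:

```python
from tokenizers import Tokenizer

# A fast (Rust-backed) tokenizer can be loaded directly; the model id here is a
# placeholder for illustration.
tokenizer = Tokenizer.from_pretrained("gpt2")

def prompt_length(prompt: str) -> int:
    # Knowing this number up front lets the scheduler budget memory per request
    # instead of discovering the length (and a possible OOM) mid-batch.
    return len(tokenizer.encode(prompt).ids)

print(prompt_length("How many tokens is this prompt?"))
```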

Narsil avatar Jun 19 '23 09:06 Narsil

This would still happen today if a model does not have a fast tokenizer implemented, though.
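
For what it's worth, whether a given model ends up on the fast path can be checked like this (sketch, placeholder model id):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-org/your-model")  # placeholder id
# is_fast is False when only a slow tokenizer is available, which puts the
# router in the same fallback situation a use_fast=False option would.
print(tok.is_fast)
```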

psinger avatar Jun 19 '23 09:06 psinger

@Narsil what exactly has been tested?

I ran with other datasets too; XNLI is the hardest for tokenization because of UTF-8 issues.

There are some differences, mostly when you use many options like adding/removing tokens on the fly. Nothing that should trigger for inference, though. (And the issue is more that transformers has a huge API surface, and keeping internal state management correct across all of it is hard.)
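
The kind of on-the-fly mutation referred to here looks like this (sketch; the added token and model id are arbitrary):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model for illustration
# Adding tokens at runtime mutates the tokenizer's internal state; this is
# where slow and fast implementations are most likely to drift apart.
tok.add_tokens(["<custom_tag>"])
print(tok.encode("before <custom_tag> after"))
```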

Narsil avatar Jun 19 '23 09:06 Narsil

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jul 19 '24 01:07 github-actions[bot]