[Feature] tokenizer_manager accept external tokenizer or skip tokenizer init
Motivation
Currently, tokenizer_manager only supports initializing the tokenizer through the transformers utils. But many models are trained with a different tokenizer (e.g. MiniCPM, tiktoken). On the other hand, GenerateReqInput already supports input_ids, i.e. pre-tokenized input, which means the tokenizer is not needed at all for such generate requests. A minimal sketch of what this could look like is below.
By the way, vLLM supports skip_tokenizer_init; please consider a similar setting for flexibility.
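To make the request concrete, here is a minimal sketch of a pre-tokenized generate request. It assumes a local sglang server on port 30000; the input_ids field mirrors GenerateReqInput, while a skip-tokenizer mode (analogous to vLLM's skip_tokenizer_init) that returns raw ids instead of text is the behavior this issue proposes, not something that exists today:

```python
# Sketch of a pre-tokenized /generate request. Assumptions: a local sglang
# server on port 30000, and a hypothetical skip-tokenizer mode that makes the
# server return raw output ids instead of detokenized text.
import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "input_ids": [1, 15043, 3186],              # prompt already tokenized by the client
        "sampling_params": {"max_new_tokens": 16},  # standard sampling options
    },
)
print(resp.json())  # with the tokenizer skipped, this would carry raw output ids
```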
Thanks!
This should be easy to support. Could you give us a specific example or model name that we can run the test on? If the tokenizer is skipped, does it mean the server will accept input_ids and return output_ids without detokenization?
> This should be easy to support. Could you give us a specific example or model name that we can run the test on?
Maybe you can try https://huggingface.co/openbmb/cpm-bee-2b; my tokenizer is similar to it, an old-style vocab.txt list, which fails transformers AutoTokenizer init. Or https://huggingface.co/meta-llama/Meta-Llama-3-8B, which uses a tiktoken tokenizer (not sure if it is already integrated with transformers, though).
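For reference, such an old-style vocab.txt tokenizer can be wrapped in a few lines of plain Python. This is only a hypothetical sketch of the kind of external tokenizer object the feature would need to accept; none of these names are existing sglang interfaces, and the whitespace encoding is purely illustrative:

```python
# Hypothetical external tokenizer built from an old-style vocab.txt list
# (one token per line), the kind that AutoTokenizer cannot load.
class VocabListTokenizer:
    def __init__(self, vocab_path: str):
        with open(vocab_path, encoding="utf-8") as f:
            self.vocab = [line.rstrip("\n") for line in f]
        self.token_to_id = {tok: i for i, tok in enumerate(self.vocab)}

    def encode(self, text: str) -> list[int]:
        # Naive whitespace split, for illustration only.
        return [self.token_to_id[tok] for tok in text.split() if tok in self.token_to_id]

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.vocab[i] for i in ids)
```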
> If the tokenizer is skipped, does it mean the server will accept input_ids and return output_ids without detokenization?
Yes. I can run the encode/decode steps for those ids on my side.
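As an illustration of that client-side round trip, here is a small sketch using tiktoken; the "cl100k_base" encoding stands in for the model's real encoding, and the generated ids are placeholders rather than actual server output:

```python
# Sketch of client-side tokenization and detokenization around a server that
# works purely on ids (assumption: tiktoken's "cl100k_base" stands in for the
# model's real encoding).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

input_ids = enc.encode("The capital of France is")  # encode before sending
# ... server generates from input_ids and returns raw output ids ...
output_ids = [2114]                                 # placeholder generated ids
print(enc.decode(output_ids))                       # decode on the client
```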
Sounds good. We will look into this later. If you have bandwidth, contributions are welcome!
> Sounds good. We will look into this later. If you have bandwidth, contributions are welcome!
I have tried filing https://github.com/sgl-project/sglang/pull/959; could you help take a look?