SPMStreamingDetokenizer sometimes outputs incorrect multi-byte characters

Open wil24 opened this issue 1 year ago • 1 comments

Using the microsoft/Phi-3-medium-128k-instruct model, I received incorrect responses for multi-byte characters (commonly seen in Japanese or Chinese), as shown below:

mlx_lm.generate --model microsoft/Phi-3-medium-128k-instruct --prompt "こんにちは。自己紹介をお願いします"
==========
Prompt: <|user|>
こんにちは。自己紹介をお願いします<|end|>
<|assistant|>

こんにちは！ç§の名前はAIアシスタントです。ç§は、あなたの日常生活をサポートし、必要な情報を提ä¾するために設計されました。どうãよろしくおé¡いいたします！<|end|>
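The garbled characters look consistent with the UTF-8 bytes of a character being turned into text one byte at a time instead of being buffered and decoded together. For example, 私 is E7 A7 81 in UTF-8, and 0xE7 and 0xA7 on their own are "ç" and "§". This is only a guess at the cause. The snippet below is a standalone illustration and does not use any mlx_lm code:

```python
# Standalone illustration only; this is not the mlx_lm implementation.
text = "私"
raw = text.encode("utf-8")  # three bytes: E7 A7 81

# Treating each byte as its own code point reproduces the garbled output.
per_byte = "".join(chr(b) for b in raw)

# Decoding the bytes together recovers the original character.
buffered = raw.decode("utf-8")

print(repr(per_byte))
print(buffered)
```

The same pattern matches the other broken spots in the output above (供, ぞ, 願), where the first one or two bytes show up as Latin-1 characters and the rest are invisible control characters.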

As a workaround, the problem goes away if I set is_spm_decoder to False so that NaiveStreamingDetokenizer is used instead of SPMStreamingDetokenizer. The same command then gives:

mlx_lm.generate --model microsoft/Phi-3-medium-128k-instruct --prompt "こんにちは。自己紹介をお願いします"
==========
Prompt: <|user|>
こんにちは。自己紹介をお願いします<|end|>
<|assistant|>

こんにちは！私の名前はAIアシスタントです。私は、あなたの日常生活をサポートし、必要な情報を提供するために設計されました。どうぞよろしくお願いいたします！<|end|>

Are there any guidelines or recommendations on which Detokenizer class to use (or settings to apply) to get correct characters?

Jun 01 '24 09:06 wil24

Well, the streaming detokenizer and the naive detokenizer should give the same results. For now you can use the naive one until we fix the streaming one. It will be a little slower, but should otherwise work fine.
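To illustrate what "the same results" means for multi-byte text, here is a minimal sketch of a streaming decoder that holds back incomplete UTF-8 sequences and is checked against a one-shot decode. The function name stream_decode is hypothetical and it works on raw bytes, not on mlx_lm tokens, so treat it as an assumption-laden sketch and not as the library's detokenizer:

```python
import codecs


def stream_decode(chunks):
    """Yield text from byte chunks, buffering incomplete UTF-8 sequences.

    Hypothetical helper for illustration; not part of mlx_lm.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        piece = decoder.decode(chunk)
        if piece:  # empty while a multi-byte character is still incomplete
            yield piece
    # Flush; raises UnicodeDecodeError if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


expected = "私の名前"
data = expected.encode("utf-8")

# Worst case for streaming: every byte arrives as a separate chunk.
chunks = [data[i:i + 1] for i in range(len(data))]
streamed = "".join(stream_decode(chunks))

# The streamed result should match decoding everything at once.
assert streamed == data.decode("utf-8")
```

A check like this, run over a handful of Japanese or Chinese strings, is one way to confirm that a streaming path and a one-shot path agree.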

Jun 01 '24 13:06 awni