
WARNING: tokenization mismatch 139 vs. 141

Open ScottishFold007 opened this issue 2 years ago • 8 comments

Hello! During training, the data processing step always reports the following warning. What could be the reason, and what are the possible consequences?

[screenshot of the warning output]

ScottishFold007 avatar Apr 19 '23 10:04 ScottishFold007
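
For readers hitting the same warning: as far as I can tell it comes from the supervised preprocessing in FastChat's train.py, which masks non-assistant tokens turn by turn and then compares the length it accumulated against the number of tokens the tokenizer actually produced. A minimal sketch of that kind of check, with illustrative names rather than a verbatim copy of the repository code:

```python
# Minimal sketch of the consistency check behind the warning; cur_len/total_len
# follow the names used later in this thread and are illustrative.
IGNORE_TOKEN_ID = -100  # positions with this id are excluded from the loss

def check_tokenization(target, cur_len, total_len, tokenizer):
    # cur_len: token count accumulated while masking each conversation turn
    # total_len: number of non-padding tokens the tokenizer actually produced
    if cur_len < tokenizer.model_max_length and cur_len != total_len:
        # The whole example is masked out, so occasional mismatches only waste
        # those samples; frequent mismatches mean little or no training signal.
        target[:] = IGNORE_TOKEN_ID
        print(f"WARNING: tokenization mismatch {cur_len} vs. {total_len}.")
```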

I have the same error. Did you use the author's cleaning code to generate sharegpt_split.json, and then get this error when training on the sharegpt_split.json data?

Hi-archers avatar Apr 21 '23 04:04 Hi-archers

Please try to pull the latest version of FastChat and the latest version of our weights.

zhisbug avatar Apr 22 '23 02:04 zhisbug

yes

ScottishFold007 avatar Apr 22 '23 04:04 ScottishFold007

> I have the same error. Did you use the author's cleaning code to generate sharegpt_split.json, and then get this error when training on the sharegpt_split.json data?

Did you manage to solve it later? I don't know where the problem is coming from; the previous version didn't have this mismatch issue.

ScottishFold007 avatar Apr 22 '23 05:04 ScottishFold007

> I have the same error. Did you use the author's cleaning code to generate sharegpt_split.json, and then get this error when training on the sharegpt_split.json data?
>
> Did you manage to solve it later? I don't know where the problem is coming from; the previous version didn't have this mismatch issue.

I solved this problem with the code in https://github.com/lm-sys/FastChat/pull/537#issue-1677943634. It may be helpful to you.

Hi-archers avatar Apr 22 '23 05:04 Hi-archers

> I have the same error. Did you use the author's cleaning code to generate sharegpt_split.json, and then get this error when training on the sharegpt_split.json data?
>
> Did you manage to solve it later? I don't know where the problem is coming from; the previous version didn't have this mismatch issue.
>
> I solved this problem with the code in #537 (comment). It may be helpful to you.

[screenshot] Should this 2048 parameter also change along with max_length?

ScottishFold007 avatar Apr 22 '23 05:04 ScottishFold007

> I have the same error. Did you use the author's cleaning code to generate sharegpt_split.json, and then get this error when training on the sharegpt_split.json data?
>
> Did you manage to solve it later? I don't know where the problem is coming from; the previous version didn't have this mismatch issue.
>
> I solved this problem with the code in #537 (comment). It may be helpful to you.
>
> [screenshot] Should this 2048 parameter also change along with max_length?

Yes!

Hi-archers avatar Apr 22 '23 06:04 Hi-archers
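
For reference, the change discussed above is to let the truncation length follow the tokenizer's configured maximum instead of a literal 2048. A hedged sketch of that idea (function and parameter names are illustrative, not the exact lines in train.py):

```python
def tokenize_conversation(tokenizer, conversation_text):
    # Drive truncation from the configured maximum (set via --model_max_length)
    # instead of a hard-coded 2048, so preprocessing and training args agree.
    return tokenizer(
        conversation_text,
        return_tensors="pt",
        padding="max_length",
        max_length=tokenizer.model_max_length,  # was: max_length=2048
        truncation=True,
    ).input_ids
```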

I also encountered this problem (i.e. total_len == curr_len + 2). Further debugging revealed that it is caused by mis-tokenization of `</s>`.

The missing tokenizer_config.json and special_tokens_map.json in the converted model directory may be the cause of the wrong tokenizer config. Following tokenizer-issues, I updated these files from the Hugging Face repo, and the problem was fixed.

Btlmd avatar Apr 24 '23 16:04 Btlmd
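
A quick way to confirm the fix above: once tokenizer_config.json and special_tokens_map.json are back in the converted model directory, `</s>` should encode to the EOS token id instead of being split into several sub-word pieces. A small check along these lines (the path is a placeholder for your converted weights):

```python
from transformers import AutoTokenizer

# Placeholder path: point this at your converted Vicuna/LLaMA weights directory.
tok = AutoTokenizer.from_pretrained("path/to/converted-weights", use_fast=False)

ids = tok("</s>", add_special_tokens=False).input_ids
print(ids, tok.eos_token_id)
# With the special-token files in place, the output should contain the EOS id
# (2 for LLaMA) rather than separate pieces for '<', '/', 's', '>'.
```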