
WARNING: tokenization mismatch 139 vs. 141

Open ScottishFold007 opened this issue 2 years ago • 8 comments

Hello! During training, the data processing step always reports the following warning. What could be the reason, and what are the possible consequences?

[screenshot of the warning output]

ScottishFold007 avatar Apr 19 '23 10:04 ScottishFold007
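
For readers hitting the same warning: as far as I can tell it comes from the supervised preprocessing in FastChat's train.py, which masks non-assistant tokens turn by turn and then compares the length it accumulated against the number of tokens the tokenizer actually produced. A minimal sketch of that kind of check, with illustrative names rather than a verbatim copy of the repository code:

```python
# Minimal sketch of the consistency check behind the warning; cur_len/total_len
# follow the names used later in this thread and are illustrative.
IGNORE_TOKEN_ID = -100  # positions with this id are excluded from the loss

def check_tokenization(target, cur_len, total_len, tokenizer):
    # cur_len: token count accumulated while masking each conversation turn
    # total_len: number of non-padding tokens the tokenizer actually produced
    if cur_len < tokenizer.model_max_length and cur_len != total_len:
        # The whole example is masked out, so occasional mismatches only waste
        # those samples; frequent mismatches mean little or no training signal.
        target[:] = IGNORE_TOKEN_ID
        print(f"WARNING: tokenization mismatch {cur_len} vs. {total_len}.")
```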

I have the same error. Did you use the author's cleaning code to generate sharegpt_split.json, and then get this error when training on the sharegpt_split.json data?

Hi-archers avatar Apr 21 '23 04:04 Hi-archers

Please try to pull the latest version of FastChat and the latest version of our weights.

zhisbug avatar Apr 22 '23 02:04 zhisbug

yes

ScottishFold007 avatar Apr 22 '23 04:04 ScottishFold007

> I have the same error. Did you use the author's cleaning code to generate sharegpt_split.json, and then get this error when training on the sharegpt_split.json data?

Did you manage to solve it later? I don't know where the problem is coming from; the previous version didn't have this mismatch issue.

ScottishFold007 avatar Apr 22 '23 05:04 ScottishFold007

> I have the same error. Did you use the author's cleaning code to generate sharegpt_split.json, and then get this error when training on the sharegpt_split.json data?
>
> Did you manage to solve it later? I don't know where the problem is coming from; the previous version didn't have this mismatch issue.

I solved this problem with the code in https://github.com/lm-sys/FastChat/pull/537#issue-1677943634. It may be helpful to you.

Hi-archers avatar Apr 22 '23 05:04 Hi-archers

> I have the same error. Did you use the author's cleaning code to generate sharegpt_split.json, and then get this error when training on the sharegpt_split.json data?
>
> Did you manage to solve it later? I don't know where the problem is coming from; the previous version didn't have this mismatch issue.
>
> I solved this problem with the code in #537 (comment). It may be helpful to you.

[screenshot] Should this 2048 parameter also change along with max_length?

ScottishFold007 avatar Apr 22 '23 05:04 ScottishFold007

> I have the same error. Did you use the author's cleaning code to generate sharegpt_split.json, and then get this error when training on the sharegpt_split.json data?
>
> Did you manage to solve it later? I don't know where the problem is coming from; the previous version didn't have this mismatch issue.
>
> I solved this problem with the code in #537 (comment). It may be helpful to you.
>
> [screenshot] Should this 2048 parameter also change along with max_length?

Yes!

Hi-archers avatar Apr 22 '23 06:04 Hi-archers
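
For reference, the change discussed above is to let the truncation length follow the tokenizer's configured maximum instead of a literal 2048. A hedged sketch of that idea (function and parameter names are illustrative, not the exact lines in train.py):

```python
def tokenize_conversation(tokenizer, conversation_text):
    # Drive truncation from the configured maximum (set via --model_max_length)
    # instead of a hard-coded 2048, so preprocessing and training args agree.
    return tokenizer(
        conversation_text,
        return_tensors="pt",
        padding="max_length",
        max_length=tokenizer.model_max_length,  # was: max_length=2048
        truncation=True,
    ).input_ids
```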

I also encountered this problem (i.e. total_len == curr_len + 2). Further debugging revealed that it is caused by mis-tokenization of `</s>`.

The missing tokenizer_config.json and special_tokens_map.json in the converted model directory may be the cause of the wrong tokenizer config. Following tokenizer-issues, I updated these files from the Hugging Face repo, and the problem was fixed.

Btlmd avatar Apr 24 '23 16:04 Btlmd
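
A quick way to confirm the fix above: once tokenizer_config.json and special_tokens_map.json are back in the converted model directory, `</s>` should encode to the EOS token id instead of being split into several sub-word pieces. A small check along these lines (the path is a placeholder for your converted weights):

```python
from transformers import AutoTokenizer

# Placeholder path: point this at your converted Vicuna/LLaMA weights directory.
tok = AutoTokenizer.from_pretrained("path/to/converted-weights", use_fast=False)

ids = tok("</s>", add_special_tokens=False).input_ids
print(ids, tok.eos_token_id)
# With the special-token files in place, the output should contain the EOS id
# (2 for LLaMA) rather than separate pieces for '<', '/', 's', '>'.
```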