BERT-NER-Pytorch ner_seq.py", line 146, in convert_examples_to_features assert len(label_ids) == max_seq

Why do I get this error when I use my own data?

File "BERT-NER-Pytorch-master/processors/ner_seq.py", line 146, in convert_examples_to_features
    assert len(label_ids) == max_seq_length
AssertionError

Jun 09 '21 02:06 tianke0711

Some characters, such as ‘’ (for example x  x x 0 0 x), cannot be tokenized. How should this be handled?

Jun 09 '21 05:06 tianke0711

Either replace them with [unused], or just use [unk] directly.

Jun 11 '21 04:06 lonePatient

@lonePatient Thanks.

Jun 11 '21 05:06 tianke0711

Just replace the invisible characters with visible ones. I ran into this problem only yesterday.
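A minimal sketch of that replacement, assuming the input is already a list with one character per label. The names `PLACEHOLDER`, `is_invisible`, and `make_visible` are hypothetical and not part of the repository, and the placeholder character is only an assumption: check that whatever you pick exists in your tokenizer's vocabulary.

```python
import unicodedata

# Hypothetical visible stand-in; pick a character your vocabulary actually contains.
PLACEHOLDER = "□"

# Unicode categories with no visible glyph: control, format (e.g. zero-width
# space, BOM), and the space/line/paragraph separators.
INVISIBLE_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}


def is_invisible(token: str) -> bool:
    """Return True if the token is empty or made up only of invisible characters."""
    return all(
        ch.isspace() or unicodedata.category(ch) in INVISIBLE_CATEGORIES
        for ch in token
    )


def make_visible(chars):
    """Replace invisible entries one-for-one, so the list length never changes."""
    return [PLACEHOLDER if is_invisible(ch) else ch for ch in chars]


chars = ["中", "\u200b", "国", " ", "\ufeff"]
cleaned = make_visible(chars)
print(cleaned)
```

Because every invisible entry is swapped for exactly one visible entry, the cleaned list has the same length as the label list, which is the property the failing assertion depends on.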

Jun 30 '21 03:06 jacksonjack001

I added the following code to solve the problem:

        if len(tokens) != len(label_ids):
            # When example.text_a contains special characters, tokenizer.tokenize
            # can return a different number of tokens than there are labels.
            # Skip such examples. jump_count must be initialized to 0 before the loop.
            jump_count += 1
            logger.info("-> *** len(tokens) != len(label_ids)  ***  <-")
            logger.info(" ex_index = {}, tokens = {} ".format(ex_index, tokens))
            logger.info(" ex_index = {}, label_ids = {} ".format(ex_index, label_ids))
            continue

Sep 07 '21 11:09 lvjiujin

Just replace the invisible characters with visible ones. I ran into this problem only yesterday.

Replacing one character with another is not a good approach.

Sep 07 '21 11:09 lvjiujin

Either replace them with [unused], or just use [unk] directly.

You'd better not use '[UNK]', because you don't know the exact position of the '[UNK]'. If you do use it, the labels may end up shifted to the wrong positions. You can adopt my approach to solve the problem instead.

Sep 07 '21 11:09 lvjiujin

I added the following code to solve the problem:

        if len(tokens) != len(label_ids):
            # When example.text_a contains special characters, tokenizer.tokenize
            # can return a different number of tokens than there are labels.
            # Skip such examples. jump_count must be initialized to 0 before the loop.
            jump_count += 1
            logger.info("-> *** len(tokens) != len(label_ids)  ***  <-")
            logger.info(" ex_index = {}, tokens = {} ".format(ex_index, tokens))
            logger.info(" ex_index = {}, label_ids = {} ".format(ex_index, label_ids))
            continue

May I ask where this code should be inserted?

Sep 24 '21 17:09 lj976264709

Has anyone actually solved this problem?

Apr 06 '22 11:04 Kissingbymodi

Found the cause: the data contains lines where the token is a space and the label is 0. tokens does not include the space, so label_ids has one extra label and the lengths are not equal. Solution:

        if len(tokens) != len(label_ids):
            logger.info("-> *** len(tokens) != len(label_ids) *** <-")
            logger.info(" ex_index = {}, tokens = {} ".format(ex_index, tokens))
            logger.info(len(tokens))
            logger.info(" ex_index = {}, label_ids = {} ".format(ex_index, label_ids))
            logger.info(len(label_ids))

Use the output to find the bad data, then fix it.
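The same kind of bad line can also be looked for directly in the data file before training. The sketch below assumes one `token label` pair per line, separated by spaces or a tab, with empty lines between sentences. `find_bad_lines` is a hypothetical helper, not something the repository provides.

```python
def find_bad_lines(lines):
    """Return (line_number, line) for every non-empty line that lacks a token or a label."""
    bad = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line.strip() == "":
            continue  # a truly empty line is a sentence boundary, not an error
        # Splitting on any whitespace drops a whitespace-only token, leaving one field.
        if len(line.split()) < 2:
            bad.append((lineno, line))
    return bad


sample = ["中 B-LOC\n", "  O\n", "国 I-LOC\n", "\n", "\tO\n"]
for lineno, line in find_bad_lines(sample):
    print(lineno, repr(line))
```

For a real file, the lines can be read with `open(path, encoding="utf-8")` and passed to the function as-is.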

Feb 16 '23 10:02 zmz125

Add this at line 141 (above the four len lines):

        if len(tokens) != len(label_ids):
            logger.info("-> *** len(tokens) != len(label_ids)  ***  <-")
            logger.info(" ex_index = {}, tokens = {} ".format(ex_index, tokens))
            logger.info(len(tokens))
            logger.info(" ex_index = {}, label_ids = {} ".format(ex_index, label_ids))
            logger.info(len(label_ids))

Feb 16 '23 10:02 zmz125

Add this at line 141 (above the four len lines):

        if len(tokens) != len(label_ids):
            logger.info("-> *** len(tokens) != len(label_ids)  ***  <-")
            logger.info(" ex_index = {}, tokens = {} ".format(ex_index, tokens))
            logger.info(len(tokens))
            logger.info(" ex_index = {}, label_ids = {} ".format(ex_index, label_ids))
            logger.info(len(label_ids))

In the output this produces, the length of tokens is smaller than the length of label_ids. What causes that?

Feb 22 '23 15:02 gsq47

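Add this at line 141 (above the four len lines):

        if len(tokens) != len(label_ids):
            logger.info("-> *** len(tokens) != len(label_ids)  ***  <-")
            logger.info(" ex_index = {}, tokens = {} ".format(ex_index, tokens))
            logger.info(len(tokens))
            logger.info(" ex_index = {}, label_ids = {} ".format(ex_index, label_ids))
            logger.info(len(label_ids))

In the output this produces, the length of tokens is smaller than the length of label_ids. What causes that?

Compare each character against its label after mapping, or go straight to that example in the data and check whether the annotation is correct. The 31 in your output is probably the label O, matching its index in the list returned by get_labels.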

Feb 22 '23 15:02 zmz125

May I ask why the annotation labels themselves are being picked up? Every line has the format text\tlabel.

Feb 22 '23 15:02 gsq47

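May I ask why the annotation labels themselves are being picked up? Every line has the format text\tlabel.

The format has to be consistent: either every line is space-separated or every line uses \t. Then change the separator character in the code to match.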
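To illustrate why a mismatched separator makes the label show up as text, here is a small sketch. `parse_line` is a hypothetical helper that accepts either separator by splitting on any whitespace. It is an alternative to editing the separator in the repository's reader, not a description of how that reader works.

```python
def parse_line(line):
    """Split 'token<sep>label' where <sep> is a tab or spaces; return (token, label)."""
    fields = line.rstrip("\n").rsplit(maxsplit=1)
    if len(fields) != 2:
        raise ValueError("expected a token and a label, got: {!r}".format(line))
    return fields[0], fields[1]


tab_line = "中\tB-LOC\n"
space_line = "中 B-LOC\n"

# Splitting a tab-separated line on a single space finds no separator at all,
# so the whole line, label included, comes back as one field.
print(tab_line.rstrip("\n").split(" "))

# Splitting on any whitespace handles both files the same way.
print(parse_line(tab_line), parse_line(space_line))
```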

Feb 22 '23 15:02 zmz125

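May I ask why the annotation labels themselves are being picked up? Every line has the format text\tlabel.

The format has to be consistent: either every line is space-separated or every line uses \t. Then change the separator character in the code to match.

Thank you very much!!!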

Feb 22 '23 15:02 gsq47

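Add this at line 141 (above the four len lines):

        if len(tokens) != len(label_ids):
            logger.info("-> *** len(tokens) != len(label_ids)  ***  <-")
            logger.info(" ex_index = {}, tokens = {} ".format(ex_index, tokens))
            logger.info(len(tokens))
            logger.info(" ex_index = {}, label_ids = {} ".format(ex_index, label_ids))
            logger.info(len(label_ids))

In the output this produces, the length of tokens is smaller than the length of label_ids. What causes that?

Compare each character against its label after mapping, or go straight to that example in the data and check whether the annotation is correct. The 31 in your output is probably the label O, matching its index in the list returned by get_labels.

I haven't quite figured out how to handle this. I ran into the same situation: the annotation is correct, but the labels after it are all 0, meaning they have all been replaced with X. My understanding is that the code pads every sentence to the same length to make training easier, but I don't know why it raises this error.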

Apr 14 '23 03:04 h83671979

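Some characters, such as ‘’ (for example x  x x 0 0 x), cannot be tokenized. How should this be handled?

That case is manageable: comments containing things like emoji can simply be deleted. In my case, though, all English characters come out as [unk]. What should I do?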

Apr 14 '23 03:04 h83671979

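Some characters, such as ‘’ (for example x  x x 0 0 x), cannot be tokenized. How should this be handled?

That case is manageable: comments containing things like emoji can simply be deleted. In my case, though, all English characters come out as [unk]. What should I do?

Hello, have you solved this problem yet? It may be that this model targets Chinese, but I'm not sure where to fix training on English characters.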

Dec 29 '23 15:12 Violettttee

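Some characters, such as ‘’ (for example x  x x 0 0 x), cannot be tokenized. How should this be handled?

That case is manageable: comments containing things like emoji can simply be deleted. In my case, though, all English characters come out as [unk]. What should I do?

Hello, have you solved this problem yet? It may be that this model targets Chinese, but I'm not sure where to fix training on English characters.

Hello! This is mainly aimed at Chinese. For English you can modify the tokenizer yourself, or refer to the code in another repository, torchblocks. As I recall, it should support that.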

Jan 05 '24 09:01 lonePatient
