mrc-for-flat-nested-ner

Switching to my own dataset raises an error; training fails

Open · gjy-code opened this issue 3 years ago • 14 comments

I get an error when training on my own MRC-format dataset:

    Traceback (most recent call last):
      File "/home/amax/work/gjy/mrc-for-flat-nested-ner-master/train/mrc_ner_trainer.py", line 430, in <module>
        main()
      File "/home/amax/work/gjy/mrc-for-flat-nested-ner-master/train/mrc_ner_trainer.py", line 417, in main
        trainer.fit(model)
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
        result = fn(self, *args, **kwargs)
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1046, in fit
        self.accelerator_backend.train(model)
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 57, in train
        self.ddp_train(process_idx=self.task_idx, mp_queue=None, model=model)
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 224, in ddp_train
        results = self.trainer.run_pretrain_routine(model)
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
        self.train()
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
        self.run_training_epoch()
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 479, in run_training_epoch
        enumerate(_with_is_last(train_dataloader)), "get_train_batch"
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/profiler/profilers.py", line 78, in profile_iterable
        value = next(iterator)
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 1323, in _with_is_last
        for val in it:
      File "/home/amax/py36env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
        data = self._next_data()
      File "/home/amax/py36env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
        data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
      File "/home/amax/py36env/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/amax/py36env/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/amax/work/gjy/mrc-for-flat-nested-ner-master/datasets/mrc_ner_dataset.py", line 96, in __getitem__
        new_end_positions = [origin_offset2token_idx_end[end] for end in end_positions]
      File "/home/amax/work/gjy/mrc-for-flat-nested-ner-master/datasets/mrc_ner_dataset.py", line 96, in <listcomp>
        new_end_positions = [origin_offset2token_idx_end[end] for end in end_positions]
    KeyError: 46

gjy-code · Dec 30 '21

Hi, did you solve this? It doesn't work with my own dataset either.

Josson · Feb 11 '22

Haha, I ran into the same problem. Did you solve it? I suspect it's a Chinese-text issue.

guantao18 · Feb 25 '22

@guantao18 It's probably caused by Chinese mixed with English or other characters.

Josson · Feb 25 '22

@Josson Yes, it's a problem caused by BERT's WordPiece tokenization. For English letters and digits, BERT tokenizes by longest match, so if the annotations weren't made by that rule, the positions before and after tokenization get misaligned. The fix is either to re-annotate the data after splitting it with BERT's tokenization rules, or to prepend # to everything containing letters or digits; with that, training works, but I don't know whether it introduces new problems.
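
For illustration, here is a minimal sketch of the misalignment using HuggingFace's BertTokenizerFast (an assumption for the sketch; the repo builds its own offset map in mrc_ner_dataset.py):

    # Sketch only: shows why an annotated character offset can be missing from
    # origin_offset2token_idx_end when letters/digits are tokenized by WordPiece.
    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
    context = "阿司匹林100mg"  # Chinese mixed with letters and digits

    enc = tokenizer(context, return_offsets_mapping=True, add_special_tokens=False)
    print(enc.tokens())     # e.g. ['阿', '司', '匹', '林', '100', '##mg']
    end_map = {end: idx for idx, (start, end) in enumerate(enc["offset_mapping"])}
    print(sorted(end_map))  # the only character offsets that are valid span ends

    # An annotated end offset that falls inside a WordPiece (say, between the
    # digits of "100") is absent from end_map, so the dataset's lookup
    # origin_offset2token_idx_end[end] raises exactly the KeyError reported above.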

guantao18 · Feb 28 '22

@guantao18 Have you run into the problem where the mrc-ner scripts insist on using GPU 0? How can that be fixed?

Josson · Feb 28 '22

@Josson Don't specify a GPU id in the script; delete it and the program will find an available GPU on its own. If you don't need multi-GPU training, just set the parameter gpus="1".
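
A general PyTorch workaround (not specific to this repo) is to mask the other cards before CUDA initializes, so no context is ever created on GPU 0:

    # Hide every GPU except the one you want; must run before torch touches CUDA.
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"

    import torch
    print(torch.cuda.device_count())  # 1: physical GPU 1 now appears as cuda:0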

guantao18 · Mar 01 '22

> @guantao18 It's probably caused by Chinese mixed with English or other characters.

Hi, have you solved it by now? How did you solve it?

gjy-code · Mar 01 '22

> @Josson Yes, it's a problem caused by BERT's WordPiece tokenization. For English letters and digits, BERT tokenizes by longest match, so if the annotations weren't made by that rule, the positions before and after tokenization get misaligned. The fix is either to re-annotate the data after splitting it with BERT's tokenization rules, or to prepend # to everything containing letters or digits; with that, training works, but I don't know whether it introduces new problems.

Hi, have you solved it?

gjy-code · Mar 01 '22

@gjy-code I switched to a different way of tokenizing that doesn't use WordPiece, and it runs now.
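
For reference, character-level tokenization sidesteps the problem entirely, since every character offset becomes a token boundary. A minimal sketch of the idea (an illustration, not necessarily the exact change made here):

    # Illustration only: with one token per character, both offset maps are
    # total over non-space characters, so the KeyError cannot occur.
    def char_tokenize(context):
        tokens, offsets = [], []
        for i, ch in enumerate(context):
            if not ch.isspace():
                tokens.append(ch)
                offsets.append((i, i + 1))
        return tokens, offsets

    tokens, offsets = char_tokenize("阿司匹林100mg")
    origin_offset2token_idx_end = {end: idx for idx, (_, end) in enumerate(offsets)}
    # every non-space character's end offset is now a valid key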

Josson · Mar 01 '22

@guantao18 After I deleted it, GPU 0 was occupied by someone else and I still got an out-of-memory error on GPU 0.

Josson · Mar 01 '22

@Josson Reduce max-length to below 200; if that's not enough, reduce batch-size. If it still fails, wait until the other user is done with the card.

guantao18 · Mar 02 '22

@Josson Which tokenizer did you use?

bannima · Aug 31 '22

If the text is English, you can fix the problem by collapsing runs of whitespace into single spaces during preprocessing (see the lines marked # here !!! below):

        merged_multi_span_data = []
        for p in data['data'][0]['paragraphs']:
            for ques in p['qas']:
                # here !!! collapse runs of whitespace so word counting is reliable
                p['context'] = " ".join(p['context'].split())
                current_example = {"id": len(merged_multi_span_data) + 1, "query": ques['question'],
                                   "context": p['context'], "start_position": [], "end_position": [],
                                   "span_position": [], "is_impossible": False}
                for ans in ques['answers']:
                    # here !!! normalize the answer text the same way as the context
                    ans['text'] = " ".join(ans['text'].split())
                    ans_tokens = ans['text'].lower().split()
                    context_tokens = p['context'].lower().split()
                    ans_text = " ".join(ans_tokens)
                    context_text = " ".join(context_tokens)

                    # word-level start index = number of spaces before the match;
                    # note: index() takes the first occurrence only, may also match
                    # inside a longer word, and raises ValueError if the answer is
                    # not found in the context
                    start = p['context'][:context_text.index(ans_text)].count(" ")
                    end = start + ans['text'].count(" ")  # index of the answer's last word
                    current_example['start_position'].append(start)
                    current_example['end_position'].append(end)
                    current_example['span_position'].append("{};{}".format(start, end))

                merged_multi_span_data.append(current_example)
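
A quick sanity check on the result (my addition, assuming the structures built above): recover each span from its word indices, so any off-by-one in start/end shows up immediately:

    for example in merged_multi_span_data:
        words = example["context"].split()
        for span in example["span_position"]:
            s, e = map(int, span.split(";"))
            print(" ".join(words[s:e + 1]))  # should echo the annotated answer (up to case)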

liulizuel · Oct 19 '22