cdQA
IndexError while fine tuning model
Hi,
I am trying to fine-tune the BERT model on my custom dataset. I generated the JSON file from the PDF using df2squad,
fed it to the cdQA annotator, added several question-answer pairs and generated a new JSON file.
However, when I try to fine-tune using cdqa_pipeline.fit_reader('cdqa-v1.1.json'),
I get the following error:
python3.6/site-packages/cdqa/reader/bertqa_sklearn.py", line 190, in read_squad_examples
answer_offset + answer_length - 1
IndexError: list index out of range
I also tried the fit_transform('cdqa-v1.1.json') method using BertProcessor and still get the same error.
Any idea what the problem could be?
It might be a problem with the length of the answers in your JSON dataset. For example, do you have any empty answers?
Hi, here's part of the JSON file (generated by the annotator):
{"question":"how to install Ethernet connector","id":"a9c80a82-04d6-4ef4-9726-f292816f2bcf","answers":[{"answer_start":-1,"text":"Procedure Insert the metal plate"
The answer isn't empty. Is there anything wrong with the format?
I also tried shortening the answer further and generated another JSON file. Now I get this error:
Could not find answer: '' vs. 'Insert the metal plate'
Traceback (most recent call last):
File "temp.py", line 28, in <module>
cdqa_pipeline.fit_reader('2.json') #cdqa-v1.1.json
/lib/python3.6/site-packages/cdqa/reader/bertqa_sklearn.py", line 1291, in fit
train_sampler = RandomSampler(train_data)
/python3.6/site-packages/torch/utils/data/sampler.py", line 94, in __init__
"value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
:(
Hi. Maybe you should remove that one. I have run into this problem before. The annotator only works for SQuAD v1.1-style datasets: the answer has to be a span taken directly from the paragraph, not text typed into the answer box.
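In case it helps anyone hitting the same errors, here is a minimal sketch for spotting such entries before fine-tuning. It assumes the file follows the standard SQuAD v1.1 layout (data, paragraphs, context, qas, answers). The helper name find_bad_answers and the sample data are made up for illustration and are not part of cdQA. An answer_start of -1, as in the excerpt above, would most likely be flagged by it.

```python
def find_bad_answers(squad):
    """Return (question id, reason) pairs for answers that are not valid spans."""
    problems = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    text = answer["text"]
                    start = answer["answer_start"]
                    if not text.strip():
                        problems.append((qa["id"], "empty answer text"))
                    elif start < 0:
                        problems.append((qa["id"], "answer_start is negative"))
                    elif context[start:start + len(text)] != text:
                        problems.append((qa["id"], "text does not match context"))
    return problems


# Hypothetical miniature dataset: q1 was typed in by hand, q2 is a real span.
context = "Procedure: Insert the metal plate into the slot."
sample = {
    "data": [{
        "title": "example",
        "paragraphs": [{
            "context": context,
            "qas": [
                {"id": "q1", "question": "how to install Ethernet connector",
                 "answers": [{"answer_start": -1,
                              "text": "Procedure Insert the metal plate"}]},
                {"id": "q2", "question": "what should be inserted",
                 "answers": [{"answer_start": context.find("Insert the metal plate"),
                              "text": "Insert the metal plate"}]},
            ],
        }],
    }]
}

print(find_bad_answers(sample))
```

To run it on a real file, load the file with json.load and pass the resulting dict to the helper. Any question it reports would need to be re-annotated by selecting the answer in the paragraph, or removed from the file.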