cdQA: IndexError while fine-tuning the model

Hi,

I am trying to fine-tune the BERT model on my custom dataset. I generated the JSON file from the PDF using df2squad, fed it to the cdQA annotator, added several question-answer pairs, and generated the new JSON file.

However, when I try to fine-tune with cdqa_pipeline.fit_reader('cdqa-v1.1.json'), I get the following error:

python3.6/site-packages/cdqa/reader/bertqa_sklearn.py", line 190, in read_squad_examples
    answer_offset + answer_length - 1
IndexError: list index out of range

I also tried the fit_transform('cdqa-v1.1.json') method using BertProcessor and still get the same error.

Any idea what the problem could be?

Jan 29 '20 20:01 raghavgurbaxani

It might be a problem with the length of the answers in your JSON dataset. For example, do you have any empty answers?
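A quick way to check for empty answers is to walk the dataset and list every question whose answer list is missing or contains a blank text. This is only a minimal sketch: it assumes the file follows the usual SQuAD v1.1 layout (data, paragraphs, qas, answers), and both the helper name find_empty_answers and the tiny sample dataset are made up for illustration.

```python
def find_empty_answers(squad):
    """Return the ids of questions with no answers or a blank answer text."""
    flagged = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                # Flag the question if it has no answers at all,
                # or if any answer text is empty or whitespace only.
                if not answers or any(not a.get("text", "").strip() for a in answers):
                    flagged.append(qa["id"])
    return flagged


# Hypothetical two-question dataset in SQuAD v1.1 layout.
sample = {
    "version": "v1.1",
    "data": [{
        "title": "manual",
        "paragraphs": [{
            "context": "Insert the metal plate into the slot.",
            "qas": [
                {"id": "q1", "question": "What is inserted?",
                 "answers": [{"answer_start": 11, "text": "metal plate"}]},
                {"id": "q2", "question": "Where does it go?",
                 "answers": [{"answer_start": 0, "text": ""}]},
            ],
        }],
    }],
}

print(find_empty_answers(sample))  # ['q2']
```

To run it on a real file, load the file with json.load and pass the resulting dict to the function instead of the sample.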

Jan 29 '20 20:01 n0thingLLM

Hi, here's part of the JSON file (generated by the annotator):

{"question":"how to install Ethernet connector","id":"a9c80a82-04d6-4ef4-9726-f292816f2bcf","answers":[{"answer_start":-1,"text":"Procedure Insert the metal plate"

The answer isn't empty. Is there anything wrong with the format?

I also tried shortening the answer further and generated another JSON file. Now I get this error:

Could not find answer: '' vs. 'Insert the metal plate'
Traceback (most recent call last):
  File "temp.py", line 28, in <module>
    cdqa_pipeline.fit_reader('2.json') #cdqa-v1.1.json
/lib/python3.6/site-packages/cdqa/reader/bertqa_sklearn.py", line 1291, in fit
    train_sampler = RandomSampler(train_data)
/python3.6/site-packages/torch/utils/data/sampler.py", line 94, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0


:(

Jan 29 '20 20:01 raghavgurbaxani

Hi. Maybe you should remove that question-answer pair. I have run into this problem before. The annotator only works for SQuAD v1.1-style datasets: the answer has to be a span taken directly from the paragraph, not text entered into the answer box.
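Pairs like this can probably be filtered out before training. In the excerpt above, answer_start is -1, which suggests the answer text was not located in the paragraph. The sketch below keeps only questions whose answers are literal spans of their context and reports the ids it drops. It assumes the standard SQuAD v1.1 layout; the helper names is_span and drop_misaligned and the sample data are hypothetical and not part of cdQA.

```python
import copy


def is_span(context, answer):
    """True if the answer text sits in the context exactly at answer_start."""
    start = answer.get("answer_start", -1)
    text = answer.get("text", "")
    return start >= 0 and bool(text) and context[start:start + len(text)] == text


def drop_misaligned(squad):
    """Return (cleaned copy, dropped ids); the input dict is left untouched."""
    cleaned = copy.deepcopy(squad)
    dropped = []
    for article in cleaned["data"]:
        for paragraph in article["paragraphs"]:
            kept = []
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                # Keep a question only if every answer is a literal span.
                if answers and all(is_span(paragraph["context"], a) for a in answers):
                    kept.append(qa)
                else:
                    dropped.append(qa["id"])
            paragraph["qas"] = kept
    return cleaned, dropped


# Hypothetical data: q1 mimics the typed-in answer with answer_start -1.
sample = {
    "version": "v1.1",
    "data": [{
        "title": "manual",
        "paragraphs": [{
            "context": "Insert the metal plate into the slot.",
            "qas": [
                {"id": "q1", "question": "how to install Ethernet connector",
                 "answers": [{"answer_start": -1,
                              "text": "Procedure Insert the metal plate"}]},
                {"id": "q2", "question": "What is the first step?",
                 "answers": [{"answer_start": 0, "text": "Insert the metal plate"}]},
            ],
        }],
    }],
}

cleaned, dropped = drop_misaligned(sample)
print(dropped)  # ['q1']
```

If every question ends up in the dropped list, no training examples remain, which would be consistent with the num_samples=0 error reported above.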

Feb 11 '20 01:02 tianpaul01