Avoid ValueError: substring not found

Open Yueeeeeeee opened this issue 4 years ago • 4 comments

In some cases the answer can't be found in the input text and a ValueError is raised. Adding a try/except would avoid such errors.
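
A minimal sketch of what such a guard could look like, modelled on the highlighting lines shown in the stack trace further down. The helper name `highlight_answer` is hypothetical and not part of pipelines.py; returning None for a missing answer is an assumption about how a caller would want to skip it.

```python
from typing import Optional


def highlight_answer(sent: str, answer_text: str) -> Optional[str]:
    """Wrap answer_text in <hl> markers inside sent.

    Hypothetical helper: returns None instead of raising when the
    answer is not a substring of the sentence.
    """
    answer_text = answer_text.strip()
    try:
        # str.index raises ValueError when the substring is absent
        ans_start_idx = sent.index(answer_text)
    except ValueError:
        return None
    ans_end_idx = ans_start_idx + len(answer_text)
    return f"{sent[:ans_start_idx]} <hl> {answer_text} <hl> {sent[ans_end_idx:]}"
```

A caller would then drop the answers for which this returns None rather than letting the whole pipeline call fail.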

Yueeeeeeee avatar Jan 23 '21 16:01 Yueeeeeeee

In my case, the substring was not found because the answers were padded (like Ans Entity). Strangely, I only hit this error when running in Jupyter; when running from the terminal, no such error occurred.

violetcodes avatar Apr 16 '21 07:04 violetcodes

Yes please! I have found the same error but hadn't fully worked out why just yet.

Here is a minimal example:

from pipelines import pipeline

# load in the multi task qa qg
MODEL = pipeline("multitask-qa-qg")

# problem text
text = 'The herb is generally safe to use. There is limited research to suggest that stinging nettle is an effective remedy. Researchers need to do more studies before they can confirm the health benefits of stinging nettle.'

MODEL(text)

Full stack trace:


---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-59-1ab007d28390> in <module>()
      7 text = 'The herb is generally safe to use. There is limited research to suggest that stinging nettle is an effective remedy. Researchers need to do more studies before they can confirm the health benefits of stinging nettle.'
      8 
----> 9 MODEL(text)

2 frames

/content/question_generation/pipelines.py in _prepare_inputs_for_qg_from_answers_hl(self, sents, answers)
    140                 answer_text = answer_text.strip()
    141 
--> 142                 ans_start_idx = sent.index(answer_text)
    143 
    144                 sent = f"{sent[:ans_start_idx]} <hl> {answer_text} <hl> {sent[ans_start_idx + len(answer_text): ]}"

ValueError: substring not found

matt-mkidd-ko avatar Apr 20 '21 17:04 matt-mkidd-ko

In this specific case, the error occurs because for the sentence "Researchers need to do more studies before they can confirm the health benefits of stinging nettle.", the generated answer is "Do more studies" instead of "do more studies". The call ans_start_idx = sent.index(answer_text) (line 142) is case-sensitive, so looking up "Do more studies" raises this ValueError.

Since the T5 model is uncased anyway, a simple solution would be to replace line 137 and line 140 in pipelines.py, respectively, with:

sent = sents[i].lower()
answer_text = answer_text.strip().lower()

This should solve your problem :)
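
For anyone who would rather keep the sentence's original casing, here is a hedged sketch of a case-insensitive lookup that only lowercases for the search itself. The function name `find_answer_span` is hypothetical, and the sketch assumes text where lower() does not change string length (true for ASCII, not for every Unicode character).

```python
from typing import Optional, Tuple


def find_answer_span(sent: str, answer_text: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of answer_text in sent, ignoring case.

    Hypothetical helper: returns None when the answer is not found.
    Assumes lower() preserves string length for the given text.
    """
    answer_text = answer_text.strip()
    # str.find returns -1 instead of raising when nothing matches
    start = sent.lower().find(answer_text.lower())
    if start == -1:
        return None
    return start, start + len(answer_text)
```

The returned indices can be used to slice the original sentence, so the highlighted text keeps the casing from the input rather than from the generated answer.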

Yueeeeeeee avatar Apr 20 '21 18:04 Yueeeeeeee

The error was mainly caused by a "<pad>" token at the beginning of some answers, which meant the answer couldn't be found in "sent".

So I added the following line at line 141 to remove the token from the answer (this requires re to be imported):

answer_text = re.sub("<pad> | <pad>", "", answer_text)

Since this addition, the code has worked on all the examples I've tried so far.
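
As a self-contained sketch of the same idea, with the import included: the helper name `strip_pad_tokens` is hypothetical, and the pattern is a slightly broader variant of the one above (an assumption on my part) that also handles a "<pad>" with no adjacent space.

```python
import re


def strip_pad_tokens(answer_text: str) -> str:
    """Remove literal "<pad>" tokens and surrounding whitespace.

    Hypothetical helper: each "<pad>" (plus any whitespace around it)
    is replaced by one space, then the result is trimmed.
    """
    return re.sub(r"\s*<pad>\s*", " ", answer_text).strip()
```

Answers that contain no "<pad>" token pass through unchanged apart from the trimming of leading and trailing whitespace.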

Cheers!

mukulmalik18 avatar May 19 '21 06:05 mukulmalik18