RemoveSpecificWords is not functioning as expected
Hi! As the title says, the RemoveSpecificWords function does not work as I would expect it to. As an example, the following code
import jiwer

text = "he asked a helpful question"
stop_words = ['a', 'he']
print(jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
    jiwer.SentencesToListOfWords(),
    jiwer.RemoveEmptyStrings(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveSpecificWords(stop_words),
])(text))
returns
['', 'sked', '', 'lpful', 'question']
Is there a way to get this function to recognize word boundaries? Thank you!
Change the order of the compose:
text = "he asked a helpful question"
stop_words = ['a', 'he']
print(jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.Strip(),
    jiwer.RemoveSpecificWords(stop_words),
    jiwer.RemoveMultipleSpaces(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveEmptyStrings(),
    jiwer.SentencesToListOfWords(),
])(text))
Also, you might want to use SubstituteWords instead.
I have an issue with 'asked' becoming 'sked' when removing the word 'a' -- the order of operations doesn't change that.
One way to fix it is to split on a word delimiter, filter out the unwanted words, and return the joined string. Let me know if you're open to a PR to fix this.
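That split-filter-join fix could look something like this (a minimal sketch, not the actual jiwer patch; the function name and signature are hypothetical):

```python
def remove_specific_words(s, words_to_remove):
    # Split on whitespace so each token is a whole word, drop the
    # tokens that match a word to remove, and re-join the rest.
    unwanted = set(words_to_remove)
    return " ".join(w for w in s.split() if w not in unwanted)

print(remove_specific_words("he asked a helpful question", ["a", "he"]))
# asked helpful question
```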
Can you give a code example?
Yes, it's the example you gave above.
def process_string(self, s: str):
    for w in self.tokens_to_remove:
        s = s.replace(w, self.replace_token)
    return s
It should use an exact, regex-based whole-word match for removal, instead of removing partial matches inside other strings.
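A whole-word regex version of that method could look like this (a sketch only, written as a free function over the same tokens_to_remove and replace_token values as the snippet above):

```python
import re

def process_string_wordwise(s, tokens_to_remove, replace_token=" "):
    # \b anchors restrict each match to whole words, so removing 'a'
    # no longer touches the 'a' inside 'asked'.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in tokens_to_remove) + r")\b"
    return re.sub(pattern, replace_token, s)

print(process_string_wordwise("he asked a helpful question", ["a", "he"]))
```

The replacement leaves extra spaces behind, which RemoveMultipleSpaces and Strip later in the pipeline would clean up.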
Exactly.
Currently, it removes all occurrences of the given strings, including as substrings of other words. The example in the top-level README doesn't produce the output stated there: it produces [' wesome', ' pple is not per', ''], not ["awesome", "apple is pear", ""].
Changing the order of other operations doesn't change the result.
This is clearly a bug.
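The substring behaviour is easy to reproduce with the str.replace-based implementation quoted earlier in the thread (buggy_remove here is just a standalone stand-in for it):

```python
def buggy_remove(s, words, replace_token=" "):
    # Mirrors the quoted implementation: str.replace deletes every
    # occurrence, including occurrences inside other words.
    for w in words:
        s = s.replace(w, replace_token)
    return s

print(buggy_remove("he asked a helpful question", ["a", "he"]))
# the 'a' in 'asked' and the 'he' in 'helpful' get mangled too
```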
A quick workaround is to use SubstituteWords, calling it with a dictionary that maps each unwanted word to the empty string.
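Sketched in plain Python, that workaround amounts to whole-word substitution with every unwanted word mapped to the empty string (substitute_words is a hypothetical stand-in for the jiwer transform, not its actual implementation):

```python
import re

def substitute_words(s, substitutions):
    # Replace whole words only; words mapped to "" are dropped.
    for word, replacement in substitutions.items():
        s = re.sub(r"\b" + re.escape(word) + r"\b", replacement, s)
    # Collapse the extra spaces left behind by empty replacements.
    return " ".join(s.split())

print(substitute_words("he asked a helpful question", {"a": "", "he": ""}))
# asked helpful question
```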
Thanks for spotting this! Should be fixed in the 2.5.0 release.