RemoveSpecificWords is not functioning as expected

Open lisalipani opened this issue 5 years ago • 7 comments

Hi! As the title says, the RemoveSpecificWords function does not work as I would expect it to. As an example, the following code

import jiwer

text = "he asked a helpful question"
stop_words = ['a', 'he']
print(jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
    jiwer.SentencesToListOfWords(),
    jiwer.RemoveEmptyStrings(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveSpecificWords(stop_words),
])(text))

returns

['', 'sked', '', 'lpful', 'question']

Is there a way to get this function to recognize word boundaries? Thank you!

lisalipani avatar Jul 29 '20 19:07 lisalipani

Change the order of the transformations in the Compose:

import jiwer

text = "he asked a helpful question"
stop_words = ['a', 'he']
print(jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.Strip(),
    jiwer.RemoveSpecificWords(stop_words),
    jiwer.RemoveMultipleSpaces(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveEmptyStrings(),
    jiwer.SentencesToListOfWords(),
])(text))

nikvaessen avatar Jul 30 '20 07:07 nikvaessen

Also, you might want to use SubstituteWords instead.

nikvaessen avatar Jul 30 '20 07:07 nikvaessen

I have an issue with 'asked' becoming 'sked' when removing the word 'a'; changing the order of operations doesn't fix that. One way to fix it is to split on a word delimiter, filter out the unwanted words, and return the joined string. Let me know if you're open to a PR to fix this.
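
A minimal sketch of that split, filter, and join idea, written as a standalone function rather than as jiwer's actual implementation. The name `remove_whole_words` and its parameters are hypothetical, and it assumes words are separated by whitespace:

```python
from typing import Iterable


def remove_whole_words(sentence: str, words_to_remove: Iterable[str]) -> str:
    """Drop whole words from a sentence; substrings of other words are kept."""
    blocked = set(words_to_remove)
    # str.split() with no argument splits on any run of whitespace.
    kept = [word for word in sentence.split() if word not in blocked]
    return " ".join(kept)


print(remove_whole_words("he asked a helpful question", ["a", "he"]))
# prints: asked helpful question
```

As a side effect, this approach also collapses repeated spaces, since the sentence is rebuilt with single-space separators.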

elgeish avatar Mar 14 '21 07:03 elgeish

Can you give a code example?

nikvaessen avatar Mar 16 '21 06:03 nikvaessen

Yes, it's the example you gave above.

elgeish avatar Mar 16 '21 07:03 elgeish

def process_string(self, s: str):
    for w in self.tokens_to_remove:
        s = s.replace(w, self.replace_token)
    return s

It should use a regex that matches whole words exactly, instead of removing parts of other strings.
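
One way to get exact-word matching is a regular expression with `\b` word-boundary anchors. The sketch below is only an illustration under that assumption, not necessarily the fix that was eventually shipped; `remove_words_regex` and its `replace_token` default are hypothetical names chosen to mirror the snippet above:

```python
import re
from typing import Iterable


def remove_words_regex(s: str, words: Iterable[str], replace_token: str = "") -> str:
    """Replace only whole-word occurrences of each word in `words`."""
    for w in words:
        # re.escape guards against regex metacharacters in the word;
        # \b matches only at a boundary between a word and a non-word character.
        s = re.sub(r"\b" + re.escape(w) + r"\b", replace_token, s)
    return s


words = ["he", "asked", "a", "helpful", "question"]
print([remove_words_regex(w, ["a", "he"]) for w in words])
# prints: ['', 'asked', '', 'helpful', 'question']
```

Note that `\b` treats hyphens and apostrophes as boundaries, so 'a' would still be removed from a token such as 'a-b'.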

qingjing1018 avatar Jun 07 '21 18:06 qingjing1018

Exactly. Currently, it removes all occurrences of the given strings, INCLUDING AS SUBSTRINGS of other words. The example given in the top README doesn't produce the output stated there: it produces [' wesome', ' pple is not per', ''], not ["awesome", "apple is pear", ""]. Changing the order of the other operations doesn't change the result. This is clearly a bug. A quick workaround is to use SubstituteWords, calling it with a dictionary that maps each unwanted word to an empty string.

akpeker avatar Dec 14 '21 15:12 akpeker

Thanks for spotting this! Should be fixed in the 2.5.0 release.

nikvaessen avatar Sep 03 '22 09:09 nikvaessen