RemoveSpecificWords is not functioning as expected
Hi! As the title says, the RemoveSpecificWords function does not work as I would expect it to. As an example, the following code
import jiwer

text = "he asked a helpful question"
stop_words = ['a', 'he']
print(jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
    jiwer.SentencesToListOfWords(),
    jiwer.RemoveEmptyStrings(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveSpecificWords(stop_words),
])(text))
returns
['', 'sked', '', 'lpful', 'question']
Is there a way to get this function to recognize word boundaries? Thank you!
Change the order of the compose:
text = "he asked a helpful question"
stop_words = ['a', 'he']
print(jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.Strip(),
    jiwer.RemoveSpecificWords(stop_words),
    jiwer.RemoveMultipleSpaces(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveEmptyStrings(),
    jiwer.SentencesToListOfWords(),
])(text))
Also, you might want to use SubstituteWords instead.
I have an issue with 'asked' becoming 'sked' when removing the word 'a' -- the order of operations doesn't change that.
One way to fix it is to split on a word delimiter, filter out the unwanted words, and return the joined string. Let me know if you're open to a PR to fix this.
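That split-filter-join fix could look something like this (a minimal sketch, not the actual jiwer patch; the function name and signature are hypothetical):

```python
def remove_specific_words(s, words_to_remove):
    # Split on whitespace so each token is a whole word, drop the
    # tokens that match a word to remove, and re-join the rest.
    unwanted = set(words_to_remove)
    return " ".join(w for w in s.split() if w not in unwanted)

print(remove_specific_words("he asked a helpful question", ["a", "he"]))
# asked helpful question
```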
Can you give a code example?
Yes, it's the example you gave above.
def process_string(self, s: str):
    for w in self.tokens_to_remove:
        s = s.replace(w, self.replace_token)
    return s
It should use an exact, regex-based whole-word match for removal, instead of removing partial matches inside other strings.
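A whole-word regex version of that method could look like this (a sketch only, written as a free function over the same tokens_to_remove and replace_token values as the snippet above):

```python
import re

def process_string_wordwise(s, tokens_to_remove, replace_token=" "):
    # \b anchors restrict each match to whole words, so removing 'a'
    # no longer touches the 'a' inside 'asked'.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in tokens_to_remove) + r")\b"
    return re.sub(pattern, replace_token, s)

print(process_string_wordwise("he asked a helpful question", ["a", "he"]))
```

The replacement leaves extra spaces behind, which RemoveMultipleSpaces and Strip later in the pipeline would clean up.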
Exactly.
Currently, it removes all occurrences of the given strings, including as substrings of other words. The example in the top-level README doesn't produce the output stated there: it produces [' wesome', ' pple is not per', ''], not ["awesome", "apple is pear", ""].
Changing the order of other operations doesn't change the result.
This is clearly a bug.
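The substring behaviour is easy to reproduce with the str.replace-based implementation quoted earlier in the thread (buggy_remove here is just a standalone stand-in for it):

```python
def buggy_remove(s, words, replace_token=" "):
    # Mirrors the quoted implementation: str.replace deletes every
    # occurrence, including occurrences inside other words.
    for w in words:
        s = s.replace(w, replace_token)
    return s

print(buggy_remove("he asked a helpful question", ["a", "he"]))
# the 'a' in 'asked' and the 'he' in 'helpful' get mangled too
```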
A quick workaround is to use SubstituteWords, calling it with a dictionary that maps each unwanted word to the empty string.
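Sketched in plain Python, that workaround amounts to whole-word substitution with every unwanted word mapped to the empty string (substitute_words is a hypothetical stand-in for the jiwer transform, not its actual implementation):

```python
import re

def substitute_words(s, substitutions):
    # Replace whole words only; words mapped to "" are dropped.
    for word, replacement in substitutions.items():
        s = re.sub(r"\b" + re.escape(word) + r"\b", replacement, s)
    # Collapse the extra spaces left behind by empty replacements.
    return " ".join(s.split())

print(substitute_words("he asked a helpful question", {"a": "", "he": ""}))
# asked helpful question
```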
Thanks for spotting this! Should be fixed in the 2.5.0 release.