mycroft-core icon indicating copy to clipboard operation
mycroft-core copied to clipboard

Add method to remove any entries from a .voc file from a string.

Open krisgesling opened this issue 2 years ago • 4 comments

Description

The remove_voc method complements the existing vocab matching tools.

It is useful to clean utterances of common articles or other noise when parsing for specific information.

This method is not case sensitive unlike the other vocab matching methods. Is the case sensitivity intentional? Should we make all methods case insensitive?

Context: I was needing this type of method today, then saw the same need had been identified in https://github.com/MycroftAI/mycroft-core/pull/2986 and an existing method to do it already existed in OVOS

How to test

Unit tests included

Contributor license agreement signed?

  • [x] CLA

krisgesling avatar Aug 31 '21 11:08 krisgesling

Voight Kampff Integration Test Succeeded (Results)

devops-mycroft avatar Aug 31 '21 12:08 devops-mycroft

I think I've missed something... where are the other 2 approaches? If there are better options, that's great - let's do that!

I noticed the extraneous spaces thing too but didn't see a reason to remove them. In one way it helps to show that things have been removed, rather than operating on what is now probably at least grammatically incorrect, and potentially has even changed the meaning slightly. This is kind of silly but the best I can come up with at 5am...

cleaned_string = self.remove_voc("john likes the football", "football")
if "john the football" in cleaned_string:
    ...

On the flipside, I could see that extra characters may impact a confidence rating based on keyword frequency etc.

In terms of case sensitivity the docstring says "is not case sensitive" but maybe this should be worded as "is case insensitive" to match the flag.

krisgesling avatar Aug 31 '21 19:08 krisgesling

There were some discussion around the implementation when I tried to point out a bug in #2986. I couldn't find Ken's versions there now but I put up a gist with a collection of the ones I could remember: https://gist.github.com/forslund/c01636a18ddef0a9bef84dae609f4511

The results for 100000 runs of each: str_replace 0.8078192858956754 for_regex 6.423368333023973 single_regex 0.6766295690322295 precalculated_single_regex 0.42126812401693314 list_join 0.4742607999360189 dont_preserve_order 0.4839391199639067

The str_replace() is the implementation chosen in above mentioned PR since it's pretty readable and pretty fast.

for_regex() is basically what is done here and it gets a lot of overhead from the regex compilation. The good thing about of the regex versions is that capitalization of the input string can be kept without extra logic keeping it simple.

list_join() is a really nifty and fast function if I got it right...To handle capitalization it can be changed to

' '.join(w for w in phrase.split() if w.lower() not in words)

The lowering makes it lose about 0.1 seconds in the test case but it's still one of the fastest of the versions above.

Edit: Ken provided the missing version and I have updated with the dont_preserve_order one. Not sure it's ideal since it will rearrange the sentence.

forslund avatar Sep 01 '21 19:09 forslund

Nice comparison - thanks!

I love the simplicity of the list_join comprehension. However I just did some quick testing and there's a few extra pitfalls.

list_join() won't handle multi-word vocab and str_replace() is case-sensitive which I think we should handle. Eg:

remove_voc("Never Forget it", "forget it")  # Never

I've added a "help wanted" flag to see if anyone else has a good suggestion. You can find unit tests in this PR to validate it. Note - I've just removed the spaces that were left in from the original implementation so the unit tests will currently fail.

krisgesling avatar Sep 13 '21 06:09 krisgesling