rasa Update crf_entity

HI all! I think there's a small typo: probably it should call the method islower() instead of lower(). Thanks!

Proposed changes:

call the method islower() instead of lower(), since it's extracting features (like isdigit(), isupper().. )

Status (please check what you already did):

[ ] added some tests for the functionality
[ ] updated the documentation
[ ] updated the changelog (please check changelog for instructions)
[ ] reformat files using black (please check Readme for instructions)

Jun 09 '22 16:06 leonardofoderaro

All committers have signed the CLA.

Jun 09 '22 16:06 CLAassistant

Hi there! could you kindly tell me if there's anything wrong with my PR? it's a very simple fix, an 'islower()' instead of 'lower()'. If I have to change it (or if it's correct this way) please let me know. Thanks! Leo

Jun 22 '22 16:06 leonardofoderaro

@leonardofoderaro I retracted the approval because although I agree with your initial assessment, it has knock-on consequences for users of the extractor. Up until now, that .LOW feature was a string of lowercased characters. I'm not sure this was the intention, but that's how it's been behaving. If it now becomes a boolean feature (islower), that could change extraction behaviour. I want to further check the history of the current implementation, to make sure it really is a typo, rather than by design.

Jun 23 '22 10:06 indam23

melindaloubser1 thank you for the explanation! Looking at the other features I assumed it should have been a boolean. Please let me know if there's anything else I can do, I'm quite new to these internals and I'd like to understand them better. Thanks!

Jun 23 '22 11:06 leonardofoderaro

@leonardofoderaro I do think your analysis of this is correct and that you actually found a bug in the implementation of the extractor there, or, alternatively, we have a bug in the documentation; see this screenshot from https://rasa.com/docs/rasa/components#crfentityextractor: Screenshot 2022-06-29 at 15 35 41

However, @melindaloubser1 has a point there that changes this may have consequences on running assistants. @amn41 can you chime in here about the risk of fixing this? 🙂

Jun 29 '22 13:06 losterloh

This is definitely a bug and should be fixed 👍

Aug 29 '22 07:08 twerkmeister

@twerkmeister so I can update the branch and resubmit the fix? or it's the documentation to be fixed? Thanks!

Aug 29 '22 10:08 leonardofoderaro

Actually, I was wrong. I did some more investigation and found:

It's been like this since Rasa 1.6 (the oldest branch I could find)
It's done like this in the sklearn-crfsuite tutorial https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html , where they call that feature word identity
It wouldn't make much sense to use .islower() looking at our default feature set for the previous and the next word. Having just is lower, is title and is upper wouldn't give you a lot of information about those words.
Also is title and is upper already give you pretty much the information for is lower given that capitalizations rarely happen within a word, which would be the only thing they wouldn't capture.

I think we should change the documentation instead.

Aug 30 '22 07:08 twerkmeister

closing as superseded by #11628

Oct 07 '22 09:10 twerkmeister

rasa
rasa copied to clipboard

Update crf_entity_extractor.py

rasa rasa copied to clipboard

Update crf_entity_extractor.py

rasa
rasa copied to clipboard