scispacy

Unexpected abbreviation detection behaviour

Open mpetruc opened this issue 2 years ago • 3 comments

Let me start by thanking you for developing and putting out this tremendously helpful framework for biomedical text processing. I (and, I'm sure, many others) am deeply grateful for all your effort and creativity.

I've been experimenting with the abbreviation_detector module, which has been working great, until I found this situation. Given the sentence:

"The thyroid hormone receptor (TR) inhibiting retinoic malate receptor (RMR) isoforms mediate ligand-independent repression."

abbreviation_detector finds the following abbreviations:

Abbreviation    Definition
TR              (5, 6) thyroid hormone receptor
RMR             (12, 13) retinoic malate receptor
receptor        (3, 4) receptor (RMR
receptor        (10, 11) receptor (RMR

So the word "receptor" is incorrectly identified as an abbreviation. This happens only if there is exactly one word between "(TR)" and "retinoic". If another token (word, space) is introduced before OR after the separating word (in this case, "inhibiting"), abbreviation_detector works correctly and identifies only the 2 abbreviations (TR and RMR).
From my perspective this is totally unexpected. Could this be a bug in the algorithm? Or maybe something I'm doing wrong? Thanks a lot, m

mpetruc Jul 19 '22 04:07

This appears to be an unfortunate (but fixable) edge case for the algorithm. Basically, it's matching the opening paren before TR against the closing paren after RMR and taking everything in between as a candidate long form; then "receptor" before (TR) happens to be an acceptable short form for the long form "receptor (RMR". Any longer distance between the two parens would have been filtered out, and if "receptor" didn't happen to match the other "receptor", it also wouldn't have gotten through. This was right on the edge. We should probably check that the parens inside the candidate aren't unbalanced.
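
A check like the one suggested above could look roughly like the following. This is only a minimal sketch, not scispacy's actual code: `has_balanced_parens` is a hypothetical helper name, and it assumes the candidate long form is available as a plain list of token strings rather than a spaCy span.

```python
from typing import Sequence


def has_balanced_parens(tokens: Sequence[str]) -> bool:
    """Return True if every "(" in tokens is closed by a later ")" and vice versa.

    Hypothetical helper for illustration; not part of scispacy.
    """
    depth = 0
    for token in tokens:
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
            if depth < 0:
                # A ")" appeared before any matching "(".
                return False
    # Any "(" still open at the end means the candidate is unbalanced.
    return depth == 0


# Candidate long forms written as token lists (assumed tokenization).
# The first mirrors the spurious long form reported in this issue.
bad_candidate = ["receptor", "(", "RMR"]
good_candidate = ["retinoic", "malate", "receptor"]

print(has_balanced_parens(bad_candidate))
print(has_balanced_parens(good_candidate))
```

A detector could discard any candidate for which this returns False, which would reject "receptor (RMR" while leaving "retinoic malate receptor" alone. Where such a filter would sit in the abbreviation detector is not stated in this thread.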

dakinggg Jul 19 '22 05:07

Thank you so much for the quick and thoughtful response. Is there anything I can do at this point to help? Filing a bug, maybe?

mpetruc Jul 19 '22 13:07

This serves as the bug, thanks! I'd be happy to review a PR fixing it if you wanted to open one. Basically, the abbreviation detector should not pair an opening parenthesis with a closing parenthesis that does not belong to it.

dakinggg Aug 12 '22 05:08