Auto-CORPus icon indicating copy to clipboard operation
Auto-CORPus copied to clipboard

Dead code/bug in `Abbreviations.__conditions`

Open alexdewar opened this issue 9 months ago • 1 comments

The Abbreviations class has a method which looks like this:

    def __conditions(self, candidate):
        r"""Based on Schwartz&Hearst.

        2 <= len(str) <= 10
        len(tokens) <= 2
        re.search(r'\p{L}', str)
        str[0].isalnum()

        and extra:
        if it matches (\p{L}\.?\s?){2,}
        it is a good candidate.

        :param candidate: candidate abbreviation
        :return: True if this is a good candidate
        """
        viable = True
        if re2.match(r"(\p{L}\.?\s?){2,}", candidate.lstrip()):
            viable = True
        if len(candidate) < 2 or len(candidate) > 10:
            viable = False
        if len(candidate.split()) > 2:
            viable = False
        if candidate.islower():  # customize function discard all lower case candidate
            viable = False
        if not re2.search(r"\p{L}", candidate):  # \p{L} = All Unicode letter
            viable = False
        if not candidate[0].isalnum():
            viable = False

        return viable

Aside from the odd style (it would be better to have a return False after each if statement, because if a check fails there's no point in doing the others), the first if statement doesn't actually do anything: if the regex matches then viable is set to True, but viable is already True. It doesn't set viable to False if there's no match, meaning the check doesn't actually do anything. It looks like the code's had this problem since it was first introduced in b9b2a0c4.

It does look like most of the checks that that regex is supposed to do are implemented below it though (e.g. that it's at least two chars long), so optimistically it's maybe just redundant rather than a bug. It would be good to know exactly what the function is supposed to do so that we can make sure it's doing this (we could even write tests!). Do we want to copy Schwartz & Hearst's method exactly or do something else?

I'm also not sure about this line:

        if not re2.search(r"\p{L}", candidate):  # \p{L} = All Unicode letter

My regex knowledge isn't great, but surely this would just match any non-empty string? If so, this is redundant too.

alexdewar avatar Mar 14 '25 15:03 alexdewar

Tentatively tagging this as bug.

alexdewar avatar Mar 14 '25 16:03 alexdewar