Dead code/bug in `Abbreviations.__conditions`
The Abbreviations class has a method which looks like this:
def __conditions(self, candidate):
r"""Based on Schwartz&Hearst.
2 <= len(str) <= 10
len(tokens) <= 2
re.search(r'\p{L}', str)
str[0].isalnum()
and extra:
if it matches (\p{L}\.?\s?){2,}
it is a good candidate.
:param candidate: candidate abbreviation
:return: True if this is a good candidate
"""
viable = True
if re2.match(r"(\p{L}\.?\s?){2,}", candidate.lstrip()):
viable = True
if len(candidate) < 2 or len(candidate) > 10:
viable = False
if len(candidate.split()) > 2:
viable = False
if candidate.islower(): # customize function discard all lower case candidate
viable = False
if not re2.search(r"\p{L}", candidate): # \p{L} = All Unicode letter
viable = False
if not candidate[0].isalnum():
viable = False
return viable
Aside from the odd style (it would be better to have a return False after each if statement, because if a check fails there's no point in doing the others), the first if statement doesn't actually do anything: if the regex matches then viable is set to True, but viable is already True. It doesn't set viable to False if there's no match, meaning the check doesn't actually do anything. It looks like the code's had this problem since it was first introduced in b9b2a0c4.
It does look like most of the checks that that regex is supposed to do are implemented below it though (e.g. that it's at least two chars long), so optimistically it's maybe just redundant rather than a bug. It would be good to know exactly what the function is supposed to do so that we can make sure it's doing this (we could even write tests!). Do we want to copy Schwartz & Hearst's method exactly or do something else?
I'm also not sure about this line:
if not re2.search(r"\p{L}", candidate): # \p{L} = All Unicode letter
My regex knowledge isn't great, but surely this would just match any non-empty string? If so, this is redundant too.
Tentatively tagging this as bug.