pyahocorasick Is case-insensitive string match supported?

Is case-insensitive string match supported?

Open twang18 opened this issue 3 years ago • 13 comments

Hi, thanks for the great work! I am wondering if case-insensitive string match is supported. For example, when there is "information system" in the built Trie, and it can match "Information system", "Information System" and "INFORMATION SYSTEM" as well?

Nov 05 '20 08:11 twang18

To support case-insensitive matching you should insert lowercase words into the trie and then lowercase your input before you search for it.

Nov 06 '20 02:11 Dobatymo

Thanks for the suggestion! This could be one solution.

Nov 06 '20 02:11 twang18

There's no case insensitive search option and TBH I don't like to add this to the core library. What I may suggest as an enhancement to the library is to provide an extra python wrapper class that would keep the interface, but do lowercase conversion underneath.

Dec 23 '20 11:12 WojciechMula

IMHO case insensitive is something best done outside (that's the way I do this)

Jan 31 '21 09:01 pombredanne

https://github.com/nppoly/cyac#unicode Wrong offset may returns when convert to lowercase outside.

Mar 24 '21 08:03 zhu

@zhu Thanks, that's important. So, a safe way would be to use UCS-32 (4 bytes per code point). Which is not very memory-usage friendly.

Mar 24 '21 15:03 WojciechMula

@zhu Thanks, that's important. So, a safe way would be to use UCS-32 (4 bytes per code point). Which is not very memory-usage friendly.

I think UCS-32 cannot solve it. u"İ".lower() generate 2 unicode code point. And u"İ".lower().upper() != u"İ". There's some special casing: https://stackoverflow.com/questions/23524231/lower-case-of-turkish-character-dotted-i http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

casing incurs a change in string length or is dependent on context or locale

Mar 26 '21 01:03 zhu

@zhu Thank a lot. It is more than strange. So there is no perfect solution.

Mar 27 '21 19:03 WojciechMula

I actually also experienced this same rare issue (not specific to pyahocorasick): https://github.com/nexB/scancode-toolkit/blob/f3ef5b4ad823577e507d673a7fbc65d5efe4f6af/src/licensedcode/match.py#L1432

NOTE: we have a rare Unicode bug/issue because of some Unicode codepoint such as some Turkish characters that decompose char + punct when casefolded. See https://github.com/nexB/scancode-toolkit/issues/1872 See also: https://bugs.python.org/issue34723#msg359514

This is eventually a well known problem known as "The Turkish-I Problem" See:

https://blog.codinghorror.com/whats-wrong-with-turkey/
https://docs.microsoft.com/en-us/previous-versions/dotnet/articles/ms973919(v=msdn.10)?redirectedfrom=MSDN#the-motivation-the-turkish-i-problem
https://en.wikipedia.org/wiki/Dotted_and_dotless_I#In_computing

The way I solved it on my side is to check the length does not change when folding case, and if so do something special. This slows down everything of course. Even more so in my case because I split on punctuation before I lowercase (which makes me think that I could simplify my case by lower after my split ...)

See also https://bugs.python.org/issue34723#msg359514 and in particular these messages:

I know it is not finalized and released yet but are you going to implement Version 14.0.0 of the Unicode Standard? It finally solves the issue of Turkish lower/upper case 'I' and 'i'.

Here is the document

We don't update the unicodedata database in patch releases because updates are backwards incompatible. Python 3.9 will ship with 13.0. Python 3.10 is going to ship with 14.0.

IMHO we should wait for Python 3.10

Mar 28 '21 18:03 pombredanne

Note that the rationale for my suggestion to wait is that Python is implementing Unicode and this is a bug in Unicode fixed in Unicode 14+ by introducing a new lower case dotted i character for Turkish that will have a length of one. Python 3.10 will implement Unicode 14

It could be fixed here too of course, but it would only hold if you make an assumption that we are processing unicode strings, which may not be always true.

Mar 29 '21 08:03 pombredanne

Note that the rationale for my suggestion to wait is that Python is implementing Unicode and this is a bug in Unicode fixed in Unicode 14+ by introducing a new lower case dotted i character for Turkish that will have a length of one. Python 3.10 will implement Unicode 14

It could be fixed here too of course, but it would only hold if you make an assumption that we are processing unicode strings, which may not be always true.

German 'ß'.lower() == 'ß', 'ß'.upper() == 'SS', 'ẞ'.lower() == 'ß', 'ß'.upper().lower() == 'ss', 'ß'.casefold() == 'ss'. http://www.personal.psu.edu/ejp10/blogs/gotunicode/2008/07/a-new-german-unicode-letter-ca.html Seems the scancode-toolkit cannot handle this case.

BTW: Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. Unicode normalization also may change the text length. It has similar issues.

It'd better find a library which can remember the char offset when normalizing and casefolding text.

Mar 30 '21 01:03 zhu

Thank you @pombredanne for the explanation. I always thought that Unicode is just for assigning the numbers to characters. :)

Apr 07 '21 20:04 WojciechMula

So to the point of @WojciechMula I would rather keep this in a separate add-on and not in the C core. @zhu we now support bytes and unicode, so there should enough to do something with all the detailed gathered here. Would you be willing to help to craft this?

Jan 14 '23 15:01 pombredanne

pyahocorasick pyahocorasick copied to clipboard

Is case-insensitive string match supported?

pyahocorasick
pyahocorasick copied to clipboard