pyahocorasick icon indicating copy to clipboard operation
pyahocorasick copied to clipboard

End position out after supplementary plane character in Windows

Open woakesd opened this issue 7 years ago • 16 comments

This is an issue with Python under windows only. The attached sample illustrates the problem.

I presume this is because Windows python builds are narrow builds (16 bits per character and supplementary plane characters are stored as a supplementary pair).

I can add the 🙈 as a word and I get a match for it, though end is 9!

issue.txt

woakesd avatar Jan 10 '17 14:01 woakesd

Could you please attach output from the program, on my Linux machine it prints:

$ ~/github/pyahocorasick$ python3 bug_53.py 
(6, 'punched') punched
(16, 'punched') punched

WojciechMula avatar Jan 15 '17 10:01 WojciechMula

I get the following:

PS C:\Projects\pyahocorasick> py -3 .\bug_53.py (6, 'punched') punched (17, 'punched') unched?

I'm pretty sure this is because in windows the string passed in is two byte characters, and the higher plane characters are sent as complementary pairs (two two byte characters).

woakesd avatar Jan 16 '17 08:01 woakesd

@woakesd I will check it tomorrow, when download MSVC build toolchain. I have just MSVC 2008, Py3k needs the newest one.

WojciechMula avatar Jan 17 '17 18:01 WojciechMula

You can use VS2015 community edition.

Good luck.

On 17 Jan 2017, at 18:31, Wojciech MuÅ‚a [email protected] wrote:

@woakesd I will check it tomorrow, when download MSVC build toolchain. I have just MSVC 2008, Py3k needs the newest one.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

woakesd avatar Jan 17 '17 19:01 woakesd

@woakesd OK, I've just reproduced the error, now need to work out a solution.

WojciechMula avatar Jan 18 '17 14:01 WojciechMula

@woakesd I pushed the fix to the master. Could you please check it?

WojciechMula avatar Jan 20 '17 07:01 WojciechMula

Tested this and the simple test passes, which is good.

But on a larger test with 50000 strings in the Automaton I get ValueError exceptions thrown randomly when searching a series of messages. I'm repeating the test and it fails on a different message each time.

Example:

Malformed UCS-2 string: expected a high surrogate at 1662, got dd00 (the position and code change on each run).

I'm struggling to produce a simple test which demonstrates this though. Grr.

Initially I was suspicious of my data but after running it repeatedly it fails in different places.

woakesd avatar Jan 20 '17 13:01 woakesd

I don't really want to post my test data here, can I send them to you privately?

woakesd avatar Jan 20 '17 13:01 woakesd

Thank you very much for testing! Please send your test at [email protected] Nothing would leak, don't worry.

WojciechMula avatar Jan 20 '17 13:01 WojciechMula

Thank you! Will check out the changes later tonight or tomorrow perhaps. It may need to wait till Monday.

woakesd avatar Jan 21 '17 19:01 woakesd

This is fixed as far as I am concerned so I am closing it.

woakesd avatar Feb 17 '17 11:02 woakesd

I'm putting together another regression test for this as I found an issue which I think may be related

woakesd avatar Feb 24 '17 14:02 woakesd

PS I found a work around which is to add one space for each character in the supplementary plane.

haystack += ' ' * len([c for c in haystack if ord(c) > 65535])

woakesd avatar Feb 27 '17 08:02 woakesd

Sorry for the problem, will check it soon.

WojciechMula avatar Mar 02 '17 09:03 WojciechMula

@woakesd Is this still an issue? I think this was fixed, but I am not 100% sure... would you be kind enough to check on the latest checkout?

pombredanne avatar Feb 20 '22 11:02 pombredanne

I think not, but I’ll need to check.

David

📞 0131 332 7571 📱 07966 444 615 📧 @.*** 💻 www.woakes.net

On 20 Feb 2022, at 11:09, Philippe Ombredanne @.***> wrote:

 @woakesd Is this still an issue? I think this was fixed, but I am not 100% sure... would you be kind enough to check on the latest checkout?

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.

woakesd avatar Feb 20 '22 12:02 woakesd

@woakesd I am closing this for now: we have windows tests that pass alright. Please reopen if this is still an issue

pombredanne avatar Jan 14 '23 15:01 pombredanne