Accurate version of the iterator returns results that differ from ultimate/fast on very short texts

Open dchaplinsky opened this issue 1 year ago • 26 comments

Hi Jarek!

I'm currently working on a project called choppa, which is a partial Python port of your great library. My intention is to bring the sentence tokenization found in LanguageTool to the Python world.

To make my life a little bit easier, I decided to implement only the accurate iterator and the SAX parser for now. I successfully ported the code and tests and got them working (despite the lack of a Matcher class in Python regexes and the general differences in regex syntax between Python and Java). Then I started porting the LanguageTool tests for the Ukrainian language. Most of them worked, except for a few. I literally banged my head against the wall for a couple of days (you can see it from the commit messages).
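For readers unfamiliar with the Matcher gap mentioned above: Java's `java.util.regex.Matcher` keeps its own search position between `find()` calls, while Python's `re` returns independent match objects. The following is a minimal sketch of one way to emulate that, not the code used in choppa; `SimpleMatcher` and its methods are hypothetical names, and it assumes that `Pattern.search(text, pos, endpos)` is a close enough stand-in for a Matcher region.

```python
import re


class SimpleMatcher:
    """Hypothetical, minimal stand-in for a stateful java.util.regex.Matcher."""

    def __init__(self, pattern, text):
        self.pattern = pattern      # a compiled re.Pattern
        self.text = text
        self._pos = 0               # where the next find() starts searching
        self._end = len(text)       # exclusive upper bound of the region
        self._match = None

    def region(self, start, end):
        """Restrict subsequent searches to text[start:end] without slicing."""
        self._pos, self._end = start, end
        self._match = None
        return self

    def find(self):
        """Advance to the next match; return False when there is none."""
        if self._pos > self._end:
            self._match = None
            return False
        self._match = self.pattern.search(self.text, self._pos, self._end)
        if self._match is None:
            return False
        # Step past an empty match so repeated find() calls always terminate.
        empty = self._match.end() == self._match.start()
        self._pos = self._match.end() + (1 if empty else 0)
        return True

    def start(self):
        return self._match.start()

    def end(self):
        return self._match.end()


# Usage: collect the end offsets of every "period + whitespace" occurrence.
matcher = SimpleMatcher(re.compile(r"\.\s"), "Н. В. , Б. С.")
ends = []
while matcher.find():
    ends.append(matcher.end())
print(ends)
```

Passing `pos` and `endpos` instead of slicing keeps all offsets relative to the full text, which is convenient when break and exception rules need to be compared at the same position.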

Then I decided to compile segment itself and run the tests using the SRX file found in the LanguageTool distribution.

And boom. With the fast or ultimate algorithm, it works flawlessly. But with accurate, it fails in the same way as my code.

$ echo "Алисов Н. В. , Хореев Б. С." | ./segment -a accurate -s ~/Projects/choppa/data/srx/segment_new.srx -l uk_two -r
Алисов Н. В.
, Хореев Б. С.

$ echo "Алисов Н. В. , Хореев Б. С." | ./segment -a fast -s ~/Projects/choppa/data/srx/segment_new.srx -l uk_two -r
Алисов Н. В. , Хореев Б. С.

$ echo "М. Л. Гончарука, I. О. Денисюка" | ./segment -a fast -s ~/Projects/choppa/data/srx/segment_new.srx -l uk_two -r
М. Л. Гончарука, I. О. Денисюка

$ echo "М. Л. Гончарука, I. О. Денисюка" | ./segment -a accurate -s ~/Projects/choppa/data/srx/segment_new.srx -l uk_two -r
М. Л. Гончарука, I. О.
Денисюка

On the one hand, I'm happy that my implementation is correct after all. On the other hand, I'm not, because I need to implement either fast or ultimate to make it work, and there is presumably a bug in the segment library that is not covered by its tests.

P.S. I cannot express how grateful I am for the library you wrote and the quality of its code. Thank you very much for your hard work!

dchaplinsky, Aug 16 '22 20:08