HFSTTokenizer chokes on input longer than 550(?) characters

Open reynoldsnlp opened this issue 5 years ago • 2 comments

The interactive shell (accessed using pexpect) appears to limit line length to roughly 550 characters (I'm not really sure about this number). If a longer line is given, bell characters (ASCII code point 7, displayed as ^G in less) are printed to the logfile and pexpect hangs because it receives no output.

reynoldsnlp • Oct 05 '20 11:10

Submitted issue to HFST about this: https://github.com/hfst/hfst/issues/483.

The maximum buffer size appears to be 1024 bytes, so a workaround could check len(bytes(input_str, encoding='utf8')) < 1000 and, for any string that fails the check, use a regular subprocess to process that string instead of the interactive shell. This check shouldn't be too expensive.
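
A minimal sketch of that check, for illustration only. The names fits_interactive_buffer, process, interactive and one_shot are hypothetical and are not taken from udar or HFST; the 1000-byte threshold is the one proposed above, leaving some margin under the apparent 1024-byte buffer.

```python
from typing import Callable

# Assumed threshold from the comment above: stay safely under the
# apparent 1024-byte buffer of the interactive shell.
MAX_INTERACTIVE_BYTES = 1000


def fits_interactive_buffer(input_str: str,
                            limit: int = MAX_INTERACTIVE_BYTES) -> bool:
    """Return True if the UTF-8 encoding of input_str is under limit bytes."""
    return len(bytes(input_str, encoding='utf8')) < limit


def process(input_str: str,
            interactive: Callable[[str], str],
            one_shot: Callable[[str], str]) -> str:
    """Send short input to the interactive shell, long input elsewhere.

    interactive and one_shot are hypothetical callables: the first stands
    in for the pexpect-driven shell, the second for a regular subprocess
    call that handles a single string.
    """
    if fits_interactive_buffer(input_str):
        return interactive(input_str)
    return one_shot(input_str)
```

The check counts bytes rather than characters because the two differ for non-ASCII text: each Cyrillic letter takes two bytes in UTF-8, so a string can fill the buffer with far fewer characters than bytes.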

reynoldsnlp • Oct 05 '20 18:10

Workaround implemented in 765a2afb7d95d83b8bb179efe678fbd68e0d90fa.

reynoldsnlp • Oct 13 '20 23:10