Non-word characters flagged as typos

Open ErikAnderson3 opened this issue 4 years ago • 3 comments

This concerns the older Hunspell version 1.2.7, as bundled with the memoQ translation memory software.

I am using the large EN-US dictionary and affix files from the SCOWL project: http://wordlist.aspell.net/dicts/

Hunspell is flagging certain non-word characters as typos when they are used singly, with whitespace on either side. Testing so far has identified the following as problematic:

- 　 (double-byte space, U+3000)
- ↑ (upward arrow, U+2191)
- ← (leftward arrow, U+2190)
- @ ("at" mark, U+0040)
- ＠ (double-byte "at" mark, U+FF20)
- ^ (caret, U+005E)
- _ (underscore, U+005F)
- | (vertical bar, U+007C)
- \ (backslash, U+005C)
- ` (backtick, U+0060)

The double-byte characters are not terribly surprising, but other double-byte characters that are not used in English (such as Japanese kana) are not flagged as typos, which is strange.

It is also strange that ↓ (downward arrow, U+2193) and → (rightward arrow, U+2192) are not flagged, while ↑ (upward arrow, U+2191) and ← (leftward arrow, U+2190) are flagged.
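
To make the test easier to repeat, here is a minimal Python sketch that prints each character from this report with its code point, Unicode general category, and official name, alongside a probe line that puts the character between two words with a space on either side. The helper names `describe` and `build_probe_line` and the filler word are my own choices for illustration, not part of Hunspell or memoQ.

```python
import unicodedata

# Characters from this report: the flagged ones, plus the two arrows that pass.
FLAGGED = ["\u3000", "\u2191", "\u2190", "@", "\uff20", "^", "_", "|", "\\", "`"]
NOT_FLAGGED = ["\u2193", "\u2192"]


def describe(ch: str) -> str:
    """Return 'U+XXXX <general category> <Unicode name>' for one character."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}"


def build_probe_line(ch: str) -> str:
    """Place the character on its own between two words, separated by ASCII spaces."""
    return f"word {ch} word"


if __name__ == "__main__":
    for ch in FLAGGED + NOT_FLAGGED:
        print(describe(ch), "|", repr(build_probe_line(ch)))
```

The probe lines can be pasted into a memoQ document as-is. If a standalone `hunspell` binary is available, saving them to a UTF-8 file and running something like `hunspell -d en_US -l` on it should list the tokens that are rejected, though I have not confirmed that the command-line tool behaves the same way as the library bundled in memoQ.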

I have tried various settings in the affix file to get Hunspell to ignore these characters, such as "ICONV ↑ ↓" to convert the problematic characters to known-good ones, but no dice.
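
For reference, the hunspell(5) manual describes ICONV as a table: a header line giving the number of entries, followed by one `ICONV pattern replacement` line per pair. The Python sketch below renders a block in that shape. The helper name `iconv_block` and the arrow mapping are only illustrative, the affix file is assumed to be UTF-8 (declared with `SET UTF-8`), and I have not confirmed that the 1.2.7 build bundled in memoQ honors ICONV at all.

```python
def iconv_block(mapping: dict[str, str]) -> str:
    """Render an ICONV table: a count line, then one line per source/target pair."""
    lines = [f"ICONV {len(mapping)}"]
    for src, dst in mapping.items():
        lines.append(f"ICONV {src} {dst}")
    return "\n".join(lines)


# Illustrative mapping: convert the two flagged arrows to the two that pass.
print(iconv_block({"\u2191": "\u2193", "\u2190": "\u2192"}))
```

The resulting lines would be pasted into the .aff file. The count on the header line has to match the number of entries that follow it.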

Is this a known issue with Hunspell v1.2.7? If so, is there any known workaround? I am not as knowledgeable as I'd like about all the various options possible in the affix files.
Is this perhaps some artifact of how Hunspell v1.2.7 is integrated into memoQ, and should I be reporting to them instead?

Any advice appreciated!

Jul 13 '21 20:07 ErikAnderson3

Bump.

May 19 '22 16:05 ErikAnderson3