languagetool
Skip soft hyphens and acutes in words
We need to skip some Unicode characters when checking words. For instance, LanguageTool flags words that contain a soft hyphen (Unicode U+00AD, UTF-8 C2 AD) as misspelled. Likewise, a combining acute accent (Unicode U+0301, UTF-8 CC 81) makes a word unknown.
http://www.fileformat.info/info/unicode/char/00ad/index.htm http://www.fileformat.info/info/unicode/char/0301/index.htm
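To make the report concrete, here is a minimal sketch in plain Java (it does not use any LanguageTool API) of why such words fail a dictionary lookup and how removing the two characters restores the match. The class name IgnoredCharsDemo and the method stripIgnored are made up for this illustration.

```java
import java.util.regex.Pattern;

final class IgnoredCharsDemo {
    // Soft hyphen (U+00AD) and combining acute accent (U+0301).
    private static final Pattern IGNORED = Pattern.compile("[\\u00AD\\u0301]");

    /** Returns the word with all ignorable characters removed. */
    static String stripIgnored(String word) {
        return IGNORED.matcher(word).replaceAll("");
    }

    public static void main(String[] args) {
        // "moloko" (Russian for "milk"), written with escapes so the source stays ASCII-only.
        String plain = "\u043C\u043E\u043B\u043E\u043A\u043E";
        // The same word with soft hyphens between the syllables.
        String hyphenated = "\u043C\u043E\u00AD\u043B\u043E\u00AD\u043A\u043E";
        // The same word with a stress mark on the last vowel.
        String stressed = plain + "\u0301";

        System.out.println(plain.equals(hyphenated));               // false
        System.out.println(plain.equals(stripIgnored(hyphenated))); // true
        System.out.println(plain.equals(stripIgnored(stressed)));   // true
    }
}
```

The strings look identical on screen, but an exact lookup sees different character sequences until the two characters are removed.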
Issue looks related to this one: https://github.com/languagetool-org/languagetool/issues/23
A soft hyphen is not a combining character; it is a character that should be skipped. My report concerns Russian words, where accented characters are not part of the alphabet or the orthography. It would be wrong to combine Russian letters with such marks, so I think #23 is related but needs a different solution.
I have a (sort of) solution for Ukrainian: I just remove the characters I should ignore (stress symbol, soft hyphen). You can see how it's done in UkrainianWordTokenizer.java.
Unfortunately there's a problem: when you remove characters like this and an error is detected in the sentence after a removed symbol, the position of the error marker will be offset. There are two potential solutions to this problem:
a) (a hack, but if it works it will be fairly easy to implement) for each removed character, insert some "empty" character at the end of the word (one that is ignored by the LT engine but preserves the word positions)
b) (the right solution, but it requires some significant changes to languagetool-core) take the "token cleanup offset" into account and adjust the error positions when they are shown
Unfortunately I haven't had a chance to research either of those yet, but if somebody can make it work, I think support for "ignored characters" should be moved into common code.
Regards, Andriy
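For illustration, option (b) could look roughly like the sketch below. This is plain Java under my own assumptions, not code from languagetool-core: the class CleanedText and its methods of and toOriginal are hypothetical. The idea is to record, for every character that survives the cleanup, where it sat in the original text, and to translate error positions through that table.

```java
import java.util.Arrays;

final class CleanedText {
    final String cleaned;
    // map[i] is the index in the original text of cleaned character i;
    // the final entry maps the end of the cleaned text to the end of the original.
    private final int[] map;

    private CleanedText(String cleaned, int[] map) {
        this.cleaned = cleaned;
        this.map = map;
    }

    static boolean isIgnored(char c) {
        return c == '\u00AD' || c == '\u0301';
    }

    /** Removes ignored characters and remembers where the kept ones came from. */
    static CleanedText of(String original) {
        StringBuilder sb = new StringBuilder(original.length());
        int[] map = new int[original.length() + 1];
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (!isIgnored(c)) {
                map[sb.length()] = i;
                sb.append(c);
            }
        }
        map[sb.length()] = original.length();
        return new CleanedText(sb.toString(), Arrays.copyOf(map, sb.length() + 1));
    }

    /** Translates an offset in the cleaned text to an offset in the original. */
    int toOriginal(int cleanedOffset) {
        return map[cleanedOffset];
    }

    public static void main(String[] args) {
        String original = "a\u00ADb cd\u0301e";
        CleanedText text = CleanedText.of(original);
        // Suppose a checker reports the span [3, 6) in the cleaned text "ab cde".
        int from = text.toOriginal(3);
        int to = text.toOriginal(6);
        System.out.println(from + ".." + to); // 4..8
    }
}
```

In this sketch an end offset maps to the position of the next kept character, so ignored characters that directly precede the end of a span (for example a stress mark on the last letter of a word) stay inside the highlighted range. Whether that is the desired behaviour is a design choice.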
2014-10-21 8:39 GMT-04:00 Vitaly Lipatov [email protected]:
A soft hyphen is not a combining character; it is a character that should be skipped. My report concerns Russian words, where accented characters are not part of the alphabet or the orthography. It would be wrong to combine Russian letters with such marks, so I think #23 https://github.com/languagetool-org/languagetool/issues/23 is related but needs a different solution.
Actually we already have JLanguageTool.replaceSoftHyphens(), which could maybe be extended? We also have check(AnnotatedText text); maybe the special chars could be considered markup and would then be ignored.
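A rough sketch of the "special chars as markup" idea, in plain Java: split the input into alternating runs of ordinary text and ignorable characters. The Segment class and the split and plainText methods are hypothetical names for this illustration. In LanguageTool itself the runs would presumably be handed to AnnotatedTextBuilder via addText and addMarkup, but that wiring is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

final class IgnoredAsMarkup {
    /** One run of input: either ordinary text or characters to be treated as markup. */
    static final class Segment {
        final boolean markup;
        final String content;

        Segment(boolean markup, String content) {
            this.markup = markup;
            this.content = content;
        }
    }

    static boolean isIgnored(char c) {
        return c == '\u00AD' || c == '\u0301';
    }

    /** Splits the input into alternating text and markup runs. */
    static List<Segment> split(String input) {
        List<Segment> segments = new ArrayList<>();
        int start = 0;
        while (start < input.length()) {
            boolean markup = isIgnored(input.charAt(start));
            int end = start + 1;
            // Extend the run while the following characters are of the same kind.
            while (end < input.length() && isIgnored(input.charAt(end)) == markup) {
                end++;
            }
            segments.add(new Segment(markup, input.substring(start, end)));
            start = end;
        }
        return segments;
    }

    /** Concatenates the text runs only, which is what a checker would see. */
    static String plainText(List<Segment> segments) {
        StringBuilder sb = new StringBuilder();
        for (Segment s : segments) {
            if (!s.markup) {
                sb.append(s.content);
            }
        }
        return sb.toString();
    }
}
```

Because the markup runs keep their original content, the positions in the original text can still be recovered, which is the point of treating the characters as markup instead of deleting them.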
One related issue is that we assume NFKC Unicode normalization. But this need not be the case, and we are not explicit about it anywhere.
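As a side note on normalization, the small check below uses the JDK's java.text.Normalizer to show what NFKC does to the characters discussed here. The class name NormalizationDemo is made up. It suggests that normalization by itself would not remove the stress mark from a Cyrillic vowel or the soft hyphen, so some explicit ignoring would probably still be needed.

```java
import java.text.Normalizer;

final class NormalizationDemo {
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Latin "e" + combining acute: a precomposed character (U+00E9) exists.
        String latin = "e\u0301";
        // Cyrillic "o" + combining acute: no precomposed character exists.
        String cyrillic = "\u043E\u0301";
        // A soft hyphen between two letters.
        String softHyphen = "a\u00ADb";

        System.out.println(nfkc(latin).length());      // 1
        System.out.println(nfkc(cyrillic).length());   // 2
        System.out.println(nfkc(softHyphen).length()); // 3
    }
}
```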
I have adjusted replaceSoftHyphens() to be more flexible, and for Ukrainian I now successfully ignore U+00AD and U+0301. Any Language subclass can call setIgnoredCharactersRegex() to provide characters to ignore. Not sure about Unicode normalization though.
The issues with soft hyphens are solved with annotated text: https://github.com/languagetool-org/languagetool/pull/6932/commits/403b70f1a3ed2f1de1924fe00f6ceea06bac2363
The combining characters are still marked as spelling errors by LanguageTool. But they don't seem to be a frequent issue. They could be normalized in the tagger (not changing the input, but avoiding spelling and tagging problems).