Skip soft hyphens and acutes in words

Open vitlav opened this issue 10 years ago • 7 comments

We need to skip some Unicode characters when checking words. For instance, LanguageTool flags words containing a soft hyphen (Unicode U+00AD, UTF-8 C2 AD) as misspelled. Likewise, a combining acute accent (Unicode U+0301, UTF-8 CC 81) makes a word unknown.

http://www.fileformat.info/info/unicode/char/00ad/index.htm http://www.fileformat.info/info/unicode/char/0301/index.htm

vitlav avatar Oct 21 '14 09:10 vitlav

Issue looks related to this one: https://github.com/languagetool-org/languagetool/issues/23

dpelle avatar Oct 21 '14 09:10 dpelle

A soft hyphen is not a combining character; it is a character that should be skipped. My report concerned Russian words, where accented characters are not part of the alphabet or the orthography. It would be wrong to combine Russian characters with such marks, so I think #23 is related but needs to be resolved in a different way.

vitlav avatar Oct 21 '14 12:10 vitlav

I have a (sort of) solution for Ukrainian: I just remove the characters that should be ignored (stress mark, soft hyphen). You can see how it's done in UkrainianWordTokenizer.java. Unfortunately there's a problem: when characters are removed like this and an error is detected later in the sentence (after a removed character), the position of the error marker will be offset. There are two potential solutions to this problem:

a) (a hack, but if possible fairly easy to implement) for each removed character, insert some "empty" character at the end of the word, one that is ignored by the LT engine but preserves the word positions;
b) (the right solution, but it requires significant changes to languagetool-core) take the "token cleanup offset" into account and adjust the error positions when they are shown.

Unfortunately I haven't had a chance to research either of those yet, but if somebody can make it work, I think support for "ignored characters" should be moved into common code.
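Option (b) can be illustrated with a minimal, standalone sketch. It is not taken from languagetool-core; the class name `OffsetPreservingCleaner` and its methods are hypothetical. It removes U+00AD and U+0301 while recording, for every kept character, its index in the original text, so that an error span found in the cleaned text can be mapped back to the original.

```java
import java.util.Arrays;

/** Hypothetical helper (not LanguageTool API): strips ignored characters and keeps an offset map. */
final class OffsetPreservingCleaner {
    private static final char SOFT_HYPHEN = 0x00AD;
    private static final char COMBINING_ACUTE = 0x0301;

    private final String cleaned;
    // originalIndex[i] is the position in the original text of cleaned.charAt(i);
    // the last entry maps the end of the cleaned text to the end of the original.
    private final int[] originalIndex;

    OffsetPreservingCleaner(String original) {
        StringBuilder sb = new StringBuilder(original.length());
        int[] map = new int[original.length() + 1];
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (c == SOFT_HYPHEN || c == COMBINING_ACUTE) {
                continue; // ignored character: drop it, record nothing
            }
            map[sb.length()] = i;
            sb.append(c);
        }
        map[sb.length()] = original.length(); // end-of-text sentinel
        this.cleaned = sb.toString();
        this.originalIndex = Arrays.copyOf(map, sb.length() + 1);
    }

    String cleaned() {
        return cleaned;
    }

    /** Maps an offset in the cleaned text back to the corresponding offset in the original text. */
    int toOriginal(int cleanedOffset) {
        return originalIndex[cleanedOffset];
    }
}
```

For the input `"co\u00ADop teh"` the cleaned text is `coop teh`; an error reported on `teh` at cleaned offsets 5 to 8 maps back to original offsets 6 to 9.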

Regards, Andriy

arysin avatar Oct 21 '14 13:10 arysin

Actually, we already have JLanguageTool.replaceSoftHyphens(), which could maybe be extended? We also have check(AnnotatedText text); maybe the special chars could be treated as markup and would then be ignored.

danielnaber avatar Oct 21 '14 14:10 danielnaber

One related issue is that we assume NFKC Unicode normalization. But this need not be the case, and we are not explicit about it anywhere.
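A small sketch with the JDK's `java.text.Normalizer` shows why the normalization form matters here, and why normalization does not resolve this issue on its own. The class and method names are hypothetical; only `Normalizer.normalize` and `Normalizer.Form.NFKC` are real JDK API.

```java
import java.text.Normalizer;

/** Hypothetical demo class: applies NFKC normalization to a string. */
final class NormalizationDemo {

    /** Returns the NFKC form: compatibility decomposition followed by canonical composition. */
    static String toNfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }
}
```

`toNfkc("e\u0301")` yields the single precomposed character U+00E9, whereas `toNfkc("\u0430\u0301")` (Cyrillic а plus combining acute) stays two characters because Unicode defines no precomposed form for it. The soft hyphen in `"a\u00ADb"` is also left in place, so both characters from this issue survive normalization in Russian or Ukrainian text.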

milekpl avatar Mar 03 '15 11:03 milekpl

I have adjusted replaceSoftHyphens() to be more flexible, and for Ukrainian I now successfully ignore U+00AD and U+0301. Any Language subclass can call setIgnoredCharactersRegex() to provide the characters to ignore. Not sure about Unicode normalization though.
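The idea of a regex of ignored characters can be sketched on its own as follows. This does not reproduce LanguageTool's code, and the exact signature of `setIgnoredCharactersRegex()` is not shown in this thread; the class `IgnoredCharacters` and the pattern below are assumptions for illustration.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not LanguageTool API): removes characters matched by an "ignored" regex. */
final class IgnoredCharacters {

    // Assumed pattern: soft hyphen (U+00AD) and combining acute accent (U+0301).
    private static final Pattern IGNORED = Pattern.compile("[\\u00AD\\u0301]");

    /** Returns the text with every ignored character removed. */
    static String strip(String text) {
        return IGNORED.matcher(text).replaceAll("");
    }
}
```

`IgnoredCharacters.strip("co\u00ADop")` returns `coop`. Note that the result is shorter than the input, so positions computed on the stripped text no longer line up with the original.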

arysin avatar Mar 12 '15 02:03 arysin

The issues with soft hyphens are solved with annotated text: https://github.com/languagetool-org/languagetool/pull/6932/commits/403b70f1a3ed2f1de1924fe00f6ceea06bac2363

The combining characters are still marked as spelling errors by LanguageTool. But they don't seem to be a frequent issue. They could be normalized in the tagger (not changing the input, but avoiding spelling and tagging problems).

jaumeortola avatar Jul 25 '22 11:07 jaumeortola