omegat Highlight normalized

Pull request type

[x] Bug fix
[ ] Feature
[ ] Documentation
[ ] Build and release changes
[ ] Other (describe below)

Which ticket is resolved?

From the discussion list: https://sourceforge.net/p/omegat/mailman/message/37682088/ (no ticket opened)

What does this PR change?

The problem mentioned in the discussion comes from the "Full/Half width insensitive" option: when it is enabled, the string is normalized, and if this normalization changes the length of the string, highlighting is disabled.
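The length change can be reproduced with a minimal sketch. It assumes that width normalization is equivalent to Unicode NFKC; the class name `WidthNormalizationDemo` and the helper `normalizeWidth` are hypothetical names chosen for illustration, not necessarily the ones OmegaT uses. The string is the sample phrase from the discussion, written with Unicode escapes so that the file compiles under any source encoding. U+2103 (the degree Celsius sign) expands into the two characters U+00B0 and C, so everything after it shifts by one position.

```java
import java.text.Normalizer;

public class WidthNormalizationDemo {
    // Assumption: width-insensitive search normalizes text with NFKC.
    static String normalizeWidth(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Sample phrase from the discussion; \u2103 is DEGREE CELSIUS.
        String original = "\u4ECA\u306E\u6C17\u6E29\u306F20\u2103\u3067\u3059\u3002";
        String normalized = normalizeWidth(original);
        String desu = "\u3067\u3059"; // the text located after the Celsius sign

        // The normalized string is one character longer.
        System.out.println(original.length() + " -> " + normalized.length());
        // Text located after the expanded character is shifted by one.
        System.out.println(original.indexOf(desu) + " -> " + normalized.indexOf(desu));
    }
}
```

An offset found in the normalized string therefore cannot be applied directly to the original string once an expanded character precedes the match.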

Take the sample phrase 今の気温は20℃です。, where we know that ℃ is the character that causes the problem. For the solution I distinguish three cases:

1. a string before the normalized characters, like 気温
2. a string after the normalized characters, like です
3. a search containing the normalized characters, for example if you search "20°C"

The first commit solves points 1 and 2: characters not affected by normalization may have been shifted, but nothing more. The second commit solves point 3, which is more complicated: the found text is normalized, and we have no method to "un-normalize" it in order to find its equivalent in the original string. So I search character by character until I find a piece of text whose normalized form matches the searched text.
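The character-by-character idea can be sketched as follows. This is an illustration under stated assumptions, not the code of the pull request: it assumes NFKC as the normalization, and `NormalizedMatchFinder`, `normalizeWidth` and `findOriginalSpan` are hypothetical names. It tries every span of the original string and returns the first one whose normalized form equals the normalized query, which also shows why the approach may be slow on long segments.

```java
import java.text.Normalizer;

public class NormalizedMatchFinder {
    // Assumption: width-insensitive search normalizes text with NFKC.
    static String normalizeWidth(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    /**
     * Returns {start, end} (end exclusive) of the first span of the original
     * string whose normalized form equals the normalized query, or null.
     */
    static int[] findOriginalSpan(String original, String query) {
        String normalizedQuery = normalizeWidth(query);
        for (int start = 0; start < original.length(); start++) {
            // Grow the candidate one character at a time and re-normalize it.
            for (int end = start + 1; end <= original.length(); end++) {
                String candidate = normalizeWidth(original.substring(start, end));
                if (candidate.equals(normalizedQuery)) {
                    return new int[] {start, end};
                }
            }
        }
        return null; // no span of the original normalizes to the query
    }

    public static void main(String[] args) {
        // Sample phrase from the discussion; \u2103 is DEGREE CELSIUS.
        String original = "\u4ECA\u306E\u6C17\u6E29\u306F20\u2103\u3067\u3059\u3002";
        int[] span = findOriginalSpan(original, "20\u00B0C");
        System.out.println(span[0] + ".." + span[1]);
    }
}
```

With the sample phrase, a query of "20" followed by U+00B0 and C maps back to the three original characters at offsets 5 to 8. A query that covers only part of an expanded character, such as C alone, has no exact counterpart in the original string, and this sketch returns null in that case.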

Aug 01 '22 15:08 t-cordonnier

This behavior was intentionally designed in 2016 and has been in place since version 3.6.0: https://github.com/omegat-org/omegat/commit/6d3b0ddc1a31f0b987e0580c52512ef00a890627

The commit message reads:

Give up on highlighting width-normalized search results.

Not only the offsets of the match, but even the length of the match itself can change due to width normalization, so we just won't try.

git-svn-id: svn+ssh://svn.code.sf.net/p/omegat/svn/trunk@8160 b0d8beef-cb45-0410-a8e4-c0d495c3b779

Aug 04 '22 22:08 miurahr

Hi Hiroshi

Thanks for this info.

Now, if you look at the pull request, you will notice that I divided it into two patches.

The first one (https://github.com/omegat-org/omegat/pull/234/commits/50cc6a0109a80f8f02b41e4dafff44d98fca7339) works only when the match contains no character affected by normalization: 気温 or です will work (and I do check that the text at the match position and length corresponds to the searched text), but not 20°C, because °C is affected by normalization and, as you say, the part of the string where we find it changes the length of the match.

In the second patch (https://github.com/omegat-org/omegat/pull/234/commits/f6405fbc45ef1157956fe093a4ba2061a08d2049) I try to find the equivalent of the normalized match in the non-normalized string, including position and length. It works for the given sample, but I am not absolutely sure it is perfect, which is why I asked you to test it with other samples: as a Japanese user, you probably have more examples than I do for this topic. I also invite you to test this in big projects or with long phrases: this method (character by character, with multiple calls to normalization) may be slow, but it is the only one I see as long as we don't have a way to reverse normalization, so it is useful to know.

In the worst case, if this part is too slow or has side effects, we may restrict ourselves to the first patch and add a note to the documentation that using the "Full/Half width insensitive" option may break highlighting in some cases.

Aug 05 '22 06:08 t-cordonnier

Also, in the code you quoted I saw a potential bug: checking only the length of the target string is not enough. What if the normalization shrinks some characters and expands others? It could happen that the length is identical while the match contains normalized characters, and the length problem Aaron tried to avoid reappears.
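A minimal sketch of that pitfall, again assuming NFKC as the normalization (`SameLengthPitfall` and its helpers are hypothetical names): a halfwidth katakana KA followed by a halfwidth voiced sound mark (U+FF76 U+FF9E) shrinks from two characters to one (U+30AC), while the degree Celsius sign (U+2103) grows from one to two, so the total length is unchanged although every offset inside the string has moved.

```java
import java.text.Normalizer;

public class SameLengthPitfall {
    // Assumption: width-insensitive search normalizes text with NFKC.
    static String normalizeWidth(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // The check under discussion: compare lengths only.
    static boolean sameLength(String text) {
        return text.length() == normalizeWidth(text).length();
    }

    // A stricter check: the text is really untouched by normalization.
    static boolean unchangedByNormalization(String text) {
        return text.equals(normalizeWidth(text));
    }

    public static void main(String[] args) {
        // Halfwidth KA + halfwidth voiced sound mark + DEGREE CELSIUS.
        String text = "\uFF76\uFF9E\u2103";
        System.out.println("same length: " + sameLength(text));
        System.out.println("unchanged:   " + unchangedByNormalization(text));
    }
}
```

Comparing the strings themselves rather than their lengths avoids the false negative, at the cost of one extra string comparison.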

Aug 05 '22 06:08 t-cordonnier

Also, in the code you quoted I saw a potential bug: checking only the length of the target string is not enough. What if the normalization shrinks some characters and expands others? It could happen that the length is identical while the match contains normalized characters, and the length problem Aaron tried to avoid reappears.

Good catch and I agree with your insight.

Aug 13 '22 03:08 miurahr

merged to master

Aug 18 '22 11:08 miurahr
