Highlight normalized
Pull request type
- [x] Bug fix
- [ ] Feature
- [ ] Documentation
- [ ] Build and release changes
- [ ] Other (describe below)
Which ticket is resolved?
From the discussion list: https://sourceforge.net/p/omegat/mailman/message/37682088/ (no ticket opened)
What does this PR change?
The problem mentioned in the discussion comes from the "Full/Half width insensitive" option: when it is enabled, the string is normalized, and if this normalization changes the length of the string, highlighting is disabled.
Given the sample phrase 今の気温は20℃です。, where ℃ is the character that causes the problem, I distinguish three cases for the solution:
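As a minimal illustration of why offsets stop lining up, the sketch below normalizes the sample phrase and compares lengths and positions. It assumes that width normalization behaves like Unicode NFKC as provided by `java.text.Normalizer`; the class and the `normalizeWidth` helper are hypothetical names used only for this example, not the code touched by this PR.

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of the PR.
final class WidthNormalizationDemo {

    // Assumption: width normalization is equivalent to Unicode NFKC.
    static String normalizeWidth(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The sample phrase from the description, written with Unicode
        // escapes so the source file stays ASCII-only.
        // \u2103 is the single DEGREE CELSIUS character.
        String original =
                "\u4ECA\u306E\u6C17\u6E29\u306F20\u2103\u3067\u3059\u3002";
        String normalized = normalizeWidth(original);

        // The degree-Celsius sign expands to two characters (degree sign + C),
        // so the normalized string is one character longer.
        System.out.println("length: " + original.length()
                + " -> " + normalized.length());

        // Text after the expanded character is shifted by one position.
        String desu = "\u3067\u3059";
        System.out.println("offset: " + original.indexOf(desu)
                + " -> " + normalized.indexOf(desu));
    }
}
```

Under the NFKC assumption, the length goes from 11 to 12 and です moves from offset 8 to offset 9, while 気温 (before ℃) keeps its offset.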
- string before the normalized characters, like 気温
- string after the normalized characters, like です
- search containing the normalized characters, for example if you search "20°C"
The first commit solves cases 1 and 2: non-normalized characters may have been shifted, but nothing else changes. The second commit solves case 3, which is more complicated: the found text is normalized, and we have no method to "un-normalize" it in order to find the equivalent in the original string. So I search character by character until I find a piece of text whose normalized form equals the searched text.
This behavior was intentionally introduced in 2016, as of version 3.6.0: https://github.com/omegat-org/omegat/commit/6d3b0ddc1a31f0b987e0580c52512ef00a890627
The commit message reads:
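The character-by-character search described above could look roughly like the following sketch. It is not the code from the commits: `NormalizedMatchLocator`, `locate`, and `normalizeWidth` are hypothetical names, and width normalization is assumed to be Unicode NFKC via `java.text.Normalizer`.

```java
import java.text.Normalizer;

// Hypothetical sketch; not the implementation from this PR.
final class NormalizedMatchLocator {

    // Assumption: width normalization is equivalent to Unicode NFKC.
    static String normalizeWidth(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    /**
     * Finds the first region of {@code original}, starting at or after
     * {@code from}, whose normalized form equals {@code normalizedMatch}.
     *
     * @return {start, end} offsets into the original string (end exclusive),
     *         or null when no such region exists
     */
    static int[] locate(String original, String normalizedMatch, int from) {
        for (int start = Math.max(from, 0); start < original.length(); start++) {
            // Grow the candidate one character at a time and re-normalize.
            for (int end = start + 1; end <= original.length(); end++) {
                String candidate = normalizeWidth(original.substring(start, end));
                if (candidate.equals(normalizedMatch)) {
                    return new int[] {start, end};
                }
                // Assumption: the normalized candidate does not get shorter
                // as it grows, so a candidate longer than the match cannot
                // become equal to it later.
                if (candidate.length() > normalizedMatch.length()) {
                    break;
                }
            }
        }
        return null;
    }
}
```

With the sample phrase, `locate(original, "20°C", 0)` would return the region covering 20℃ (three characters in the original for a four-character match). A match that covers only part of an expanded character, such as a search for "C" alone, has no counterpart in the original, and this sketch returns null for it. Each candidate triggers a new normalization call, which is where the cost on long segments comes from.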
Give up on highlighting width-normalized search results.
Not only the offsets of the match, but even the length of the match itself can change due to width normalization, so we just won't try.
git-svn-id: svn+ssh://svn.code.sf.net/p/omegat/svn/trunk@8160 b0d8beef-cb45-0410-a8e4-c0d495c3b779
Hi Hiroshi
Thanks for this info.
Now, if you look at the pull request, you will notice that I divided it into two patches. The first one (https://github.com/omegat-org/omegat/pull/234/commits/50cc6a0109a80f8f02b41e4dafff44d98fca7339) works only when the match contains no character affected by normalization: 気温 or です will work (and I do check that the text at the match position and length corresponds to the searched text), but not 20°C, because °C is affected by normalization and, as you say, the part of the string where we find it changes the length of the match. In the second patch (https://github.com/omegat-org/omegat/pull/234/commits/f6405fbc45ef1157956fe093a4ba2061a08d2049), I try to find the equivalent of the normalized match in the non-normalized string, including position and length. It works on the given sample, but I am not absolutely sure it is perfect, which is why I asked you to test it on other samples: as a Japanese user, you probably have more examples than I do for this topic. I also invite you to test this on big projects or long phrases: this method (character by character, with multiple calls to normalization) may be slow, but it is the only method I see as long as we cannot reverse normalization, so it is useful to know how it performs.
In the worst case, if this part is too slow or has side effects, we can restrict the change to the first patch and add a note to the documentation that the "Full/Half width insensitive" option may break highlighting in some cases.
Also, in the code you quoted I saw a potential bug: checking only the length of the target string is not enough. What if normalization shortens some characters and expands others? The length could be identical even though the match contains normalized characters, and the length problem Aaron tried to avoid would reappear.
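The sketch below constructs such a case, assuming width normalization is Unicode NFKC: a half-width katakana pair ｶﾞ shrinks to the single character ガ, while ℃ expands to °C, so the total length is unchanged although the offsets are not. `EqualLengthPitfall` and its methods are hypothetical names for illustration, not code from OmegaT.

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of the PR.
final class EqualLengthPitfall {

    // Assumption: width normalization is equivalent to Unicode NFKC.
    static String normalizeWidth(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // The length-only check under discussion: true when normalization
    // leaves the total length unchanged.
    static boolean sameLength(String original) {
        return original.length() == normalizeWidth(original).length();
    }

    // A stricter check: the region of the original at the given offsets
    // must itself normalize to the matched text.
    static boolean regionNormalizesTo(String original, int start, int end,
            String matched) {
        if (start < 0 || end > original.length() || start > end) {
            return false;
        }
        return normalizeWidth(original.substring(start, end)).equals(matched);
    }

    public static void main(String[] args) {
        // Half-width KA + half-width voiced mark, DEGREE CELSIUS, "desu".
        String original = "\uFF76\uFF9E\u2103\u3067\u3059";
        String normalized = normalizeWidth(original);
        String matched = "\u00B0C";
        int start = normalized.indexOf(matched);
        int end = start + matched.length();

        System.out.println("same length: " + sameLength(original));
        System.out.println("offsets reusable: "
                + regionNormalizesTo(original, start, end, matched));
    }
}
```

Here the length check passes, yet the offsets of the match in the normalized string point at the wrong region of the original, which is what the stricter per-region check detects.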
Good catch and I agree with your insight.
merged to master