Search for selection acts weird for some characters
I'm seeing odd behavior when using the search for selection (Control+H, referred to as ^H below) feature when the search string contains certain characters. The specific character I'm seeing issues with is ſ (long s). In a nutshell, the search results vary (and sometimes match strings that aren't the search string) depending on where/how the search string characters are selected (see examples below). A sample of text that shows issues is:
# l s ſ U+017F latin small letter long S
# f s ſ U+017F latin small letter long s
ſ U
[The above text contains both space and tab characters.]
Some of the issues that I've observed (all related to searching for strings containing the character ſ) are:
1. Manually select the character ſ on the first line.
   i. Press ^H; the character ſ on the second line is selected.
   ii. Press ^H; the selection remains the same (ditto for then pressing ^G).
2. Manually select the character ſ on the second line.
   i. Press ^H; the character ſ on the third line is selected.
   ii. Press ^H; the selection remains the same (ditto for then pressing ^G).
3. Manually select the character ſ on the third line.
   i. Press ^H; the character ſ on the first line is selected.
   ii. Press ^H; the selection remains the same (ditto for then pressing ^G).
4. Manually select ſ U (four characters) on the third line.
   i. Press ^H; the three-character string starting with ſ on the third line is selected [the trailing character is a tab].
   ii. Press ^H; the two-character string starting with ſ on the third line is selected.
   iii. Press ^H; the two-character string starting with ſ on the first line is selected.
   iv. Press ^H; the two-character string starting with ſ on the second line is selected.
   v. Press ^H; the space character ( ) before the word small on the second line is selected.
   vi. All subsequent presses of ^H select subsequent space ( ) characters but not tab characters.

   The behavior in v. and vi. may vary. I've found that adding any text to the tail of the file changes the latter matches to " s" (rather than just " "). This behavior often (but not always) persists even if the added text is removed manually or with undo. Also, pasting using the middle mouse button into another window (Firefox in my case) before pressing ^H results in the following being pasted:
   - For 4.: ſ U
   - For 4. i.: ſ
   - For 4. ii.: ſ
   - But for 4. iii. it's Å [Not sure if it's relevant but I did type/paste/select/search for Å earlier today.]
5. Change the string ſ U to ſ U (change third character to a tab); the same behaviour as for 4. is seen.
6. Manually select the single character ſ or the four-character string ſ U in a different program (Firefox in my test).
   i. With the Xnedit window focused press ^H; no text is selected.
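One thing that might be worth checking regarding the stray Å: the UTF-8 encoding of ſ starts with the byte 0xC5, and that byte on its own reads as Å in Latin-1. The short Python sketch below only demonstrates this encoding fact; whether it has anything to do with what ends up in the X selection is an assumption on my part, and the Å could just as well be left over from the earlier selection.

```python
long_s = "\u017f"                     # the character ſ (U+017F)
utf8_bytes = long_s.encode("utf-8")   # two bytes in UTF-8

# Decode only the first byte as Latin-1, as a consumer that mistakes
# the data for a single-byte encoding and drops the second byte might do.
first_byte_latin1 = utf8_bytes[:1].decode("latin-1")

print(utf8_bytes.hex(), first_byte_latin1)  # c5bf Å
```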
Normal behavior (for comparison):
1. Select any occurrence of the character U.
   i. Press ^H; the subsequent U character is selected (exactly which one depends on cursor position).
   ii. All subsequent presses of ^H select subsequent U characters.
2. Clear the Xnedit selection then select the character U in a different program (Firefox in my test).
   i. With the Xnedit window focused press ^H; one of the U characters is selected (exactly which one depends on cursor position).
   ii. All subsequent presses of ^H select subsequent U characters.
It also seems that in the context of searching the characters S and s are considered to be the same letter but ſ is not (i.e. something awry with converting ſ to normal form). Likewise, maybe ss and ß should be considered the same letter/string when searching (though this doesn't appear to be noted in the Unicode spec). [Perhaps this paragraph should be a separate bug report.]
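To illustrate the kind of equivalence I mean, here is a minimal Python sketch (Python is used purely for demonstration, and `same_letter` is a made-up helper name). It compares strings after Unicode full case folding as implemented by `str.casefold()`. For what it's worth, that folding treats ſ like s and also maps ß to ss, so both equivalences seem to be covered by Unicode's case-folding data.

```python
def same_letter(a: str, b: str) -> bool:
    """Compare two strings after Unicode full case folding."""
    return a.casefold() == b.casefold()

# Pairs discussed above: plain S/s, long s, and sharp s.
for a, b in [("S", "s"), ("ſ", "s"), ("ſ", "S"), ("ß", "ss")]:
    print(f"{a!r} vs {b!r}: {same_letter(a, b)}")
```

All four pairs compare equal under this folding, which is the behaviour I would intuitively expect from a case-insensitive search.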
I'm running 1.6.0 (built from source) on Ubuntu Linux 22.04.5.
[Sorry for the deluge; I tried to provide a sizable set of test cases to make it easier to isolate the problem.]
Just a quick note that this behavior appears specific to the ſ character. I'll post an update if I encounter any others showing similar behavior.
Update: Selecting Å in another window and then pressing ^H in Xnedit has no visible effect. Selecting "Å" in another window and then searching with ^H matches all " characters (not the full "Å" string). Behavior when searching for a similar selection made within Xnedit is fine though. [Edit: this is for searches against my full ~/.XCompose file, not the sample above.]
I have noticed slightly different behaviour, but I think it can depend on the current locale setting. Anyhow, I think I have fixed some (or all) of the bugs. Some behaviour is maybe questionable:
If I select ſ and press ^H, the S at the end of the first line is selected. After that, ^H will only select s or S. In theory this makes sense.
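A rough model of why this happens, written as a Python sketch rather than the actual xnedit code: assume a text character matches when it equals either the upper-case or the lower-case form of the selected character. The names `char_matches` and `find_next` are hypothetical, and the matching rule itself is an assumption made for illustration.

```python
def char_matches(needle_ch: str, text_ch: str) -> bool:
    """Assumed rule: match the upper- or lower-case form of the needle char."""
    return text_ch in (needle_ch.upper(), needle_ch.lower())

def find_next(text: str, needle: str, start: int) -> int:
    """Index of the next match at or after start (wrapping around), or -1."""
    n, m = len(text), len(needle)
    for offset in range(n):
        i = (start + offset) % n
        window = text[i:i + m]
        if len(window) == m and all(
            char_matches(a, b) for a, b in zip(needle, window)
        ):
            return i
    return -1

line = "# l s ſ U+017F latin small letter long S"
pos = find_next(line, "ſ", line.index("ſ") + 1)  # search on from the selected ſ
print(pos, line[pos])
```

Under this rule ſ has the upper-case form S but is its own lower-case form, so a search for ſ can reach S but never s. Once S is the selection, its two case forms are S and s, and ſ is no longer reachable.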
Selecting ſ in firefox and pressing ^H in xnedit should work now.
I have also added an experimental alternative implementation. If you add -DUSE_STRSTR to the CFLAGS and recompile xnedit, Find Selection will use the libc function strcasestr instead, which might have a different behaviour.
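As a rough idea of how the behaviour might differ, here is a Python approximation of a byte-wise, ASCII-only case-insensitive search. It assumes that strcasestr in a UTF-8 locale only folds ASCII letters and leaves bytes above 0x7F alone; I have not checked that against every libc. `bytewise_casefind` is a hypothetical name and this is not the code used in xnedit.

```python
def bytewise_casefind(haystack: str, needle: str) -> int:
    """Byte offset of the first match, or -1 if there is none.

    bytes.lower() only changes ASCII letters, so multi-byte UTF-8
    characters are compared literally.
    """
    hay = haystack.encode("utf-8").lower()
    pat = needle.encode("utf-8").lower()
    return hay.find(pat)

print(bytewise_casefind("long S", "s"))  # ASCII letters fold
print(bytewise_casefind("Š", "š"))       # non-ASCII letters do not
print(bytewise_casefind("x ſ", "ſ"))     # ſ only matches itself
```

With such a search, s and S match each other, while ſ and accented letters only match their exact byte sequence.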
I built the current source [c5b1120e] and tested both options. Both are mostly working as you described (showing none of the issues I originally saw) but ſ is still acting odd (see below). In general I think the case-insensitive version is the better choice but I can see having a literal (strcasestr) match be better in some circumstances; perhaps the desired behavior could be selectable either through nedit.rc or via a preference setting? (I'm partial to making it selectable via the GUI.)
As for locale, running env within Xnedit shows the following on my system:
LANG=en_US.UTF-8
LC_ALL=en_US.UTF-8
Searches for ſ are still weird on my system, with ^H (and subsequently ^G) returning not just variants of the letter s (S ś Ś ŝ Ŝ š Š ş Ş [but not s Ṡ ṡ Ṣ ṣ ʂ]) but also some seemingly-unrelated characters (ń Ń ň Ň ņ Ņ ŏ Ŏ ō Ō œ Œ ŕ Ŕ ř Ř ŗ Ŗ ť Ť ţ Ţ ŭ Ŭ ů Ů ũ Ũ ū Ū ŵ Ŵ ŷ Ŷ Ÿ ź Ź ž Ž ż Ż [but not e.g. ÿ]). These matches don't appear to be reciprocal, e.g. ſ matches ů (amongst others) but ů only matches Ů or ů [the expected behaviour]. That ſ matches S but not s is odd since ſ is a variant form of the lower-case s (i.e. it's a lower-case letter).
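One pattern that may be relevant: apart from plain S (which presumably matches through case mapping), every character in the matched list seems to share its first UTF-8 byte with ſ, while none of the unmatched ones do. The Python sketch below checks this. It assumes the characters are in precomposed (NFC) form, and `utf8_lead_byte` is just a helper name I made up. It says nothing about how xnedit actually compares characters.

```python
import unicodedata

def utf8_lead_byte(ch: str) -> int:
    """First byte of the UTF-8 encoding of a precomposed (NFC) character."""
    return unicodedata.normalize("NFC", ch).encode("utf-8")[0]

LONG_S = "\u017f"  # ſ
matched = ("ś Ś ŝ Ŝ š Š ş Ş ń Ń ň Ň ņ Ņ ŏ Ŏ ō Ō œ Œ ŕ Ŕ ř Ř ŗ Ŗ "
           "ť Ť ţ Ţ ŭ Ŭ ů Ů ũ Ũ ū Ū ŵ Ŵ ŷ Ŷ Ÿ ź Ź ž Ž ż Ż").split()
not_matched = "s Ṡ ṡ Ṣ ṣ ʂ ÿ".split()

target = utf8_lead_byte(LONG_S)
all_matched_share = all(utf8_lead_byte(c) == target for c in matched)
any_unmatched_share = any(utf8_lead_byte(c) == target for c in not_matched)

print(hex(target))          # 0xc5
print(all_matched_share)    # True
print(any_unmatched_share)  # False
```

So the stray matches are exactly the characters in the range U+0140 to U+017F, which would be consistent with a comparison that gets the first byte right and then goes wrong on the second one.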
For completeness, I also noticed that when searching using a case-insensitive regex s and S match each other but don't match ſ or e.g. Š; a case-insensitive regex search for ſ only matches itself.
Out of curiosity, what's the technical reason for ſ acting so oddly?