Search for selection acts weird for some characters

Open Alex-Kent opened this issue 1 year ago • 4 comments

I'm seeing odd behavior when using the search for selection (Control+H, referred to as ^H below) feature when the search string contains certain characters. The specific character I'm seeing issues with is ſ (long s). In a nutshell, the search results vary (and sometimes match strings that aren't the search string) depending on where/how the search string characters are selected (see examples below). A sample of text that shows issues is:

# l s ſ	U+017F	latin small letter long S
# f s ſ	U+017F	latin small letter long s
 ſ U

[The above text contains both space and tab characters.]

Some of the issues that I've observed (all related to searching for strings containing the character ſ) are:

  1. Manually select the character ſ on the first line.
    1. Press ^H; the character ſ on the second line is selected.
    2. Press ^H; the selection remains the same (ditto for then pressing ^G).
  2. Manually select the character ſ on the second line.
    1. Press ^H; the character ſ on the third line is selected.
    2. Press ^H; the selection remains the same (ditto for then pressing ^G).
  3. Manually select the character ſ on the third line.
    1. Press ^H; the character ſ on the first line is selected.
    2. Press ^H; the selection remains the same (ditto for then pressing ^G).
  4. Manually select ſ U (four characters) on the third line.
    1. Press ^H; the three-character string ſ on the third line is selected [the trailing character is a tab].
    2. Press ^H; the two-character string ſ on the third line is selected.
    3. Press ^H; the two-character string ſ on the first line is selected.
    4. Press ^H; the two-character string ſ on the second line is selected.
    5. Press ^H; the space character ( ) before the word small on the second line is selected.
    6. All subsequent presses of ^H select subsequent space ( ) characters but not tab characters.

     The behavior in v. and vi. may vary. I've found that adding any text to the tail of the file changes the latter matches to s (rather than just ). This behavior often (but not always) persists even if the added text is removed manually or with undo.

     Also, pasting using the middle mouse button into another window (Firefox in my case) before pressing ^H results in the following being pasted:

     For 4. ſ U
     For 4. i. ſ
     For 4. ii. ſ
     But for 4. iii. it's Å

     [Not sure if it's relevant but I did type/paste/select/search for Å earlier today.]
  5. Change the string ſ U to ſ U (change third character to a tab); the same behaviour as for 4. is seen.
  6. Manually select the single character ſ or the four-character string ſ U in a different program (Firefox in my test).
    1. With the Xnedit window focused press ^H; no text is selected.

Normal behavior (for comparison):

  1. Select any occurrence of the character U.
    1. Press ^H; the subsequent U character is selected (exactly which one depends on cursor position).
    2. All subsequent presses of ^H select subsequent U characters.
  2. Clear the Xnedit selection then select the character U in a different program (Firefox in my test).
    1. With the Xnedit window focused press ^H; one of the U characters is selected (exactly which one depends on cursor position).
    2. All subsequent presses of ^H select subsequent U characters.

It also seems that in the context of searching the characters S and s are considered to be the same letter but ſ is not (i.e. something awry with converting ſ to normal form). Likewise, maybe ss and ß should be considered the same letter/string when searching (though this doesn't appear to be noted in the Unicode spec). [Perhaps this paragraph should be a separate bug report.]
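The case relationships described above can be checked against the standard Unicode case mappings. The following is a minimal sketch in Python, used here only for illustration: xnedit is a C program and may map characters differently, and `case_forms` is a hypothetical helper name. It shows that ſ uppercases to S but is already lower case, so a comparison built on lowercasing alone would not equate ſ with s, while full case folding (Python's `str.casefold`) maps ſ to s and ß to ss.

```python
# Illustration only: Python's built-in Unicode case mappings,
# not xnedit's actual comparison code.
LONG_S = "\u017f"   # ſ, LATIN SMALL LETTER LONG S
SHARP_S = "\u00df"  # ß, LATIN SMALL LETTER SHARP S


def case_forms(ch: str) -> dict:
    """Return the upper, lower, and case-folded forms of a string."""
    return {
        "upper": ch.upper(),
        "lower": ch.lower(),
        "casefold": ch.casefold(),
    }


if __name__ == "__main__":
    for ch in ("s", "S", LONG_S, SHARP_S):
        print(repr(ch), case_forms(ch))
```

Comparing case-folded strings treats s, S, and ſ as equal, whereas comparing lowercased strings leaves ſ distinct from the other two.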

I'm running 1.6.0 (built from source) on Ubuntu Linux 22.04.5.

[Sorry for the deluge; I tried to provide a sizable set of test cases to make it easier to isolate the problem.]

Alex-Kent avatar Jan 04 '25 14:01 Alex-Kent

Just a quick note that this behavior appears specific to the ſ character. I'll post an update if I encounter any others showing similar behavior.

Alex-Kent avatar Jan 04 '25 20:01 Alex-Kent

Update: Selecting Å in another window and then pressing ^H in Xnedit has no visible effect. Selecting "Å" in another window and then searching with ^H matches all " characters (not the full "Å" string). Behavior when searching for a similar selection made within Xnedit is fine though. [Edit: this is for searches against my full ~/.XCompose file, not the sample above.]

Alex-Kent avatar Jan 04 '25 20:01 Alex-Kent

I have noticed slightly different behaviour, but I think it can depend on the current locale setting. Anyhow, I think I have fixed some (or all) of the bugs. Some behaviour is maybe questionable:

If I select ſ and press ^H, the S at the end of the first line is selected. After that, ^H will only select s or S. In theory this makes sense.

Selecting ſ in firefox and pressing ^H in xnedit should work now.

I have also added an experimental alternative implementation. If you add -DUSE_STRSTR to the CFLAGS and recompile xnedit, Find Selection will use the libc function strcasestr instead, which might have a different behaviour.
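As a rough model of how a byte-oriented search can differ from a Unicode-aware one, here is a short Python sketch. It assumes that the search folds only ASCII letters and compares every other byte literally; whether strcasestr actually behaves that way depends on the libc and the active locale, so treat this as an assumption rather than a description of xnedit. `ascii_only_find` is a hypothetical helper, not part of xnedit or libc.

```python
# Rough model (an assumption, not xnedit's code): a case-insensitive
# search over UTF-8 bytes in which only ASCII A-Z / a-z are folded.
def ascii_only_find(haystack: str, needle: str) -> int:
    """Return the byte offset of the first match, or -1 if there is none."""
    # bytes.lower() changes ASCII letters only; bytes >= 0x80 are untouched.
    h = haystack.encode("utf-8").lower()
    n = needle.encode("utf-8").lower()
    return h.find(n)


if __name__ == "__main__":
    line = "# l s \u017f\tU+017F\tlatin small letter long S"
    # "S" is folded, so it matches the ASCII "s" early in the line.
    print(ascii_only_find(line, "S"))
    # The long s matches only its own two-byte UTF-8 sequence.
    print(ascii_only_find(line, "\u017f"))
    # Å versus å: no ASCII folding applies, so there is no match.
    print(ascii_only_find("\u00c5", "\u00e5"))
```

Under this model non-ASCII characters match only themselves, which is one way the two build options could give different results for characters such as ſ or Å.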

unixwork avatar Feb 05 '25 20:02 unixwork

I built the current source [c5b1120e] and tested both options. Both are mostly working as you described (showing none of the issues I originally saw) but ſ is still acting odd (see below). In general I think the case-insensitive version is the better choice but I can see having a literal (strcasestr) match be better in some circumstances; perhaps the desired behavior could be selectable either through nedit.rc or via a preference setting? (I'm partial to making it selectable via the GUI.)

As for locale, running env within Xnedit shows the following on my system: LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8

Searches for ſ are still weird on my system, with ^H (and subsequently ^G) returning not just variants of the letter s (S ś Ś ŝ Ŝ š Š ş Ş [but not s Ṡ ṡ Ṣ ṣ ʂ]) but also some seemingly-unrelated characters (ń Ń ň Ň ņ Ņ ŏ Ŏ ō Ō œ Œ ŕ Ŕ ř Ř ŗ Ŗ ť Ť ţ Ţ ŭ Ŭ ů Ů ũ Ũ ū Ū ŵ Ŵ ŷ Ŷ Ÿ ź Ź ž Ž ż Ż [but not e.g. ÿ]). These matches don't appear to be reciprocal, e.g. ſ matches ů (amongst others) but ů only matches Ů or ů [the expected behaviour]. That ſ matches S but not s is odd since ſ is a variant form of the lower-case s (i.e. it's a lower-case letter).
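One checkable observation about the character lists above, offered as a hint rather than a diagnosis: every non-ASCII character reported as matching ſ lies in U+0140 to U+017F and therefore shares the UTF-8 lead byte 0xC5 with ſ, while the characters reported as not matching (ÿ, Ṡ, ṡ, Ṣ, ṣ, ʂ) start with a different byte. The ASCII S is left out because it is reachable through the ordinary upper-case mapping of ſ. The sketch below verifies this in Python; `lead_byte` is a hypothetical helper, and nothing in the thread confirms that a byte-level comparison is the actual cause.

```python
import unicodedata

# Character lists copied from the report above; grouping them by UTF-8
# lead byte is an illustration, not a confirmed explanation of the bug.
LONG_S = "\u017f"
MATCHED = (
    "ś Ś ŝ Ŝ š Š ş Ş ń Ń ň Ň ņ Ņ ŏ Ŏ ō Ō œ Œ ŕ Ŕ ř Ř ŗ Ŗ "
    "ť Ť ţ Ţ ŭ Ŭ ů Ů ũ Ũ ū Ū ŵ Ŵ ŷ Ŷ Ÿ ź Ź ž Ž ż Ż"
).split()
NOT_MATCHED = "ÿ Ṡ ṡ Ṣ ṣ ʂ".split()


def lead_byte(ch: str) -> int:
    """First byte of the UTF-8 encoding of the NFC-normalized character."""
    return unicodedata.normalize("NFC", ch).encode("utf-8")[0]


if __name__ == "__main__":
    print("long s lead byte:", hex(lead_byte(LONG_S)))
    print("matched:    ", sorted({hex(lead_byte(c)) for c in MATCHED}))
    print("not matched:", sorted({hex(lead_byte(c)) for c in NOT_MATCHED}))
```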

For completeness, I also noticed that when searching using a case-insensitive regex s and S match each other but don't match ſ or e.g. Š; a case-insensitive regex search for ſ only matches itself.

Out of curiosity, what's the technical reason for ſ acting so oddly?

Alex-Kent avatar Feb 09 '25 09:02 Alex-Kent