`range of selection` inconsistent with Unicode 'Combining Acute Accent' characters
Description
CE's range of selection is inconsistent when handling different kinds of Unicode characters with acute accents. I discovered this when I pasted this name into a document: "Kapuściński."
The ś and ń in this text are not the precomposed Unicode characters (U+015B and U+0144) but base letters followed by a 'Combining Acute Accent', i.e. each actually consists of two separate code points (the first is U+0073 followed by U+0301, the second is U+006E followed by U+0301).
This causes issues, at least in my use case.
To Reproduce
If I place the caret here -> "Kapuścińs^ki" then range of selection returns {11,0}. If I replace the accented characters with the precomposed Unicode characters (U+015B and U+0144) then it returns {9,0}, as expected.
Obviously, in the first case, any script that depends on this function then stops working properly.
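To make the two numbers concrete, here is a minimal Python sketch (not AppleScript, and not CotEditor's code) that counts UTF-16 code units for the text before the caret in both forms. The helper name `utf16_length` is made up for illustration, and the strings are written with explicit escapes so the code point form is unambiguous.

```python
def utf16_length(text: str) -> int:
    """Return the number of UTF-16 code units needed to store `text`."""
    return len(text.encode("utf-16-le")) // 2

# Text before the caret, with s/n followed by U+0301 (Combining Acute Accent).
decomposed = "Kapus\u0301cin\u0301s"
# The same text using the precomposed characters U+015B and U+0144.
precomposed = "Kapu\u015bci\u0144s"

print(utf16_length(decomposed))   # 11
print(utf16_length(precomposed))  # 9
```

The two strings look identical on screen, yet the offsets differ by two because each combining accent occupies its own UTF-16 code unit.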
Expected behavior
I considered just adding a text replacement subroutine to my on document saved handler. However, I can't get AppleScript to replace the characters.
According to macosxautomation, AppleScript has treated a base letter followed by a 'Combining Acute Accent' as a single character since AppleScript 2.0.
I guess the issue is whether CotEditor should follow this principle or not.
Edit:
I just tried to use CE's text replacement syntax (tell front document to replace for "ś" to "X") and it doesn't work. Where the for is U+0073/U+0301, CE ignores these characters and replaces U+015B instead.
CotEditor version
4.3.0
macOS version
12.4
I'd say it's in spec. CotEditor handles the string in the document as-is. In addition, CotEditor's AppleScript API counts characters in UTF-16 code units. Thus, if some characters in the document are not precomposed, CotEditor treats each of them as having a length of two. It depends on how the characters are stored in the actual document contents. I believe CotEditor should not change the current behavior because some users need to handle characters strictly.
"I just tried to use CE's text replacement syntax (tell front document to replace for "ś" to "X") and it doesn't work. Where the for is U+0073/U+0301, CE ignores these characters and replaces U+015B instead."
I guess this is another issue. The current implementation of CotEditor should replace both U+0073/U+0301 and U+015B with that command. See the screencast of my test below (I also attached the script and document):
https://user-images.githubusercontent.com/1165044/182037996-6fbc84ca-7c54-4185-b4bf-2860596c0b81.mov
Did you add the with all option to replace all occurrences instead of only the first one found?
If you added the option and still cannot replace both, please send me the script and a sample document file that reproduce this issue. Then I'll investigate further.
Hmmm, this is interesting. Though I can't decide right away what is the best implementation... Let me think.
"Did you add the with all option to replace all occurrences instead of the first met one?"
Ahhhh, this seems to be my error :)
I take the point that it makes sense for a powerful text editor to treat such characters in the most technically literal way possible, regardless of what AppleScript does.
I will go with my first instinct and use the replace for command in my on document saved handler (now that I understand that this works). That should solve it.
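For reference, the replace-on-save idea amounts to converting decomposed sequences into their precomposed equivalents, which Unicode calls NFC normalization. The following is only a Python sketch of that concept, not the AppleScript handler itself; `to_precomposed` is a hypothetical helper name.

```python
import unicodedata

def to_precomposed(text: str) -> str:
    """Compose base letters with following combining marks where Unicode defines a single character (NFC)."""
    return unicodedata.normalize("NFC", text)

# The pasted name, with s and n each followed by U+0301.
name = "Kapus\u0301cin\u0301ski"
fixed = to_precomposed(name)

print(len(name), len(fixed))              # 13 11 (code points)
print(fixed == "Kapu\u015bci\u0144ski")   # True
```

Note that normalization affects every composable sequence in the text, not just ś and ń, so it is a broader change than replacing two specific characters.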
I use CE to edit preferences files, write some code, and write prose. So, it can be complicated to achieve a setup that satisfies all cases.
Thanks for your help.
"Ahhhh, this seems to be my error :)"
Ok, that's good to hear there isn't a bug ;-)
Regarding the unit used to count characters, I changed my mind after reading Apple's documentation on AppleScript. I first thought it was about Unicode normalization; however, AppleScript in fact counts characters not by UTF-16 code units, as I had believed, but by grapheme clusters since AppleScript 2.0 (Mac OS X 10.5, 2007).
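As a rough illustration of the difference between the two counting units, here is a Python sketch. It is an approximation only: Python's standard library has no grapheme cluster segmentation, so the hypothetical helper `approx_grapheme_count` simply skips combining marks, which is enough for this example but does not implement the full Unicode segmentation rules (emoji sequences, Hangul jamo, and so on). It is not how AppleScript or CotEditor count characters internally.

```python
import unicodedata

def utf16_length(text: str) -> int:
    """Count UTF-16 code units."""
    return len(text.encode("utf-16-le")) // 2

def approx_grapheme_count(text: str) -> int:
    """Approximate user-perceived characters by not counting combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

# Text before the caret from the report, in decomposed form.
before_caret = "Kapus\u0301cin\u0301s"

print(utf16_length(before_caret))           # 11
print(approx_grapheme_count(before_caret))  # 9
```

Counting by user-perceived characters gives 9 regardless of whether the accents are stored precomposed or as combining marks, which matches the result the reporter expected.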
CotEditor's current specification regarding character counts was already in place when I took over the CotEditor project in 2013, and I have preserved it until now just to maintain compatibility. In other words, the current implementation was probably designed before AppleScript 2.0 was released. But if the specification on the AppleScript side has changed, CotEditor should follow it.
Now I plan to change the way characters are counted for AppleScript in CotEditor in the next minor update, CotEditor 4.4.0, which will be released this fall. This is a breaking change of sorts; nevertheless, I suppose I should do it.
Thank you for pointing out this inconsistency in the spec between AppleScript and CotEditor.