hs.styledtext and getString() for high plane Unicode characters
Hello,
I am trying to create a chooser that will help me select between different cuneiform Unicode characters. The setup and display of these characters -- which sit high up in the Unicode planes, above 0x12000 -- works fine: the chooser displays and filters properly and there are no issues. However, when I "select" a choice, I am unable to extract the plain Unicode string out of the choice itself.
I have confirmed that I can get a hs.styledtext object back, and in the Hammerspoon console it even prints this object with the correct character. But myText:getString() returns nil every time.
Here is an example:
myString = "𒍠" -- a Unicode cuneiform character (might need fonts to display)
styled = hs.styledtext(myString)
print(styled:getString())
Here is a shot of my console:
I'm not sure why that's not working, but as a potential workaround, could you just use a lookup table?
For example:
lookupTable = {
    ["item1"] = hs.styledtext.new("item1"),
    ["item2"] = hs.styledtext.new("item2"),
}
...and use the key in the lookupTable to return your actual styled text result?
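For instance, here is a minimal sketch of that idea wired into hs.chooser; the plainText key and the pasteboard call are just illustrative choices, not anything the original code requires (hs.chooser passes any extra keys in a choice table back to the callback untouched):

-- Sketch: keep the plain string alongside the styled text in each choice,
-- so the callback never needs getString() at all.
local choices = {}
for plain, styled in pairs(lookupTable) do
    table.insert(choices, { text = styled, plainText = plain })
end

local chooser = hs.chooser.new(function(choice)
    if choice then
        -- choice.plainText is the ordinary Lua string we stored above
        hs.pasteboard.setContents(choice.plainText)
    end
end)
chooser:choices(choices)
chooser:show()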
After adding some logging to styledtext and trying your sample code, the problem appears to be due to encoding differences between UTF8 and UTF16...
As hex bytes, myString is:
> hs.utf8.hexDump(myString)
00 : F0 92 8D A0 : ....
That entire sequence of 4 bytes is considered a single UTF8 character based solely on byte values (see Lua's utf8.charpattern) -- I don't know yet if it's actually a valid single character according to the spec -- I'll have to research that.
However, macOS does everything in UTF16, and it considers the sequence to be 2 UTF16 characters. There is a conversion function which attempts to map between the variable-width UTF8 sequences and the fixed-width code units of UTF16, but it assumes the byte pattern is a sufficient test, which apparently is a naive assumption.
This causes the code which pulls the actual string out of the styled text object to fail because the range specified breaks a composite character.
The fix will require digging into the specs for UTF8 and UTF16 a bit deeper (ugh) so I don't have an eta, but I'll add it to my list and see what I can come up with.
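To make the mismatch concrete, here is a rough illustration in plain Lua (not the Hammerspoon internals) of how the three length measurements disagree for that character:

local s = "𒍠"                  -- U+12360, the cuneiform character from the hex dump
print(#s)                       --> 4: length in UTF-8 bytes
print(utf8.len(s))              --> 1: length in Unicode code points
local cp = utf8.codepoint(s)
-- anything above U+FFFF needs a surrogate pair in UTF-16, so the
-- NSString backing the styled text reports a length of 2 here
print(cp > 0xFFFF and 2 or 1)   --> 2: length in UTF-16 code units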
Piggybacking on this, there are also other problems with some Unicode characters that are probably 4 bytes in UTF-16 (e.g. emoji). For example, using styledtext with subparts selected by indices (e.g. using :setStyle(whatever, start, end)) not only produces the usual problems of character vs byte indexing, but also corrupts some characters in ways I have yet to understand.
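To illustrate the kind of call being described, a sketch along these lines (the string, the indices, and the attribute table are made up for the example) is where byte-derived positions and character positions stop agreeing:

-- Illustrative only: try to color the word "after" in a string with an emoji.
-- Counting characters, "after" is positions 10-14; counting bytes with #,
-- it would come out as 13-17, because the emoji alone occupies 4 bytes.
local st = hs.styledtext.new("before 😡 after")
local colored = st:setStyle({ color = { red = 1.0, green = 0.0, blue = 0.0, alpha = 1.0 } }, 10, 14)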
Sorry it took so long, but if any of you are comfortable building your own local copy of Hammerspoon, give pull 3356 a look and let me know if it works for you.
I'd love to, but I'm really burnt out at the moment, so I'd like to sit this out and let someone else here in the thread see if it works ^^
I tried to put a string that contains an emoji into styledText and color a substring of that text. It works perfectly without emojis, but as soon as an emoji is there, all those positions are off.
I want to use more emojis lol
Before or after applying the new pull? And if after, can you give specific emojis? When I tried it with emojis on my build, it worked.
Just the existing version; I haven't tried the new pull. I don't know how to build locally and this wasn't bothering me enough to justify learning.
I'll illustrate the difference:
No emojis, looks great:
With emojis, everything starts shifting:
I have this function that pads the columns and finds the indexes of the first column; later, when I use styledText to add color, it doesn't work well:
function formatTable(t)
    local maxLen = 0
    local indexTable = {}
    local outStr = ""
    -- Find the maximum length of the first column
    for _, v in pairs(t) do
        if #v[1] > maxLen then
            maxLen = #v[1]
        end
    end
    -- Create the output string and index table
    for i, v in pairs(t) do
        local padding = string.rep(" ", maxLen - #v[1])
        indexTable[i] = {start=#outStr+1, stop=#outStr+#v[1]}
        outStr = outStr .. v[1] .. padding .. " " .. v[2] .. "\n"
    end
    return outStr, indexTable
end
I initially thought using utf8.len() would be the solution, but I guess some are utf16?
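For what it's worth, a minimal sketch of the same function counting code points with Lua's utf8.len instead of bytes might look like the following (formatTableUtf8 is just an illustrative name; this still counts code points, not UTF-16 code units or grapheme clusters, so it won't cover every emoji):

function formatTableUtf8(t)
    local maxLen = 0
    local indexTable = {}
    local outStr = ""
    local outLen = 0                          -- running length in code points, not bytes
    -- Find the maximum length (in code points) of the first column
    for _, v in pairs(t) do
        local len = utf8.len(v[1])
        if len > maxLen then
            maxLen = len
        end
    end
    -- Create the output string and index table using code point positions
    for i, v in pairs(t) do
        local len = utf8.len(v[1])
        local padding = string.rep(" ", maxLen - len)
        indexTable[i] = {start = outLen + 1, stop = outLen + len}
        outStr = outStr .. v[1] .. padding .. " " .. v[2] .. "\n"
        -- first column + padding + separator space + second column + newline
        outLen = outLen + maxLen + 1 + utf8.len(v[2]) + 1
    end
    return outStr, indexTable
end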
Old Unicode-based emojis work fine: ✅ ☠️ 〰️. Modern emojis all cause a problem: 🤔 🧲 😡
Well, the problem boils down to the fact that utf8 characters can be 1-4 bytes long -- the spec that utf8 is a subset of allows longer sequences, but as Unicode is (so far) limited to code points up to 0x10ffff, sequences of 5+ bytes aren't used (yet) -- while utf16 characters are always 2 bytes... except for those that need surrogate pairs, in which case they are 4 bytes: 2 utf16 code units that together represent a single character.
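For the curious, the surrogate-pair math for a code point above U+FFFF is roughly the following sketch (using the cuneiform character from earlier in the thread as the example):

local cp = 0x12360                   -- code point of the character from the hex dump above
local v = cp - 0x10000               -- surrogate pairs encode the offset past the BMP
local high = 0xD800 + (v >> 10)      -- high (lead) surrogate: top 10 bits
local low  = 0xDC00 + (v & 0x3FF)    -- low (trail) surrogate: bottom 10 bits
print(string.format("%04X %04X", high, low))   --> D808 DF60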
It's annoying.
@Madd0g I'll be curious to see how the new build works with your code when @cmsj creates a new release... the updated pull has been merged.
I tried the new version and it's still problematic; as soon as I use an emoji, everything gets skewed:

However, I still have a lot of places where I measure string lengths with #str -- should I be using utf8.len(str) everywhere for this fix to work?
Actually, if I use utf8.len() and only older emojis, it works nicely...

Unfortunately, using newer emojis still messes it up: 🧲 🤖 4️⃣