\font'ing a \unichar doesn't work; \font'ing its UTF-8 representation does.
SILE v0.10.9.r36-gf16af44, Xubuntu 20.04
\begin[class=book]{document}
\script[src=packages/unichar]
\set[parameter=font.family,value=Noto Serif CJK JP,makedefault=true]
Compare:
\par 風 \font[size=2em]{風}
\par \unichar{0x98a8} \font[size=2em]{\unichar{0x98a8}}
\bigskip
And:
\par 風が吹く \font[size=2em]{風が吹く}
\par \unichar{0x98a8}\unichar{0x304c}\unichar{0x5439}\unichar{0x304f} \font[size=2em]{\unichar{0x98a8}\unichar{0x304c}\unichar{0x5439}\unichar{0x304f}}
\bigskip
Both sets of lines should be equivalent, but are not.
\end{document}
:confounded:
Oh, joy. To make it even more fun for you all, this is a regression.
SILE v0.10.5:
I believe the most likely culprit to be @simoncozens' attempted fix to #979, 712bc925dfc1601111922d4bd9089ad161867020.
Oh that makes sense. Yes, it would be.
Since we stuffed \unichar-generated code points into the preceding run of nodes so that combining diacritics work, they don't fall in the group being styled. It's as if text rendering hates us.
I'm not quite sure what the answer is. Either we parse and handle combining marks separately from other characters, end up with side effects like this, drop support for combining altogether, or handle these kinds of inputs at a completely different point in the parsing sequence. Honestly the whole \unichar thing doesn't feel quite right to me, but I don't know what the answer is.
I think I’ve got it. Follow my train of thought here:
- Maybe a call to \font should create a new “text context”, perhaps by adding a new null hbox.
- Wait, that won’t work, as \font{\unichar will try to add its generated text to an hbox, which will go horribly wrong.
- Come to think of it, we already have that problem now if someone does e.g. \img\unichar.
- So maybe \unichar should test if the preceding node is unshaped and add text to it if so, but revert to the old behaviour if not.
- If we do that, and add a null hbox at the start of \font processing, it should all work.
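The train of thought above can be sketched in plain Lua. This is only an illustration on a toy node queue: `beginFont`, `appendUnichar`, and the node fields other than `is_unshaped` are hypothetical names invented for the sketch, not SILE's actual typesetter API.

```lua
-- Toy model of the proposal; NOT SILE's real node classes or typesetter.

-- Hypothetical: what \font would do first under the proposal,
-- i.e. push a zero-width hbox that acts as a boundary.
local function beginFont(queue)
  queue[#queue + 1] = { is_hbox = true, width = 0 }
end

-- Hypothetical: what \unichar would do under the proposal.
local function appendUnichar(queue, text)
  local last = queue[#queue]
  if last and last.is_unshaped then
    -- Preceding node is unshaped text: merge, so a combining mark
    -- can still be shaped together with its base character.
    last.text = last.text .. text
  else
    -- Preceding node is an hbox, image, etc.: revert to the old
    -- behaviour and push a fresh unshaped text node.
    queue[#queue + 1] = { is_unshaped = true, text = text }
  end
end

local queue = { { is_unshaped = true, text = "a" } }
beginFont(queue)           -- boundary inserted by \font
appendUnichar(queue, "b")  -- cannot merge into "a": the hbox is in the way
appendUnichar(queue, "c")  -- merges into the node created just above
```

In this toy model the text produced inside the \font group ends up in its own node, separate from the text before the group, which is the behaviour the list is after.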
I see what you're saying, but I'm also leery of null hboxes. We've struggled so much already with pushback bugs stemming from things we stuffed in the node queue that don't directly correspond to input...
Sorry for finding the diacritics issue and making everything harder on you guys 🙃
If I didn't primarily use SILE to write Unicode requests, I wouldn't care, but I use invalid characters not found in any fonts in my writing all the time, in the hopes the Consortium will make them valid...
This is the sort of thing SILE is for, and finding strange edge cases makes design problems clearer!
Following your train of thought @simoncozens, I do see how that will probably work. However, I can't help but think the extra hboxes are bound to mess up something we try to do later. Stuffing things into the node list (which gets processed and re-processed for various reasons) seems like the most problematic place we could put a workaround. I wouldn't be so opposed if it were just something \unichar did as a workaround, but imposing the null box output on all \font instances seems like asking for trouble. We already have a couple of hbox stuffing routines, and in all cases there have been edge cases where they don't work right. We use them to lock in indentation (e.g. to differentiate them from margins that can be dropped on pushBack()) and to initialize page output (e.g. so that \vfill works at the top of a page). All of these situations are problematic already. Yet another overloading of empty boxes in the node tree that isn't distinguished from the other hacks seems like a ... recipe for more trouble down the road.
Personally I don't see how \unichar is a great option for anything, and I feel that entering real Unicode codepoints is the better route. It's so much easier to do the right thing when shaping if we have the right data from the get-go.
Clearly the current issue is a problem, but maybe reverting the workaround that tries to stuff this in out of order and just saying \unichar isn't compatible with combining characters is the right thing to do.
Thoughts?
I think it's much less of a problem if we make it so \unichar can accept multiple codepoints at once. You're going to break a lot of my documents if you revert the change altogether (of course I know that I can just distribute the current version together with the document). I think the much better solution is to just make it so that I can provide a list of code points rather than a single one.
First of all, you shouldn't ever need to distribute a whole version of SILE for something like this. Even if we were to drop \unichar entirely or make it work some other way, all you would need to do is copy the Lua function from a version you liked into your projects.
If allowing multiple codepoint sequences would solve this for your use case, that's a really interesting option, since it would both solve the combining problem and prevent the kind of splitting that happened in this issue.
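For illustration, here is roughly what turning a list of code points into one string could look like in plain Lua. This is a sketch under assumptions: the function names and the accepted input format (whitespace-separated `0x98a8`, `U+98A8`, or decimal tokens) are made up for the example, and nothing here is the existing unichar package code. The encoder is written by hand so that it does not depend on the `utf8` library, which Lua 5.1 and LuaJIT lack.

```lua
-- Encode a single Unicode code point as a UTF-8 byte string.
local function utf8encode(cp)
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + math.floor(cp / 0x40), 0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return string.char(0xE0 + math.floor(cp / 0x1000),
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  else
    return string.char(0xF0 + math.floor(cp / 0x40000),
                       0x80 + math.floor(cp / 0x1000) % 0x40,
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end

-- Hypothetical token syntax: "U+98A8", "0x98a8", or a decimal number.
local function parseCodepoint(token)
  local hex = token:match("^[Uu]%+(%x+)$")
  local cp = hex and tonumber(hex, 16) or tonumber(token)
  if not cp then
    error("invalid code point: " .. token)
  end
  return cp
end

-- Convert a whitespace-separated list of code points into ONE string,
-- so the whole sequence can be handed to the typesetter in one piece.
local function unicharSequence(input)
  local parts = {}
  for token in input:gmatch("%S+") do
    parts[#parts + 1] = utf8encode(parseCodepoint(token))
  end
  return table.concat(parts)
end

local wind = unicharSequence("0x98a8 0x304c 0x5439 0x304f")
```

Because a base character and its combining marks would arrive as a single string, there would be no need to reach back into a previously queued node.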
I meant current version of unichar.lua 🤔
I've been working on #865 and just found the trouble ... a zero hbox hack with unintended side effects. The hack is not in the pullquote package; it is in the base class, and the typesetter's default newPar() routine expects to find the hack.
Maybe what we need is some other kind of (unshaped) marker node type with extra metadata that could be used for this sort of purpose and actually specify different behaviors depending on what role the marker was intended for. Overloading empty hboxes is feeling uglier and uglier.
Isn't it just wrong to stack/combine characters in the previous nnode if it was pushed in a different font setting context than the current one?
https://github.com/sile-typesetter/sile/blob/2f5e4ffe7d95297bc0a18dbc8a2bd0131817e68c/packages/unichar/init.lua#L13
Suggested change:
if #hlist > 1 and hlist[#hlist].is_unshaped and pl.tablex.deepcompare(hlist[#hlist].options, SILE.font.loadDefaults({})) then
N.B. the fact that unshaped nnodes are pushed with their font context by the typesetter (in method setpar), while other nodes aren't, was raised in #1361 ... So I just remembered it when seeing this issue recently mentioned, and it seems to me that this provides the expected logical fix... but reviewing the previous comments about (possibly null) hboxes etc. makes me wonder what the unexplained rationale or the bigger picture was.
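To make the idea concrete outside of SILE, here is a small stand-alone Lua sketch of "only merge into the previous unshaped node when its font options match the current ones". It assumes flat option tables and uses a hand-written `sameOptions` helper as a stand-in for `pl.tablex.deepcompare`; `pushUnichar` and the node layout are hypothetical, not the package's real code.

```lua
-- Stand-in for pl.tablex.deepcompare, valid for flat option tables only.
local function sameOptions(a, b)
  for k, v in pairs(a) do
    if b[k] ~= v then return false end
  end
  for k in pairs(b) do
    if a[k] == nil then return false end
  end
  return true
end

-- Hypothetical: append text to the last node only if that node is unshaped
-- AND was pushed under the same font options as the current context.
local function pushUnichar(hlist, text, currentFont)
  local last = hlist[#hlist]
  if last and last.is_unshaped and sameOptions(last.options, currentFont) then
    last.text = last.text .. text
  else
    hlist[#hlist + 1] = { is_unshaped = true, text = text, options = currentFont }
  end
end

local hlist = {
  { is_unshaped = true, text = "x", options = { family = "Noto Serif CJK JP", size = 10 } },
}
-- Same font context: merged into the existing node.
pushUnichar(hlist, "y", { family = "Noto Serif CJK JP", size = 10 })
-- Different size (as inside \font[size=2em]{...}): a new node is created.
pushUnichar(hlist, "z", { family = "Noto Serif CJK JP", size = 20 })
```

With this rule a combining mark still joins its base character in the common case, while text emitted under a different \font setting stays in its own node and keeps its own options.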