\font'ing a \unichar doesn't work; \font'ing its UTF-8 representation does.

SILE v0.10.9.r36-gf16af44, Xubuntu 20.04

\begin[class=book]{document}
\script[src=packages/unichar]
\set[parameter=font.family,value=Noto Serif CJK JP,makedefault=true]
Compare:
\par 風 \font[size=2em]{風}
\par \unichar{0x98a8} \font[size=2em]{\unichar{0x98a8}}
\bigskip
And:
\par 風が吹く \font[size=2em]{風が吹く}
\par \unichar{0x98a8}\unichar{0x304c}\unichar{0x5439}\unichar{0x304f} \font[size=2em]{\unichar{0x98a8}\unichar{0x304c}\unichar{0x5439}\unichar{0x304f}}
\bigskip
Both sets of lines should be equivalent, but are not.
\end{document}

:confounded:

Jul 29 '20 05:07 ctrlcctrlv

Oh, joy. To make it even more fun for you all, this is a regression.

SILE v0.10.5:

I believe the most likely culprit to be @simoncozens' attempted fix to #979, 712bc925dfc1601111922d4bd9089ad161867020.

Jul 29 '20 05:07 ctrlcctrlv

Oh that makes sense. Yes, it would be.

Jul 29 '20 05:07 simoncozens

Since we stuffed \unichar-generated code points into the preceding run of nodes so that combining diacritics work, they don't fall in the group being styled. It's as if text rendering hates us.

I'm not quite sure what the answer is. Either we have to parse and handle combining marks separately from other characters, end up with side effects like this, just not support combining at all, or handle these kinds of inputs at a completely different point in the parsing sequence. Honestly the whole \unichar thing doesn't feel quite right to me, but I don't know what the answer is.

Jul 30 '20 07:07 alerque

I think I’ve got it. Follow my train of thought here:

Maybe a call to \font should create a new “text context”, perhaps by adding a new null hbox.
Wait, that won’t work, as \font{\unichar will try to add its generated text to an hbox, which will go horribly wrong.
Come to think of it, we already have that problem now if someone does e.g. \img\unichar.
So maybe \unichar should test if the preceding node is unshaped and add text to it if so, but revert to the old behaviour if not.
If we do that, and add a null hbox at the start of \font processing, it should all work.
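
A toy model of that idea, using plain Lua tables in place of SILE's real node classes. Every name here (`unshaped`, `null_hbox`, `push_unichar`, `font_group`) is hypothetical and only illustrates the merge-or-append logic, not SILE's actual API:

```lua
-- Toy model only: plain tables stand in for SILE's node queue.
local function unshaped(text) return { kind = "unshaped", text = text } end
local function null_hbox() return { kind = "hbox", width = 0 } end

-- Hypothetical \unichar: merge into the preceding node only if it is
-- unshaped text; otherwise fall back to pushing a new node.
local function push_unichar(queue, char)
  local last = queue[#queue]
  if last and last.kind == "unshaped" then
    last.text = last.text .. char        -- merge so combining marks attach
  else
    queue[#queue + 1] = unshaped(char)   -- old behaviour: start a new node
  end
end

-- Hypothetical \font: optionally opens with a null hbox as a barrier.
local function font_group(queue, with_barrier, body)
  if with_barrier then queue[#queue + 1] = null_hbox() end
  body(queue)
end

local function run(with_barrier)
  local queue = { unshaped("A ") }
  font_group(queue, with_barrier, function(q) push_unichar(q, "B") end)
  return queue
end

local leaky = run(false)  -- "B" leaks into the node created outside \font
local fixed = run(true)   -- the null hbox keeps "B" in its own node

print(#leaky, leaky[1].text)
print(#fixed, fixed[3].text)
```

Without the barrier the queue ends up as a single node holding "A B"; with it there are three nodes and "B" sits in its own unshaped node after the hbox.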

Jul 30 '20 07:07 simoncozens

I see what you're saying, but I'm also leery of null hboxes. We've struggled so much already with pushback bugs stemming from things we stuffed in the node queue that don't directly correspond to input...

Jul 30 '20 07:07 alerque

Sorry for finding the diacritics issue and making everything harder on you guys 🙃

If I didn't primarily use SILE to write Unicode requests, I wouldn't care, but I use invalid characters not found in any fonts in my writing all the time, in the hopes the Consortium will make them valid...

Jul 30 '20 20:07 ctrlcctrlv

This is the sort of thing SILE is for, and finding strange edge cases makes design problems clearer!

Jul 31 '20 05:07 simoncozens

Following your train of thought @simoncozens I do see how that will probably work. However I can't help but think the extra hboxes are bound to mess up something we try to do later. Stuffing things into the node list (which gets processed and re-processed for various reasons) seems like the most problematic place we could put a workaround. I wouldn't be so opposed if it was just something \unichar did as a workaround, but imposing the null box output on all \font instances seems like asking for trouble. We already have a couple of hbox stuffing routines and in all cases there have been edge cases where they don't work right. We use them to lock in indentation (e.g. to differentiate them from margins that can be dropped on pushBack()) and to initialize page output (e.g. so that \vfill works at the top of a page). All of these situations are problematic already. Yet another overloading of empty boxes in the node tree that aren't distinguished from other hacks seems like a ... recipe for more trouble down the road.

Personally I don't see how \unichar is a great option for anything, and I feel that entering real Unicode codepoints is the better approach. It's so much easier to do the right thing when shaping if we have the right data from the get-go.

Clearly the current issue is a problem, but maybe reverting the workaround that tries to stuff this in out of order and just saying \unichar isn't compatible with combining characters is the right thing to do.

Thoughts?

Aug 10 '20 18:08 alerque

I think it's much less of a problem if we make it so \unichar can accept multiple codepoints at once. You're going to break a lot of my documents if you revert the change altogether (of course I know that I can just distribute the current version together with the document). I think the much better solution is to just make it so that I can provide a list of code points rather than a single one.

Aug 10 '20 20:08 ctrlcctrlv

First of all, you shouldn't ever need to distribute a whole version of SILE for something like this. Even if we were to drop \unichar entirely or make it work some other way, all you would need to do is copy the Lua function from a version you liked into your projects.

If allowing multiple-codepoint sequences would solve this for your use case, that's a really interesting option, since it would both solve the combining problem and prevent the kind of splitting that happened in this issue.
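
A minimal sketch of what accepting a list of code points could look like, assuming Lua 5.3+ for the standard `utf8` library. `parse_codepoints` is a hypothetical helper, not part of SILE, and the accepted `0x…` / `U+…` spellings are an assumption for illustration:

```lua
-- Sketch only: turn a whitespace- or comma-separated list of hex
-- codepoints into one UTF-8 string, so the whole sequence can be
-- handed over as a single run of text.
local function parse_codepoints(spec)
  local codepoints = {}
  for token in spec:gmatch("[^%s,]+") do
    -- Strip an optional "U+" or "0x" prefix before parsing as hex.
    local hex = token:gsub("^[Uu]%+", ""):gsub("^0[xX]", "")
    local cp = tonumber(hex, 16)
    if not cp then error("not a hex codepoint: " .. token) end
    codepoints[#codepoints + 1] = cp
  end
  return codepoints
end

local function codepoints_to_text(spec)
  return utf8.char(table.unpack(parse_codepoints(spec)))
end

-- The sequence from the report above, given in one call.
local phrase = codepoints_to_text("0x98a8 0x304c 0x5439 0x304f")
-- A base letter plus a combining acute accent stays in one string.
local accented = codepoints_to_text("U+0065, U+0301")

print(phrase, utf8.len(phrase))
print(utf8.len(accented))
```

Because the base character and its combining marks arrive together, nothing would need to be merged into a previously pushed node.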

Aug 10 '20 20:08 alerque

I meant current version of unichar.lua 🤔

Aug 10 '20 20:08 ctrlcctrlv

I've been working on #865 and just found the trouble ... a zero hbox hack with unintended side effects. The hack is not in the pullquote package, it is in the base class and the typesetter's default newPar() routine expects to find the hack.

Maybe what we need is some other kind of (unshaped) marker node type with extra metadata that could be used for this sort of purpose and actually specify different behaviors depending on what role the marker was intended for. Overloading empty hboxes is feeling uglier and uglier.

Aug 11 '20 13:08 alerque

Isn't it just wrong to stack/combine characters in the previous nnode if it was pushed in a different font setting context than the current one?

https://github.com/sile-typesetter/sile/blob/2f5e4ffe7d95297bc0a18dbc8a2bd0131817e68c/packages/unichar/init.lua#L13

Suggested change:

    if #hlist > 1 and hlist[#hlist].is_unshaped and pl.tablex.deepcompare(hlist[#hlist].options, SILE.font.loadDefaults({})) then
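
To illustrate the intent of that check, here is a toy model with plain Lua tables. It is a simplification under stated assumptions: `same_options`, `push_unichar` and the `current_font` argument are hypothetical stand-ins (for `pl.tablex.deepcompare` and the current font defaults respectively), and only flat option tables are compared:

```lua
-- Toy model only: merge a \unichar into the previous node only when
-- that node was pushed under the same font options.
local function same_options(a, b)
  for k, v in pairs(a) do
    if b[k] ~= v then return false end
  end
  for k in pairs(b) do
    if a[k] == nil then return false end
  end
  return true
end

local function copy(t)
  local out = {}
  for k, v in pairs(t) do out[k] = v end
  return out
end

local function push_unichar(hlist, char, current_font)
  local last = hlist[#hlist]
  if last and last.is_unshaped and same_options(last.options, current_font) then
    last.text = last.text .. char   -- same font context: safe to combine
  else
    hlist[#hlist + 1] = {
      is_unshaped = true, text = char, options = copy(current_font),
    }
  end
end

local body  = { family = "Noto Serif CJK JP", size = 10 }
local large = { family = "Noto Serif CJK JP", size = 20 }

local hlist = { { is_unshaped = true, text = "e", options = copy(body) } }
push_unichar(hlist, "\u{301}", body)   -- combining accent, same font: merged
push_unichar(hlist, "\u{98a8}", large) -- different size: gets its own node

print(#hlist, hlist[2].options.size)
```

The accent joins the existing node, while the character pushed under the larger size lands in a second node that carries its own options.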

Sep 01 '22 11:09 Omikhleia

N.B. the fact that unshaped nnodes are pushed with their font context by the typesetter (in method setpar), while other nodes aren't, was raised in #1361 ... So I just remembered it when seeing this issue recently mentioned, and it seems to me that this provides the expected logical fix... but reviewing the previous comments about (possibly null) hboxes etc. makes me wonder what the unexplained rationale or the bigger picture was.

Sep 01 '22 14:09 Omikhleia
