LaTeXML icon indicating copy to clipboard operation
LaTeXML copied to clipboard

comment becomes argument to later macro

Open teepeemm opened this issue 9 months ago • 5 comments

This is something I've come across while working on #1734. Compiling the following

\documentclass{article}
\begin{filecontents*}[overwrite]{\jobname.latexml}
DefMacro('\dumpArg {}', sub {
  my ($gullet, $arg) = @_;
  print STDERR "args:\n";
  foreach ( @{$arg} ) {
    print STDERR $_->getCatcode . ': ' . $_->toString . "\n"; }
  return $_[1]; });
1;
\end{filecontents*}
\providecommand{\dumpArg}[1]{#1}
\begin{document}
previous line
\dumpArg{1}
%previous line
\dumpArg{2}
done
\end{document}

with LaTeX yields the pdf "previous line 1 2 done" (and creates the latexml file). Compiling with latexml --nocomments creates the xml

    <p>previous line
1
2
done</p>

and stderr

args:
12: 1
args:
12: 2

Compiling with comments, however, creates the xml

    <p><!--  %previous line -->previous line
1
2
done</p>

so that the comment has moved to an unexpected place. Even worse, stderr shows

args:
12: 1
args:
14: %previous line
12: 2

so that the "previous line" comment became the first argument to the subsequent macro.

I'm not sure why this is happening. I don't think it generally affects things, since it appears you have to be accessing the arguments in a non-standard way. But siunitx accesses the arguments in a non-standard way, so I wanted to ask if this is expected and we need to modify siunitx. (And I assume --nocomments is why we've not noticed this before, since that removes the behavior. Otherwise, t/complex/si.tex doesn't compile.)

teepeemm avatar Feb 26 '25 20:02 teepeemm

so that the comment has moved to an unexpected place.

This may be a minor XML nit. If you had any markup around the plain text, e.g. using \emph around "previous line", the comment gets deposited in the correct time and place. I suspect specifically the handling of text nodes leads to the comments getting flushed up to the start of the parent element.

dginev avatar Feb 26 '25 21:02 dginev

I can reproduce the comment token getting picked up in the argument slot for the following macro call. We should likely guard in the reader that pending comments get inserted in the document, or at least kept deferred?

[aside] You're using a very curious details with \jobname.latexml. That (hidden?) feature doesn't seem to work with \jobname.ltxml which would have been my preferred extension.

Thank you for the report and the great example!

dginev avatar Feb 26 '25 21:02 dginev

[My interpretation is that the advantage of .latexml is the file is automatically preloaded without having to worry about command line arguments. With .ltxml, you need to --preload the file.]

teepeemm avatar Feb 26 '25 21:02 teepeemm

Sure, but it's the \jobname. part of that feature that is the key piece, the extension could have been .ltxml. Maybe we can extend the feature to also cover \jobname.ltxml.

Edit: To clarify, "that feature" refers to "automatically picking up and loading a binding file named afther the current \jobname". I am only observing the difference between .latexml and .ltxml as extensions, and was trying to add a note to consider also supporting the .ltxml extension -- it didn't work for me when I tried.

dginev avatar Feb 26 '25 21:02 dginev

I think I'm not understanding what you mean by "that feature". OTOH, that doesn't seem related to the issue report.

teepeemm avatar Feb 26 '25 22:02 teepeemm

This appears to have regressed further on the latest latexml branch, adding a mark. If I can recover it I'll also add the motivating example as a test.

dginev avatar May 12 '25 10:05 dginev

First comments about \jobname.latexml, which is both confusing and a red-herring :>. I think @dginev tends to forget about this .latexml extension "feature", but this file is handy to customize the processing of a specific document w/o the mess of introducing a style file and binding, and then using \usepackage. This file gets loaded before processing the file, so the virtual file LaTeXML would create from the filecontents wouldn't have been seen anyway... except that the previous run by pdflatex did create the real file so that it gets used.

This would have been a big security hole, but LaTeXML doesn't consult these virtual files for .ltxml (or .latexml).

brucemiller avatar Sep 10 '25 14:09 brucemiller

Regarding carrying comment tokens through the processing (where TeX itself discards them immediately): It may well be the case that this "feature" is more of a "misfeature", as it has caused quite a bit of debugging hassle; but I kinda hate ripping it all out at this point.

For your test case, the extra token may well be "unexpected", but depending on how you deal with the argument, not necessarily "wrong". I've recently made a variety of patches to deal better with these unexpected tokens, and specifically in siunitx. So it would be good to know if there are still any siunitx issues specifically related to the comment tokens (I know there are still versioning issues). If not, I'd rather avoid adding any extra layers of conditionals as suggested by #2632.

brucemiller avatar Sep 10 '25 15:09 brucemiller

Out of curiosity, what extra layers of conditionals are suggested by #2632 ? It only suggests not capturing comments as part of readBalanced.

dginev avatar Sep 10 '25 15:09 dginev

Sorry, I guess I was thinking of the extra flags in #2630, which is unrelated. The thing with #2632 is that readBalanced is almost the last resort for the comments to come out anywhere related to their origin. If they get deferred any further, the probably should just be dropped altogether. Which may be a good thing, but ...

brucemiller avatar Sep 10 '25 17:09 brucemiller

I see. I guess I would need more examples - the PR deposits the comment in the exact place as the master LaTeXML does, for this issue's example. It ought to be whenever the next top-level "digestNextBody" call takes place.

dginev avatar Sep 10 '25 17:09 dginev

Interesting; so it still comes out soonish. But what does #2632 actually fix? Anything in siunitx? The current issue may be a surprise, but isn't necessarily a bug or regression. Deferring it, even to digestNextBody, will still give you a comment that may eventually surprise.

brucemiller avatar Sep 10 '25 18:09 brucemiller

Right. My understanding of the "wrongness" worth correcting was mostly about including "nearby" comments external to a macro argument into the interior content of the argument. That sort of context expansion means any macro can have a comment in its arguments at any time, depending on the author. Which feels a bit scary.

Instead, my PR skips the part where pending comments are added in. The internal comments read by readBalanced are still added.

And when digestNextBody grabs the comment from the pending stack, it won't pass it to any internal expansion logic, but it will directly absorb it "somewhere later" in the document. Which... is good enough? Maybe?

dginev avatar Sep 10 '25 18:09 dginev

Your patch will pass comments through to macros less often, but not "never", and they can still end up in primitive or constructor arguments, as well as internally "digested" things; they don't immediately & directly go into the XML.

brucemiller avatar Sep 10 '25 19:09 brucemiller