WeasyPrint Support XML input and <?xml-stylesheet?> instructions

WeasyPrint already "accidentally" supports XML, because if you give an XML document to html5lib, it assumes that it's a fragment of malformed HTML and wraps it in <html> and <body> tags.

This is pretty handy since you don't actually have to needlessly apply XSLT to produce HTML before rendering it to PDF. It's also a super useful thing to be able to do, because the usual solutions for transforming XML to PDF are closed source and horrendously expensive.

But it's not clear how robust it is. So, since html5lib gives you an ElementTree.Element anyway, and since cssselect2 explicitly supports XML already, why not just parse XML as XML? Seems easy enough, here is a draft PR to do that.

Lacking:

- tests, tests, and more tests
- doesn't also support XSLT (XSLT for extraction/organization + CSS for layout is a fairly obvious use case) - this would require an lxml dependency
- does it support namespaces? probably not, because I do not support namespaces (in the sense of "je ne les supporte pas")
- UA stylesheets for HTML5 are probably mostly irrelevant or inappropriate for XML
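
For readers wondering what handling <?xml-stylesheet?> involves, here is a minimal sketch using only the Python standard library. It is not the code from the draft PR: the function name `find_xml_stylesheets` and the sample document are made up for illustration, and the regular expression is a simplification that does not resolve character or entity references inside pseudo-attribute values.

```python
import re
from xml.dom import Node, minidom

# Pseudo-attributes look like name="value" or name='value'.
_PSEUDO_ATTR = re.compile(r"""([\w-]+)\s*=\s*(["'])(.*?)\2""")


def find_xml_stylesheets(xml_text):
    """Return one dict of pseudo-attributes per <?xml-stylesheet?> instruction."""
    document = minidom.parseString(xml_text)
    stylesheets = []
    # Stylesheet instructions belong to the prolog, so only look at the
    # children of the document node that come before the root element.
    for node in document.childNodes:
        if node.nodeType == Node.ELEMENT_NODE:
            break
        if (node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            stylesheets.append(
                {name: value
                 for name, _quote, value in _PSEUDO_ATTR.findall(node.data)}
            )
    return stylesheets


# Hypothetical sample document, for illustration only.
SAMPLE = """<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="style.css"?>
<report><title>Hello</title></report>"""

print(find_xml_stylesheets(SAMPLE))
# → [{'type': 'text/css', 'href': 'style.css'}]
```

The `href` values found this way would still have to be resolved against the document's base URL and fetched before they could be fed to the cascade.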

Aug 18 '24 20:08 dhdaines

Hi!

Thanks a lot for the PR and for the description.

WeasyPrint already "accidentally" supports XML, because if you give an XML document to html5lib, it assumes that it's a fragment of malformed HTML and wraps it in <html> and <body> tags.

It’s a bit more than an accident: it’s because HTML is a smart specification and html5lib is a great implementation :smile:.

This is pretty handy since you don't actually have to needlessly apply XSLT to produce HTML before rendering it to PDF. It's also a super useful thing to be able to do, because the usual solutions for transforming XML to PDF are closed source and horrendously expensive.

I agree, it’s technically a great thing to do, but…

tests, tests, and more tests

…I suspect that we’ll find quite a lot of "small" bugs, and…

UA stylesheets for HTML5 are probably mostly irrelevant or inappropriate for XML

…and we never found the time to clean them up for HTML (and I already tried many times).

doesn't also support XSLT (XSLT for extraction/organization + CSS for layout is a fairly obvious use case) - this would require an lxml dependency

Oh my…

does it support namespaces? probably not, because I do not support namespaces (in the sense of "je ne les supporte pas")

Nobody supports them… except the dozens of XML gurus who will open these legit bug reports just when the feature is announced.

"Namespaces with text encoded as windows-1252 hangs when tags contain entity-escaped slashes"

"Endless loop when URL in xml-stylesheet references a fragment of an embedded SVG use tag"

"Strange lxml crash in C code when CDATA contains escaped null characters separated by UTF8 surrogates"

"Please support HTML embedded in SVG2-like XML documents! (🙏🏽 nested CSS cascade required too 🙏🏽)"

And of course the PR just after:

"Rewrite cssselect2 in Rust for speed (only 43 tests failing, will fix them next month in another PR) please release now!!"

😱

More seriously, even if it’s a good idea, I doubt that we would be able to provide solid support for this feature.

Let’s face it: html5lib has been abandoned for years. Maintaining clean HTML5 support is so complex that nobody wants to do it anymore. We can’t maintain html5lib itself, but we’re currently thinking about taking care of a simplified and modernized fork.

And it’s frightening. We already maintain our own implementations of a CSS parser, a CSS cascade engine, a CSS layout engine, and a PDF generation library. The next steps are the HTML parser and the web encodings manager (another unmaintained dependency).

Let’s keep XML and XSLT for the next decade? 😄

Aug 18 '24 22:08 liZe

More seriously, even if it’s a good idea, I doubt that we would be able to provide solid support for this feature.

For this reason alone, it makes sense not to accept this PR! If the goal is robustness then an XSLT transformation to (X)HTML, then rendering from HTML to PDF (which is what I'm currently doing) is definitely the most robust solution, which doesn't impose any extra dependencies or maintenance burden on you, the already overburdened maintainer of the transformation from HTML5 (a good standard, unlike XML) to PDF (not a good standard, but... the one we have).

I hacked this together in 15 minutes because I was curious to see if it worked, but feel free to close it! That said, what I actually want here is to be able to pass an ElementTree.Element along with some external CSS to the WeasyPrint rendering code. Is this already possible? Could it be a subclass/method on weasyprint.HTML?

Now that I think of it, I can just do this by subclassing weasyprint.HTML in my code (not yours). But it might be an easy addition to the API that is not too difficult to support.
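
Short of subclassing, one way to get an ElementTree.Element plus external CSS into the renderer is to serialize the element and hand the string to the documented `HTML(string=...)`, `CSS(string=...)` and `write_pdf(..., stylesheets=[...])` calls. This is only a sketch of that workaround, not what the draft PR does: `element_to_markup` and `render_element` are hypothetical helper names, and the markup is re-parsed by the HTML parser, so XML-specific details may not survive the round trip.

```python
import xml.etree.ElementTree as ET


def element_to_markup(element):
    """Serialize an element to a string an HTML parser can read back."""
    # Write <b></b> rather than <b/>: an HTML parser ignores the
    # self-closing slash on unknown elements and would leave them open.
    return ET.tostring(element, encoding="unicode", short_empty_elements=False)


def render_element(element, css_text, target):
    """Render an element to PDF through WeasyPrint's public API."""
    # Imported lazily so the serialization helper works without WeasyPrint.
    from weasyprint import CSS, HTML

    html = HTML(string=element_to_markup(element))
    html.write_pdf(target, stylesheets=[CSS(string=css_text)])


# Hypothetical sample tree, for illustration only.
root = ET.fromstring("<report><title>Hello</title><rule/></report>")
print(element_to_markup(root))
# → <report><title>Hello</title><rule></rule></report>
```

A call such as `render_element(root, "title { display: block; font-size: 2em }", "report.pdf")` would then produce the PDF. The stylesheet has to set `display` itself, since unknown elements are inline by default.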

Aug 18 '24 23:08 dhdaines

"Namespaces with text encoded as windows-1252 hangs when tags contain entity-escaped slashes"

"Endless loop when URL in xml-stylesheet references a fragment of an embedded SVG use tag"

"Strange lxml crash in C code when CDATA contains escaped null characters separated by UTF8 surrogates"

"Please support HTML embedded in SVG2-like XML documents! (🙏🏽 nested CSS cascade required too 🙏🏽)"

😃😆🤣 (there, a few surrogate pairs for you...!)

Aug 18 '24 23:08 dhdaines

For this reason alone, it makes sense not to accept this PR! If the goal is robustness then an XSLT transformation to (X)HTML, then rendering from HTML to PDF (which is what I'm currently doing) is definitely the most robust solution, which doesn't impose any extra dependencies or maintenance burden on you, the already overburdened maintainer of the transformation from HTML5 (a good standard, unlike XML) to PDF (not a good standard, but... the one we have).

:smile:

I hacked this together in 15 minutes because I was curious to see if it worked, but feel free to close it! That said, what I actually want here is to be able to pass an ElementTree.Element along with some external CSS to the WeasyPrint rendering code. Is this already possible? Could it be a subclass/method on weasyprint.HTML?

HTML is already a parameter of __main__.main so that we can easily test the CLI (see tests.testing_utils.FakeHTML), so I feel that everything you want is already possible.

Now that I think of it, I can just do this by subclassing weasyprint.HTML in my code (not yours). But it might be an easy addition to the API that is not too difficult to support.

Adding a "generic" API is difficult in this case, as everybody has different needs and wants to add just this little option that fits their needs. We already did this in another well-known library, and it didn’t end so well :smile:. We’d like to avoid this in WeasyPrint. One of the ~dangerous~ nice features of Python is that you can monkey-patch almost everything, and WeasyPrint’s unofficial internal API is often really stable. There’s no reason to only endure the speed penalty of dynamic languages, so let’s have some fun!

Sep 25 '24 20:09 liZe

HTML is already a parameter of __main__.main so that we can easily test the CLI (see tests.testing_utils.FakeHTML), so I feel that everything you want is already possible.

Yes, exactly! I don't see a great need for explicit XML support in WeasyPrint. I probably should have closed this myself, but thanks :)

Sep 26 '24 20:09 dhdaines