Use gotenberg for HTML to PDF conversion

Open dhimmel opened this issue 4 years ago • 9 comments
Originally mentioned by @agitter at https://github.com/manubot/rootstock/issues/393#issuecomment-733068778, gotenberg is a:

Docker-powered stateless API for converting HTML, Markdown and Office documents to PDF

Since we're looking at replacing athenapdf with pagedjs-cli in https://github.com/manubot/rootstock/issues/394, it also makes sense to evaluate gotenberg.

Links:

  • https://github.com/thecodingmachine/gotenberg
  • https://thecodingmachine.github.io/gotenberg/
  • https://hub.docker.com/r/thecodingmachine/gotenberg

dhimmel avatar Nov 29 '20 13:11 dhimmel

One challenge is that the Docker image is large: 844 MB for thecodingmachine/gotenberg:6.3.1, compared to 291 MB for arachnysdocker/athenapdf:2.16.0.

dhimmel avatar Nov 29 '20 13:11 dhimmel

Conversion from a URL

First, run the Docker container:

docker run --rm --publish 3000:3000 thecodingmachine/gotenberg:6.3

Second, make an API call to export the manuscript:

curl --request POST \
    --url http://localhost:3000/convert/url \
    --header 'Content-Type: multipart/form-data' \
    --form remoteURL=https://manubot.github.io/rootstock/v/97b294802ffcd39071b6e5b8ab59f60faf4be118/ \
    --output output/gotenberg.pdf

The result at gotenberg.pdf looks good (similar to the athenapdf output).
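
The two steps above could be wrapped in a small shell script. This is only a sketch: the function names `gotenberg_start` and `gotenberg_url_to_pdf` and the `GOTENBERG_IMAGE` / `GOTENBERG_PORT` variables are made up for illustration and are not part of gotenberg or rootstock. It assumes the 6.x `/convert/url` endpoint used above, and it leans on curl's retry options to ride out most of the container's startup delay.

```shell
# Hypothetical settings (names chosen for this sketch only).
GOTENBERG_IMAGE="${GOTENBERG_IMAGE:-thecodingmachine/gotenberg:6.3}"
GOTENBERG_PORT="${GOTENBERG_PORT:-3000}"

# Start gotenberg in the background; docker prints the container ID.
gotenberg_start() {
  docker run --rm --detach \
    --publish "${GOTENBERG_PORT}:3000" \
    "${GOTENBERG_IMAGE}"
}

# Convert the web page at URL $1 into a PDF written to path $2.
gotenberg_url_to_pdf() {
  local url="$1"
  local output="$2"
  # Make sure the output directory exists before curl writes to it.
  mkdir -p "$(dirname "$output")"
  # --form already makes curl send multipart/form-data, so no explicit
  # Content-Type header is set here. The retry flags re-attempt the
  # request while the freshly started container is still refusing
  # connections.
  curl --silent --show-error --fail \
    --retry 10 --retry-delay 1 --retry-connrefused \
    --request POST \
    --url "http://localhost:${GOTENBERG_PORT}/convert/url" \
    --form "remoteURL=${url}" \
    --output "$output"
}

# Example usage (commented out so that sourcing this file has no side effects):
#   container_id="$(gotenberg_start)"
#   gotenberg_url_to_pdf "https://manubot.github.io/rootstock/" output/gotenberg.pdf
#   docker stop "$container_id"
```

Keeping the container start and the conversion in separate functions means the same running container can serve several conversions before it is stopped.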

dhimmel avatar Nov 29 '20 14:11 dhimmel

@dhimmel It looks like manubot has settled on using WeasyPrint for HTML -> PDF conversion. Is this correct?

In my current manubot-like workflow (which is not manubot itself), I use pandoc to generate JATS XML from Markdown, and then generate HTML and PDF from the JATS XML as an independent stage. I'm starting to think that generating both HTML and PDF from the same JATS XML is a mistake. I'm now considering a single chain instead: JATS XML -> HTML -> PDF, using WeasyPrint.

Any advice?

(It's a long explanation why I'm not doing markdown directly to HTML).

castedo avatar Oct 06 '22 21:10 castedo

I'm revisiting this after @vincerubinetti pointed out that athenapdf has been archived in https://github.com/manubot/rootstock/issues/254#issuecomment-1569088082

It may be time to look more seriously into pagedjs-cli versus gotenberg as an athenapdf replacement. Based on @dhimmel's old comment above, it looks like gotenberg worked in initial testing. The latest gotenberg image, 7.8.3, is now somewhat smaller at 644 MB.

agitter avatar May 30 '23 21:05 agitter

FWIW, I've gone pretty far down the WeasyPrint path and gotten good results, in large part because I'm careful to use fairly old HTML/CSS features. An example is the PDF link on this page: https://popgen.es/H5NOlCVM9P5Vv4LbeuwJsaME8kM/1.1/ The PDF is generated by WeasyPrint from a subset of the webpage content.

I have decoupled much of the HTML/CSS implementation from the above example into a separate project: https://gitlab.com/castedo/printstrap/ to help others do similarly with WeasyPrint.

In particular you might be interested in the article.html example on the article branch: https://gitlab.com/castedo/printstrap/-/blob/article/example/article.html

castedo avatar May 30 '23 21:05 castedo

Also, a quick clarification: the article.html example in the article branch is actually much more advanced than the live example on popgen.es that I give above. The article.html example uses a 2-column format, somewhat like eLife articles, but is fully responsive, with the PDF corresponding directly to the HTML content at a particular screen width.

castedo avatar May 30 '23 21:05 castedo

This discussion might be helpful in evaluating Chromium vs not:

https://github.com/singlesourcepub/community/discussions/49

I've partly gone down the WeasyPrint path because I hesitate to rely on Chromium. I consider it an open question whether Chromium is the right tool for specialized HTML -> PDF conversion where the HTML is highly constrained and not really a full web page of a website.

castedo avatar May 30 '23 21:05 castedo

I consider it an open question whether Chromium is the right tool for specialized HTML -> PDF conversion where the HTML is highly constrained and not really a full web page of a website.

I think Chromium is probably necessary. We sometimes need to rely on "newer" CSS properties, like overflow-wrap and word-break, which are not supported in WeasyPrint. More importantly, we sometimes need to rely on JavaScript execution, as in the attributes plugin's way of merging table cells together. You could argue that we should find ways to do as much as possible statically at build time, without JavaScript, but it would be a significant effort.

vincerubinetti avatar May 30 '23 22:05 vincerubinetti

Maybe we should also emphasize somewhere in the docs that, as a last resort, one can manually print to PDF from the HTML version in any major browser.

vincerubinetti avatar May 31 '23 17:05 vincerubinetti