rootstock
Switch from athenapdf to pagedjs-cli for HTML to PDF conversion
Athenapdf has worked well but has two problems:
- it appears to no longer be maintained
- it requires docker and has started to hit docker hub rate limits on CI https://github.com/manubot/rootstock/issues/393
From https://www.pagedjs.org/documentation/02-getting-started-with-paged-js/
The command line version of Paged.js uses a headless browser (a browser without any graphical interface) to generate a PDF. It can be run on the server to launch a headless Chromium in fully automated workflows. With the command line version, you don't need to call the Paged.js script in your document: it will be done automatically.
Links:
- https://www.npmjs.com/package/pagedjs-cli
- https://gitlab.pagedmedia.org/tools/pagedjs-cli
It looks like pagedjs-cli is installed via npm, with a Dockerfile available such that we could also create an image if needed.
First step is to see whether pagedjs-cli has conversion fidelity as good or better than athenapdf.
I tried a few quick tests with the pagedjs-cli Docker image from DockerHub, which corresponds to version 0.0.9. I was able to convert a toy HTML file that had a single header and a single paragraph.
However, it hangs if I try to convert manuscript.html from the rootstock output branch. The output is
✔ Loaded
◷ Rendering: Page 581
where the page count continued increasing indefinitely until I killed it after 20 min. There's a good chance I'm doing something wrong or that it would work better by building the Docker image locally using their latest version of pagedjs.
If anyone wants to test the Docker image, the executable is ./bin/paged, not pagedjs-cli.
I installed pagedjs-cli 0.1.1 from npm:
pagedjs-cli \
--page-size=A4 \
--inputs https://manubot.github.io/rootstock/v/97b294802ffcd39071b6e5b8ab59f60faf4be118/ \
--output output/pagedjs.pdf
Output:
✔ Loaded
✔ Rendering 10 pages took 1220.1599999971222 milliseconds.
✔ Generated
✔ Processed
✔ Saved to /home/dhimmel/Documents/repos/manubot-rootstock/output/pagedjs.pdf
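Since the earlier Docker test kept rendering pages until it was killed by hand, a CI build would probably want a time limit around this call. Below is a minimal Python sketch of that idea, not how the rootstock build actually does it. It reuses only the flags shown in the command above; the helper names `build_pagedjs_command` and `run_with_timeout` are made up for illustration.

```python
import subprocess
import sys


def build_pagedjs_command(input_url, output_path, page_size="A4"):
    """Return the argv list for the pagedjs-cli call shown in this thread."""
    return [
        "pagedjs-cli",
        f"--page-size={page_size}",
        "--inputs", input_url,
        "--output", output_path,
    ]


def run_with_timeout(command, timeout_seconds):
    """Run a command; return True on success, False on error exit or timeout.

    subprocess.run kills the child process when the timeout expires,
    so a runaway render does not keep the CI job alive.
    """
    try:
        subprocess.run(command, check=True, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        print(f"timed out after {timeout_seconds} s", file=sys.stderr)
        return False
    except subprocess.CalledProcessError as error:
        print(f"exited with status {error.returncode}", file=sys.stderr)
        return False
    return True
```

Usage would be something like `run_with_timeout(build_pagedjs_command(url, "output/pagedjs.pdf"), 300)`, where the 300 second limit is an arbitrary placeholder. If pagedjs-cli is not installed, `subprocess.run` raises `FileNotFoundError`, which this sketch deliberately lets propagate.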
Here's the rendered PDF: pagedjs.pdf. Compare to athenapdf PDF here generated from
https://github.com/manubot/rootstock/blob/97b294802ffcd39071b6e5b8ab59f60faf4be118/build/build.sh#L72-L75
Opened upstream issues for the problems:
Thanks for opening those issues. I’ll check in the morning, as I believe we already had the issue with MathJax/scripts and fixed it. The margins problem is new, we never saw it before, so I’ll look at that in the morning too.
Here's something I hadn't considered until now: writing our own pdf conversion. It actually might not be as hard as we think... Take a look at this library:
https://github.com/Richienb/pdfly/blob/master/index.js
All we really need to do is have a way to programmatically open an instance of chrome (e.g. via Puppeteer) and print a document.
https://github.com/westmonroe/pdf-puppeteer#readme (javascript) https://github.com/miyakogi/pyppeteer (python)
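To make the idea concrete, here is a rough Python sketch that skips Puppeteer entirely and calls Chrome's own command-line flags (`--headless` and `--print-to-pdf`) through `subprocess`. This is a swapped-in, simpler technique than the libraries linked above, and the binary names it searches for are assumptions that vary by system. The function names are invented for this example.

```python
import shutil
import subprocess

# Assumption: common executable names; adjust for the local system.
CANDIDATE_BINARIES = ("chromium", "chromium-browser", "google-chrome")


def find_chrome(candidates=CANDIDATE_BINARIES):
    """Return the path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_print_command(chrome_path, source, output_pdf):
    """Return the argv list that asks headless Chrome to print source to a PDF."""
    return [
        chrome_path,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={output_pdf}",
        source,  # a URL or a path to a local HTML file
    ]


def html_to_pdf(source, output_pdf, timeout_seconds=120):
    """Print source to output_pdf, raising if Chrome is missing or fails."""
    chrome = find_chrome()
    if chrome is None:
        raise FileNotFoundError("no Chrome/Chromium binary found on PATH")
    command = build_print_command(chrome, source, output_pdf)
    subprocess.run(command, check=True, timeout=timeout_seconds)
```

A call would look like `html_to_pdf("manuscript.html", "manuscript.pdf")`. This only produces whatever Chrome's built-in print engine renders, so it gives no more control over the printed layout than printing from the browser by hand.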
> All we really need to do is have a way to programmatically open an instance of chrome (e.g. via Puppeteer) and print a document.
That depends on how much functionality you’d like to support.
Having a headless browser that generates a PDF is one thing; supporting CSS print features is far more complex (page numbers, cross-references, footnotes, etc.; check the list here).
We’ve been working hard on the footnotes for the last 6 months or so, so we’re a little bit behind our timeline.
Especially as there is a CLI update in the works. The issues opened are the ones we want to check as soon as the footnotes are shipped.
Which features might you want to use?
> Having a headless browser that generates a PDF is one thing; supporting CSS print features is far more complex (page numbers, cross-references, footnotes, etc.; check the list here).
Yes, those features are difficult. Afaik we don't support those features yet, which is why I suggested using Puppeteer. But those features have been requested and are something that the team has wanted to support for a long time, so perhaps using Puppeteer wasn't a good suggestion in the long term. It could be something to switch to in the short term if Athena gives us problems though.
Fwiw, of that feature list, I believe the most requested ones were page numbers and footnotes.
> so perhaps using Puppeteer wasn't a good suggestion in the long term
It’s a good starting point to see what’s doable :) Paged.js uses Puppeteer to generate the PDF from a Paged.js preview in headless Chrome, so yes, that’s the right idea.
> Fwiw, of that feature list, I believe the most requested ones were page numbers and footnotes.
Awesome, we’re almost there with that. Page numbers already work fine (it’s easy to build a table of contents) :)
I’ll come back when our release is testable, so we’ll be able to help you if you wanna try it out.
> Here's something I hadn't considered until now: writing our own pdf conversion

I'd strongly prefer if we could piggyback on an existing project, as I don't think we want the responsibility of maintaining a converter. Athena has worked quite well, but is no longer maintained. I think HTML-to-PDF is a common enough conversion task that we should be able to find existing projects with long-term backing. Time might be best spent contributing features to existing projects if there are small blockers for Manubot's use case.
The pagedjs feature list looks impressive. And its affiliation with Cabbage Tree Labs, whose mission is to make publishing more open, is promising.
In my comment above, I linked to three issues that were potential blockers for Manubot to adopt pagedjs. I haven't gotten a reply on any of those issues. @julientaq is there a problem with notifications on the PagedMedia GitLab or insufficient developer bandwidth to respond to user feedback? We'd love to switch to pagedjs, and Manubot seems like an ideal use case for it, but we'll need the above issues looked at, as well as more confidence that the project will have the resources to deal with user requests and bug reports in a timely fashion.
Noting that the source code for pagedjs has been migrated from gitlab.pagedmedia.org to gitlab.coko.foundation, so the issue links above are broken. Here are updated links for these issues (although the original author and date metadata appears missing):
Interestingly, there is also a pagedjs github at https://github.com/pagedjs/pagedjs. Not clear if that repo or https://gitlab.coko.foundation/pagedjs/pagedjs is where contributions should occur. @fchasen (active contributor) might know? Also @fchasen any ability to look into the issues we posted?
Hi there!
I’m sorry, I completely missed your message (from last year; that’s not really acceptable, I’m sorry!)
So basically, our GitLab got completely screwed up by a couple of attacks and issues, and it was so silent that it wasn’t addressed for a while. The GitHub repo was supposed to be a way to handle issues and merge requests coming from different places, but it’s not working as we’d hoped (so long, interoperability :-/).
So yes, we’re back on Coko’s GitLab, which is the right place to manage your issues.
I’ll check your issues right now!
@dhimmel do you have an account on gitlab.coko.foundation, so I can add you to the issues?
> do you have an account on gitlab.coko.foundation
https://gitlab.coko.foundation/dhimmel
Please see this issue for another strong reason we need to abandon Athena:
https://github.com/greenelab/covid19-review/issues/1133
Key points:
- Athena is using Electron 3.0.5. The current version of Electron is 18.
- Electron 3.0.5 is using Chromium version 66.0.3359.181. The current version of Chrome is ~100.
- Something about combining @media only screen with a complex selector within it is causing an issue with the Chrome 66 print preview.