Make improvements to ePub

Open vdavez opened this issue 1 year ago • 8 comments

✨ Added some new functionality for ePub!

This commit aims at improving the ePub experience for CRS Reports.

It adds a new dependency (PyMuPDF) to handle PDF parsing and generates HTML from that PDF that is aimed at conversion to ePub.

I also made some changes so I could get everything to work locally.
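As a rough illustration of the PDF-to-HTML step described above, here is a minimal sketch, not the code from this commit. It assumes PyMuPDF's `Page.get_text("xhtml")` output mode is acceptable for ePub conversion; the function names `pdf_pages_to_html` and `build_html_document` and the `<section class="page">` wrapper are made up for the example.

```python
from html import escape


def pdf_pages_to_html(pdf_path, mode="xhtml"):
    """Return one HTML fragment per page of the PDF at pdf_path.

    Assumption: "xhtml" (simplified, reflowable markup) suits ePub better
    than "html" (absolutely positioned layout); either mode is valid.
    """
    import fitz  # PyMuPDF; imported lazily so the helper below works without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text(mode) for page in doc]


def build_html_document(title, page_fragments):
    """Wrap per-page fragments in a single HTML document (hypothetical helper)."""
    body = "\n".join(
        '<section class="page">%s</section>' % fragment
        for fragment in page_fragments
    )
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>%s</title></head>\n'
        "<body>\n%s\n</body></html>" % (escape(title), body)
    )
```

Something like `build_html_document("Report title", pdf_pages_to_html("report.pdf"))` would then produce a single HTML file that an ePub converter could take as input.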

vdavez avatar Mar 30 '23 11:03 vdavez

Neat!

When we stopped getting HTML data directly and started getting only PDFs, I hadn't thought about how that would impact the epub conversion.

Hopefully we'll get HTML data again some day, so it might make sense to preserve the old part under an if statement if an HTML format exists.

I also wonder whether this technique can be used to make a better HTML display of the reports on the site itself instead of using pdftohtml which has given not-so-good results from CRS PDFs?

JoshData avatar Mar 30 '23 15:03 JoshData

> Hopefully we'll get HTML data again some day, so it might make sense to preserve the old part under an if statement if an HTML format exists.

I can adjust that.
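The "if statement" idea could look roughly like the sketch below. It assumes each report version carries a list of format entries with `format` and `filename` keys; that shape, and the function name `choose_epub_input`, are assumptions for illustration rather than something confirmed in this thread.

```python
def choose_epub_input(formats):
    """Pick the input for ePub conversion, preferring HTML supplied at the source.

    `formats` is assumed to be a list of dicts such as
    {"format": "PDF", "filename": "files/report.pdf"}.
    Returns a (kind, filename) tuple, or None if neither format is present.
    """
    by_format = {entry["format"]: entry["filename"] for entry in formats}
    if "HTML" in by_format:
        # Old code path: convert the HTML we were given directly.
        return ("html", by_format["HTML"])
    if "PDF" in by_format:
        # New code path: generate HTML from the PDF first (e.g. with PyMuPDF).
        return ("pdf", by_format["PDF"])
    return None
```

With this arrangement the PDF route is only taken when no HTML exists, so the old behaviour comes back automatically if HTML data is ever provided again.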

> I also wonder whether this technique can be used to make a better HTML display of the reports on the site itself instead of using pdftohtml which has given not-so-good results from CRS PDFs?

It's possible to get "pixel perfect" HTML with pymupdf because it handles all of the layout stuff. Depending on the use case, maybe that's a good option?

vdavez avatar Mar 30 '23 16:03 vdavez

Created a ZIP file with sample ePUBs generated by the script. @DanielSchuman would you mind looking at them and letting me know whether you want any additional tweaks here?

Archive.zip

vdavez avatar Apr 01 '23 16:04 vdavez

@JoshData Assuming there aren't any major formatting changes we want, I think this is ready for review. As for the HTML generation, I think it's a good idea, but we'll probably want to treat it as a separate development branch, because it will likely need to modify process_incoming.py, and we'd want to add some logic around "source" HTML versus "generated" HTML when building the JSON file for the report. Thoughts?
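One way the "source" versus "generated" distinction might be recorded is sketched below. The `origin` field and the `add_html_format` helper are hypothetical; nothing in this thread says how process_incoming.py actually structures the report JSON.

```python
import json

VALID_ORIGINS = ("source", "generated")


def add_html_format(report, filename, origin):
    """Append an HTML format entry to a report record, tagged with its origin.

    "source" means the HTML was provided alongside the report;
    "generated" means it was derived from the PDF. The "origin" key is a
    made-up field name used only for this example.
    """
    if origin not in VALID_ORIGINS:
        raise ValueError("origin must be one of %r" % (VALID_ORIGINS,))
    report.setdefault("formats", []).append(
        {"format": "HTML", "filename": filename, "origin": origin}
    )
    return report


report = add_html_format({"id": "EXAMPLE"}, "files/example.html", "generated")
print(json.dumps(report, indent=2))
```

Downstream code could then check the tag and, for example, prefer a "source" entry over a "generated" one when both exist.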

vdavez avatar Apr 01 '23 16:04 vdavez

I'm trying to deploy this, but I think the AWS machine that builds the website each day may be too old for pymupdf: I'm getting an error installing it. The machine was spun up a long time ago and has Python 3.4. It's running Amazon Linux AMI 2018.03.

It's probably possible to install pymupdf from source (https://pymupdf.readthedocs.io/en/latest/installation.html), but I can't make time for that today (or soon, tbh).

JoshData avatar Jul 09 '23 15:07 JoshData

Let me know if we should move to a more modern machine. Happy to be available to work with you on it.

Thanks for looking at it, Josh.

DanielSchuman avatar Jul 09 '23 16:07 DanielSchuman

Moving to a newer machine doesn't have much upside other than this PR at the moment, and it comes with the risk that something else will break and need time to fix, so I'm not too enthusiastic about it.

JoshData avatar Jul 09 '23 21:07 JoshData

@vdavez We've moved EveryCRSReport to a new server. It now uses pymupdf to generate the HTML versions of new reports going forward (this is the HTML shown on the main page of every report). If it works, I'll regenerate all of the old ones, which were produced with pdftohtml. If all goes well, we can just feed this into the rest of your changes in this PR.

JoshData avatar Aug 12 '24 20:08 JoshData