Lurnby PDF parsing is poor

Open Roznoshchik opened this issue 2 years ago • 6 comments

The current pdf library leaves a lot to be desired.

It only works for simple pdfs with plain images and text.

Anything more complex that has graphs, charts, etc, comes through very poorly.

One idea is to just work with pdfs as images, and then possibly run OCR on the text content.

But there is a lot that needs to be explored there to render things properly so that it works with Lurnby.

Feb 12 '22 12:02 Roznoshchik

I have had great success with parsing very complex pdf to html using pdf2htmlex, especially this fork https://github.com/pdf2htmlEX/pdf2htmlEX (the original is unmaintained). Doesn't do ocr though.

Feb 14 '22 16:02 Artaud

Thanks @Artaud,

I'll try to play with this and see how it works. I think my biggest concern is how it would work on mobile, but I guess that should be secondary to actually having it work for the majority of inputs.

A brief look at some of the samples showed that on mobile there isn't any re-rendering; the whole page just shrinks to a tiny size.

Feb 14 '22 21:02 Roznoshchik

Looking at this closer, pdf2htmlEX does seem promising, but it's not a Python package, which means that using it on Heroku, where I'm currently hosting the app, would require some extra work.

I'm not sure how to compile C apps to run on Heroku, so the best bet seems to be to convert to a Docker deployment and deploy the Docker image to Heroku.

I've started that process, but it involves quite a lot of changes so will see how it goes.

Feb 23 '22 11:02 Roznoshchik

I was able to get pdf2htmlEX running on the docker container, but it's not working with some of the pdfs. Likely some missing font libraries.

But on closer look I realized that I was mostly able to get the same output using pymupdf which I was already using. I just wasn't using the automatic html conversion. I was building the html manually.
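For anyone following along, here is a rough sketch of what the automatic html conversion might look like with pymupdf. It assumes a reasonably recent PyMuPDF where `page.get_text("html")` is available; `wrap_pages` and `pdf_to_html` are hypothetical helper names for illustration, not Lurnby's actual code.

```python
from typing import List


def wrap_pages(page_fragments: List[str]) -> str:
    """Join per-page HTML fragments, wrapping each page in its own <section>."""
    sections = [
        f'<section class="pdf-page" data-page="{number}">\n{fragment}\n</section>'
        for number, fragment in enumerate(page_fragments, start=1)
    ]
    return "\n".join(sections)


def pdf_to_html(path: str) -> str:
    """Convert every page of a PDF to HTML using PyMuPDF's built-in HTML output."""
    # Imported lazily so wrap_pages can be used without PyMuPDF installed.
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        # get_text("html") returns one HTML fragment per page, with inline styles.
        fragments = [page.get_text("html") for page in doc]
    return wrap_pages(fragments)
```

Calling `pdf_to_html("some.pdf")` should return a single string with one section per page. The per-page wrapper is only an assumption about how a reader might want to paginate the output.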

And I remembered why I made that decision. Both pymupdf and pdf2htmlEX convert the pdf to html, but they do so with a lot of inline css to render the page exactly the same.

This kills many of Lurnby's reader functions like dark/light mode, font size adjustments, etc. And makes it a bit annoying to try and highlight text due to the way it's rendered. Removing the inline css also doesn't lead to great layouts.
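To make the "removing the inline css" experiment concrete, here is a minimal sketch that drops every `style` attribute from converted html using only the standard library. `StyleStripper` and `strip_inline_styles` are hypothetical names; the sketch ignores comments and doctypes and leaves `<style>` blocks untouched, so treat it as a starting point rather than a finished sanitizer.

```python
from html import escape
from html.parser import HTMLParser


class StyleStripper(HTMLParser):
    """Re-serialize HTML while dropping every inline style attribute."""

    def __init__(self):
        # Keep entity and character references as-is instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _emit_tag(self, tag, attrs, closing=""):
        kept = []
        for name, value in attrs:
            if name == "style":
                continue  # this is the attribute that pins the layout
            if value is None:
                kept.append(name)  # boolean attribute such as "hidden"
            else:
                kept.append(f'{name}="{escape(value, quote=True)}"')
        attr_text = (" " + " ".join(kept)) if kept else ""
        self.parts.append(f"<{tag}{attr_text}{closing}>")

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, closing="/")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_inline_styles(html: str) -> str:
    """Return the given HTML with all style="..." attributes removed."""
    parser = StyleStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

For example, `strip_inline_styles('<p style="left:10px" class="x">Hi</p>')` gives `<p class="x">Hi</p>`. As noted above, the result loses its absolute positioning, which is exactly why the layouts stop looking great.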

All of this is maybe fine, but the way in which I'm currently rendering the article content into the reader means that many pdfs, even those converted to html using those libraries, will completely break and destroy the page layout. To pursue that option, I would need to render a separate reader for pdfs to account for any changes.

Which isn't necessarily a bad thing. Just requires a lot more research and testing to determine if that's the best way forward or not.

Another not so great option that I'm considering is to work with pdfs in image format. pymupdf has an option to convert a pdf page to an image. This has its own drawbacks obviously. The text isn't selectable, it doesn't work for mobile and desktop, etc.
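A rough sketch of the page-to-image route, assuming a recent PyMuPDF with `page.get_pixmap` and `Pixmap.tobytes`. `zoom_for_dpi` and `render_pages_to_png` are hypothetical helpers, and the 144 dpi default is just a guess at a readable resolution.

```python
from typing import List


def zoom_for_dpi(dpi: int) -> float:
    """PDF pages are measured at 72 units per inch, so zoom is dpi / 72."""
    return dpi / 72.0


def render_pages_to_png(path: str, dpi: int = 144) -> List[bytes]:
    """Render each page of a PDF to PNG bytes at the requested resolution."""
    # Imported lazily so zoom_for_dpi can be used without PyMuPDF installed.
    import fitz  # PyMuPDF

    zoom = zoom_for_dpi(dpi)
    matrix = fitz.Matrix(zoom, zoom)  # scale both axes equally
    images = []
    with fitz.open(path) as doc:
        for page in doc:
            pix = page.get_pixmap(matrix=matrix)
            images.append(pix.tobytes("png"))
    return images
```

Each entry in the returned list could be written to disk or served as an image in the reader. A higher dpi gives sharper text at the cost of larger files, which matters on mobile.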

But, it aligns with another feature I'm considering which is the ability to highlight images.

I'm looking at incorporating Mozilla's screenshot library.

This would allow me to capture a portion of a page and then save that image. This way, an image pdf would possibly still be able to be annotated and worked on.

In short, looking at a bunch of seemingly suboptimal options.

Feb 24 '22 09:02 Roznoshchik

@Roznoshchik do we have any updates on this?

Jun 09 '23 08:06 ghost

No, unfortunately. I have been too busy to be able to do anything on this and the Readwise team has been killing it, so it hasn't felt like there was a strong need for this.

I personally haven't been reading too many pdfs either, so it hasn't been a priority.

Jun 09 '23 08:06 Roznoshchik
