Read two PDFs. Compare. Redline.

Open houfu opened this issue 3 years ago • 6 comments

What I want to do

Given two pdfs, read the text found on them, and produce a redline.

How I might be able to do this.

Using a PDF library like pdfminer, extract a list of paragraphs from each PDF and compare the two lists. Then produce a new PDF from the source and mark it up with the changes.
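A rough sketch of the extract-and-compare step, not a settled design. It assumes pdfminer.six (`extract_text` from `pdfminer.high_level`) for reading the text layer and the standard library's `difflib` for the comparison. The helper names (`extract_pdf_text`, `split_paragraphs`, `compare_paragraphs`) and the "blank line ends a paragraph" rule are made up for illustration. Marking the changes on a new PDF is not covered here.

```python
import difflib
import re


def extract_pdf_text(path):
    """Return the text layer of a PDF. Assumes `pip install pdfminer.six`."""
    # Imported lazily so the comparison helpers work without pdfminer installed.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def split_paragraphs(text):
    """Split on blank lines (an assumed heuristic) and collapse whitespace."""
    blocks = re.split(r"\n\s*\n", text)
    return [" ".join(block.split()) for block in blocks if block.strip()]


def compare_paragraphs(old, new):
    """Return (tag, old_paragraphs, new_paragraphs) for each changed region."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

With two files on disk (hypothetical paths), this might be used as `compare_paragraphs(split_paragraphs(extract_pdf_text("old.pdf")), split_paragraphs(extract_pdf_text("new.pdf")))`. Each tag is one of `replace`, `delete` or `insert`.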

Limitations

OCR is probably a future feature.
Layout changes might be a future feature.

Apr 11 '22 03:04 houfu

If you define a solid pipeline showing where this should live in the code, I can contribute the feature for mining and extracting text via OCR.

Jul 12 '23 07:07 HRNPH

In my mind this is probably a very important and big feature. What's the minimum feature set? Read and extract only the text (without formatting and pagination) and compare? 🤔
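For a sense of how small that minimum could be, here is a sketch that compares two already-extracted plain-text strings word by word and returns an inline redline. It uses only `difflib` from the standard library. The function name `redline` and the `[-deleted-]` / `{+inserted+}` markers are assumptions picked for illustration, not something this project defines.

```python
import difflib


def redline(old, new):
    """Mark word-level changes as [-deleted-] and {+inserted+}."""
    a, b = old.split(), new.split()
    out = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        # A "replace" is emitted as a deletion followed by an insertion.
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


print(redline("The fee is 100 dollars per month.",
              "The fee is 120 dollars per calendar month."))
# The fee is [-100-] {+120+} dollars per {+calendar+} month.
```

Splitting on whitespace throws away line breaks and pagination, which matches the "without formatting and pagination" scope.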

As for pipelines, the code probably needs a bit of refactoring first.

Jul 14 '23 00:07 houfu

I think we should only do this for text-based PDFs, using a PDF extractor, and not for image PDFs (see https://www.javatpoint.com/python-libraries-for-pdf-extraction ). If we use OCR it will be a waste of time, since the text still needs to be cleaned afterwards. Let's leave that extraction to other tools.

Jul 14 '23 14:07 HRNPH

@HRNPH The latest commit (#28) provides an example pipeline for files. Are you still interested in taking a stab at PDF files? Let me know your thoughts (including which PDF library you are thinking of using)!

Sep 10 '23 07:09 houfu

Now open to others to try before I do it myself lol.

Sep 21 '23 09:09 houfu