PDF files getting rejected in parse step

Open BakedJesus opened this issue 11 months ago • 4 comments

Out of a total of 4 files, 2 of my pdf files are being rejected. I've included one of the files as a sample.

My setup: Ubuntu 22.04, Anaconda with Python 3.10, and mongo and milvus started via docker compose up.

To reproduce, simply call library.add_files on the file I provided: sample_rejected.pdf

BakedJesus avatar Mar 14 '24 11:03 BakedJesus

@BakedJesus - we will take a look. It's a big file (500+ pages), but that should be fine. A couple of quick checks:

  1. Did you see an "encrypted file" readout on screen while parsing? That would be one explanation: our parser does not attempt to decrypt an encrypted PDF, and the file is skipped.
  2. Was there a segfault or other crash? (Unlikely, but not impossible that a book-sized file with a lot of images tripped something.)
  3. It looks like the file has a lot of scanned content. The parser does not apply OCR to that content, so it is also possible that much of the file was "skipped" and would need OCR to read the scanned images.
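
For the first check, a rough way to triage a folder of PDFs without the parser is to look at the raw bytes. This is only a heuristic sketch using the Python standard library: `triage_pdfs` is a hypothetical helper, not part of llmware, and finding an `/Encrypt` entry in the bytes suggests encryption but does not prove it.

```python
from pathlib import Path


def triage_pdfs(folder):
    """Map each *.pdf file name in `folder` to a short diagnostic string.

    Heuristic only: it inspects raw bytes and does not parse the PDF.
    """
    report = {}
    for path in sorted(Path(folder).glob("*.pdf")):
        data = path.read_bytes()
        # A PDF normally carries its %PDF- header near the start of the file.
        if b"%PDF-" not in data[:1024]:
            report[path.name] = "missing %PDF- header"
        # Encrypted PDFs reference an /Encrypt dictionary in the trailer.
        elif b"/Encrypt" in data:
            report[path.name] = "possibly encrypted"
        else:
            report[path.name] = "no obvious problem"
    return report
```

Calling `triage_pdfs("/path/to/pdf_dir")` on the folder that was parsed gives a quick per-file readout to compare against the rejected list.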

We will take a closer look at the file and come back.

doberst avatar Mar 14 '24 14:03 doberst

Hey @doberst, thanks for responding.

  1. No, the file wasn't encrypted and there wasn't a readout.
  2. No crashes. It just printed a list of rejected files.
  3. Funnily enough, the one pdf it did manage to get was completely scanned! Whereas the two it rejected were at least digital copies.

I'm pasting the output from calling the parser directly on the pdf directory:

    summary: pdf_parser - total pdf files processed - 4
    summary: pdf_parser - total input files received - 4
    summary: pdf_parser - total blocks created - 2700
    summary: pdf_parser - total images created - 0
    summary: pdf_parser - total tables created - 24
    summary: pdf_parser - total pages added - 651
    summary: pdf_parser - PDF Processing - Finished - time elapsed - 3.291197
    update: pdf_parser - Completed Parsing - processing time - 3.291197
    {'processed_files': [*HERE IT LISTS ACCEPTED FILES*], 'rejected_files': [*HERE IT LISTS THE TWO REJECTED FILES*], 'duplicate_files': []}

Do you have any suggestion on how I can debug this issue?
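
One way to narrow this down is to re-run the parser on each rejected file in isolation, so that any readout on screen can be tied to a single file. The sketch below only prepares the folders, using the standard library. `isolate_rejected` is a hypothetical helper, and it assumes the entries in `rejected_files` are file names (or paths ending in the file name) inside the source directory.

```python
import shutil
from pathlib import Path


def isolate_rejected(source_dir, rejected_files, work_dir):
    """Copy each rejected file into its own subfolder of `work_dir`.

    Returns a dict mapping file name -> folder that contains only that file.
    """
    source_dir, work_dir = Path(source_dir), Path(work_dir)
    folders = {}
    for index, entry in enumerate(rejected_files):
        name = Path(entry).name  # tolerate entries given as full paths
        target = work_dir / f"rejected_{index}"
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_dir / name, target / name)
        folders[name] = target
    return folders
```

Each returned folder can then be passed to the parser on its own, in the same way the whole pdf directory was passed above.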

BakedJesus avatar Mar 15 '24 17:03 BakedJesus

Hi, I have an issue with pdf_parser as well, though a slightly different one. It does not reject the whole document, but it extracts just 16 pages out of a 248-page document. Currently there is no way to debug or investigate problems with pdf_parser (or I haven't found one). Would it be possible to share the source code of those binaries, or to make it possible to fall back to a different implementation based on, for example, PyMuPDF? I have a custom tokenizer based on PyMuPDF, and it is able to read all 248 pages. If I knew the expected contract for pdf_parser, I could submit a PR with a PyMuPDF implementation.
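
To make the fallback idea concrete, here is a minimal sketch. Since the pdf_parser contract is not known, it assumes a hypothetical interface in which an extractor takes a file path and returns a list with one text string per page. `extract_pages` and `extract_with_pymupdf` are illustrative names, not llmware functions.

```python
def extract_with_pymupdf(pdf_path):
    """Return one text string per page using PyMuPDF (pip install pymupdf)."""
    import fitz  # PyMuPDF; imported lazily so the rest runs without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def extract_pages(pdf_path, primary, fallback, expected_pages=None):
    """Try `primary`; use `fallback` if it fails or returns too few pages.

    `primary` and `fallback` are callables taking a path and returning a
    list of page texts (an assumed contract, for illustration only).
    """
    try:
        pages = primary(pdf_path)
    except Exception:
        pages = []
    too_few = expected_pages is not None and len(pages) < expected_pages
    if not pages or too_few:
        pages = fallback(pdf_path)
    return pages
```

With a wrapper around the existing parser as `primary` and `extract_with_pymupdf` as `fallback`, a document that yields only 16 of its 248 pages would be retried with PyMuPDF when `expected_pages=248` is given.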

arekglowacki avatar Mar 16 '24 17:03 arekglowacki

:thinking: @arekglowacki May I ask you to open an issue that we can categorize as a feature request? Thank you!

MacOS avatar Mar 16 '24 19:03 MacOS