PDF files getting rejected in parse step
Out of a total of 4 files, 2 of my pdf files are being rejected. I've included one of the files as a sample.
My setup:
- Ubuntu 22.04
- Anaconda Python 3.10
- `docker compose up` [mongo and milvus]

To reproduce, simply call library.add_files on my provided file: sample_rejected.pdf
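For completeness, this is roughly how the files are being added (a sketch assuming llmware's `Library` API with `create_new_library` and `add_files`; it requires llmware installed and the Mongo container from the compose setup running, so treat the exact names as assumptions):

```python
def reproduce(input_folder: str, library_name: str = "pdf_repro"):
    # Lazy import: needs llmware plus a reachable MongoDB instance.
    from llmware.library import Library

    lib = Library().create_new_library(library_name)
    # add_files() parses everything in input_folder and should return a summary
    # dict with 'processed_files' / 'rejected_files' entries, as in the output below.
    return lib.add_files(input_folder_path=input_folder)
```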
@BakedJesus - we will take a look. It's a big file (500+ pages), but that should be fine. A couple of quick checks:
- Did you see an "encrypted file" readout in the screen display while parsing? That would be one explanation - our parser does not attempt to decrypt an encrypted PDF; the file is skipped instead.
- Was there a segfault or other crash? (Unlikely, but not impossible - a book-sized file with a lot of images may have tripped something.)
- It looks like the file has a lot of scanned content. The parser does not apply OCR to that content, so it is also possible that much of the file content was "skipped" - you would need to apply OCR to read the scanned images.

We will take a closer look at the file and get back to you.
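The first and third checks above can be partly automated before parsing. A rough byte-level triage sketch (not a real PDF parser - it just counts name tokens in the raw file, which compressed object streams can hide, so treat the results as hints only):

```python
from pathlib import Path

def preflight(pdf_path: str) -> dict:
    """Rough triage of a PDF before parsing: encryption flag and font/image counts.

    Heuristic only: scans raw bytes for PDF name tokens. Tokens inside
    compressed object streams will not be found, so results are hints, not proof.
    """
    data = Path(pdf_path).read_bytes()
    return {
        "encrypted": b"/Encrypt" in data,   # trailer usually references /Encrypt
        "font_refs": data.count(b"/Font"),  # near zero suggests mostly scanned pages
        "image_refs": data.count(b"/Image") + data.count(b"/DCTDecode"),
    }
```

A file with `encrypted=True` or with many image references and almost no font references is a candidate for the "encrypted" or "scanned content" explanations above.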
Hey @doberst, thanks for responding.
- No, the file wasn't encrypted, and there was no readout.
- No crashes. It just printed a list of rejected files.
- Funnily enough, the one PDF it did manage to parse was completely scanned! Whereas the two it rejected were at least digital copies.
I'm pasting the output from calling the parser directly on the PDF directory:

```
summary: pdf_parser - total pdf files processed - 4
summary: pdf_parser - total input files received - 4
summary: pdf_parser - total blocks created - 2700
summary: pdf_parser - total images created - 0
summary: pdf_parser - total tables created - 24
summary: pdf_parser - total pages added - 651
summary: pdf_parser - PDF Processing - Finished - time elapsed - 3.291197
update: pdf_parser - Completed Parsing - processing time - 3.291197
{'processed_files': [*HERE IT LISTS ACCEPTED FILES*], 'rejected_files': [*HERE IT LISTS THE TWO REJECTED FILES*], 'duplicate_files': []}
```
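For what it's worth, the rejected names can be pulled out of the returned dict programmatically (a small stdlib-only sketch; the key names `processed_files` / `rejected_files` / `duplicate_files` are taken from the output pasted above):

```python
def summarize(parse_result: dict) -> str:
    """Render the parser's returned dict as a short per-category report."""
    lines = []
    for key in ("processed_files", "rejected_files", "duplicate_files"):
        files = parse_result.get(key, [])
        lines.append(f"{key}: {len(files)}")
        lines.extend(f"  - {name}" for name in files)
    return "\n".join(lines)

# Example with placeholder file names:
print(summarize({
    "processed_files": ["ok_1.pdf", "ok_2.pdf"],
    "rejected_files": ["sample_rejected.pdf"],
    "duplicate_files": [],
}))
```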
Do you have any suggestions on how I can debug this issue?
Hi, I have an issue with pdf_parser as well, but a slightly different one. It does not reject the whole document; instead it extracts only 16 pages out of a 248-page document. Currently there is no way to debug or investigate problems with pdf_parser (or I haven't found one). Would it be possible to share the source code of those binaries? Or to allow a fallback to a different implementation, based for example on PyMuPDF? I have a custom tokenizer based on PyMuPDF and it is able to read all 248 pages. If I knew the expected contract for pdf_parser, I could submit a PR with a PyMuPDF implementation.
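To make the proposal concrete, here is a minimal sketch of such a fallback. Since the actual pdf_parser contract isn't published, I'm assuming a simple `(page_number, text)` output per page; the extraction part requires PyMuPDF:

```python
def merge_page_blocks(blocks: list[str]) -> str:
    """Join the text blocks of one page, dropping empty ones."""
    return "\n".join(b.strip() for b in blocks if b.strip())

def extract_pages(pdf_path: str) -> list[tuple[int, str]]:
    """Extract (page_number, text) for every page using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so merge_page_blocks stays importable

    pages = []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            # get_text("blocks") yields (x0, y0, x1, y1, text, block_no, block_type);
            # block_type 0 is text, 1 is image, so keep only the text blocks.
            blocks = [b[4] for b in page.get_text("blocks") if b[6] == 0]
            pages.append((i, merge_page_blocks(blocks)))
    return pages
```

Whatever the real contract turns out to be (blocks, tables, image references, etc.), the glue above could be adapted to emit it.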
:thinking: @arekglowacki May I ask you to open a separate issue that we can categorize as a feature request? Thank you!