semantra
PDF parsing error handling
Hi, it would be useful to add some error handling for the case where a PDF fails to parse. I got the error below after parsing 1000s of PDFs and had to restart from scratch (not a big deal, of course, since I used a small model for embedding, but it would be annoying if a large OpenAI model had been used).
(semantra) rico@xxx:~/src/semantra$ semantra --model sgpt-1.3B data/*pdf
semantra --model sgpt-1.3B data/test.pdf
test.pdf: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/rico/.local/bin/semantra", line 8, in <module>
sys.exit(main())
File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 594, in main
documents[fn] = process(
File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 146, in process
content = get_text_content(md5, filename, semantra_dir, force, silent)
File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 45, in get_text_content
return get_pdf_content(md5, filename, semantra_dir, force, silent)
File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/pdf.py", line 53, in get_pdf_content
pdf = pdfium.PdfDocument(filename)
File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/pypdfium2/_helpers/document.py", line 86, in __init__
self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/pypdfium2/_helpers/document.py", line 721, in _open_pdf
raise PdfiumError(f"Failed to load document (PDFium: {consts.ErrorToStr.get(err_code)}).")
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Data format error).
Never mind, I just realized it caches everything. Error handling would still be nice to have, though.
I should probably make the cache handling more clear in the docs so folks are reassured.
Great point re: error handling. Logging an error message and continuing is the way to go here. Also, if there's a PDF that fails to parse but should work (and you're comfortable sharing it), let me know!
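For anyone following along, here is a rough sketch of what "log and continue" could look like. This is not semantra's actual code: `process_all`, `process_one`, and `fake_parse` are hypothetical names made up for illustration, and the stand-in parser only imitates the "Data format error" from the traceback above. In semantra itself the exception to catch would presumably be the `PdfiumError` shown in that traceback.

```python
import logging

logger = logging.getLogger("semantra.sketch")  # hypothetical logger name


def process_all(filenames, process_one):
    """Run process_one on each file; log and skip any file that raises."""
    documents = {}
    failures = {}
    for fn in filenames:
        try:
            documents[fn] = process_one(fn)
        except Exception as exc:  # a real fix would likely catch PdfiumError only
            logger.error("Skipping %s: %s", fn, exc)
            failures[fn] = str(exc)
    return documents, failures


def fake_parse(fn):
    """Stand-in parser (assumption): fails for one file, succeeds otherwise."""
    if fn == "empty.pdf":
        raise ValueError("Failed to load document (PDFium: Data format error).")
    return f"text of {fn}"


docs, failed = process_all(["a.pdf", "empty.pdf", "b.pdf"], fake_parse)
```

Collecting the failures in a separate dict means the run can finish and then report every skipped file in one place, rather than stopping at the first bad PDF.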
It was a fault on my end; the PDF was empty for some reason.
I also realized the search is quite slow for 1000s of PDFs. Is this because I'm using a relatively big model or just because they're in PDF format? Would it be faster if it was raw text or if I use a smaller model?
Just here to +1: it would be great to skip PDFs that have errors. It also currently breaks if it encounters any password-protected PDFs (see below). Thank you for this very useful tool!
Traceback (most recent call last):
File "/Users/sameer/.local/bin/semantra", line 8, in <module>
sys.exit(main())
^^^^^^
File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/semantra/semantra.py", line 619, in main
documents[fn] = process(
^^^^^^^^
File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/semantra/semantra.py", line 158, in process
content = get_text_content(md5, filename, semantra_dir, force, silent, encoding)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/semantra/semantra.py", line 50, in get_text_content
return get_pdf_content(md5, filename, semantra_dir, force, silent)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/semantra/pdf.py", line 53, in get_pdf_content
pdf = pdfium.PdfDocument(filename)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/pypdfium2/_helpers/document.py", line 78, in __init__
self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/pypdfium2/_helpers/document.py", line 678, in _open_pdf
raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Incorrect password error).
-> Cannot close object, library is destroyed. This may cause a memory leak!