semantra PDF parsing error handling

Hi, it would be useful to have some error handling for the case where a PDF fails to parse. I got the error below after parsing 1000s of PDFs and had to restart from scratch (not a big deal, of course, since I used a small model for embedding, but it would be annoying if a large OpenAI model had been used).

(semantra) rico@xxx:~/src/semantra$ semantra --model sgpt-1.3B data/*pdf
semantra --model sgpt-1.3B data/test.pdf 
test.pdf:   0%|  | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/rico/.local/bin/semantra", line 8, in <module>
    sys.exit(main())
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 594, in main
    documents[fn] = process(
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 146, in process
    content = get_text_content(md5, filename, semantra_dir, force, silent)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 45, in get_text_content
    return get_pdf_content(md5, filename, semantra_dir, force, silent)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/pdf.py", line 53, in get_pdf_content
    pdf = pdfium.PdfDocument(filename)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/pypdfium2/_helpers/document.py", line 86, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/pypdfium2/_helpers/document.py", line 721, in _open_pdf
    raise PdfiumError(f"Failed to load document (PDFium: {consts.ErrorToStr.get(err_code)}).")
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Data format error).

Apr 26 '23 20:04 ricomnl

Never mind, I just realized it caches everything. Error handling would still be nice to have, though.

Apr 26 '23 20:04 ricomnl

I should probably make the cache handling clearer in the docs so folks are reassured.

Great point re: error handling. Logging an error message and continuing is the way to go here. Also, if there's a PDF that should parse but doesn't (and you're comfortable sharing it), let me know!

Apr 27 '23 04:04 freedmand
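
The "log an error message and continue" approach suggested above could look roughly like the sketch below. This is not semantra's actual code: `process_all`, `process_one`, and `fake_parse` are hypothetical names used only for illustration, and the broad `except Exception` stands in for whatever parser-specific exception is raised (in the tracebacks in this thread, that is `PdfiumError`).

```python
import logging
from typing import Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger("pdf_batch_sketch")


def process_all(
    filenames: Iterable[str],
    process_one: Callable[[str], str],
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Process each file, logging and skipping any file that fails to parse.

    `process_one` is a hypothetical stand-in for the per-file parsing step.
    Returns the successfully processed documents and a list of
    (filename, error message) pairs for the files that were skipped.
    """
    documents: Dict[str, str] = {}
    failures: List[Tuple[str, str]] = []
    for fn in filenames:
        try:
            documents[fn] = process_one(fn)
        except Exception as exc:  # assumption: narrow this to the parser's own error type
            logger.error("Skipping %s: %s", fn, exc)
            failures.append((fn, str(exc)))
    return documents, failures


def fake_parse(fn: str) -> str:
    """Hypothetical parser used only to demonstrate the control flow."""
    if fn.endswith("empty.pdf"):
        raise ValueError("Failed to load document (Data format error).")
    return f"text of {fn}"


docs, failed = process_all(["a.pdf", "empty.pdf", "b.pdf"], fake_parse)
```

With this shape, one unreadable or password-protected file is reported and skipped while the rest of the batch is still processed; the collected failures can be printed as a summary at the end of the run.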

It was a fault on my end; the PDF was empty for some reason.

Apr 28 '23 16:04 ricomnl

I also realized the search is quite slow for 1000s of PDFs. Is this because I'm using a relatively big model, or just because the files are in PDF format? Would it be faster with raw text or with a smaller model?

Apr 28 '23 16:04 ricomnl

Just here to +1: it would be great to skip PDFs that have errors. It also currently breaks if it encounters any password-protected PDFs (see below). Thank you for this very useful tool!

Traceback (most recent call last):
  File "/Users/sameer/.local/bin/semantra", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/semantra/semantra.py", line 619, in main
    documents[fn] = process(
                    ^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/semantra/semantra.py", line 158, in process
    content = get_text_content(md5, filename, semantra_dir, force, silent, encoding)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/semantra/semantra.py", line 50, in get_text_content
    return get_pdf_content(md5, filename, semantra_dir, force, silent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/semantra/pdf.py", line 53, in get_pdf_content
    pdf = pdfium.PdfDocument(filename)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/pypdfium2/_helpers/document.py", line 78, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/pypdfium2/_helpers/document.py", line 678, in _open_pdf
    raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Incorrect password error).
-> Cannot close object, library is destroyed. This may cause a memory leak!

Jun 24 '24 21:06 sam33r
