docling icon indicating copy to clipboard operation
docling copied to clipboard

fix: PermissionError when using tesseract_ocr_cli_model

Open gaspardpetit opened this issue 1 year ago • 2 comments

  • Make sure that the tesseract_ocr_cli_model.py does not open the png image file twice (before it was opened once with tempfile.NamedTemporaryFile and again by passing the file name rather than the file object itself to high_res_image.save;
  • Ensure that _run_tesseract is executed after the file is no longer open by python. This otherwise results in a "PermissionError: [Errno 13] Permission denied" error on Windows.

gaspardpetit avatar Nov 25 '24 14:11 gaspardpetit

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • [ ] #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?:

mergify[bot] avatar Nov 25 '24 14:11 mergify[bot]

Thanks @gaspardpetit for this fix. Here a few notes

  1. Please review the test failures. I think you are missing the pre-commit styling, which could be fixed with pre-commit run --all-files
  2. I was reading a bit more about the issue in the python docs
    • it would be better to use delete_on_close=False and avoid the extra try/finally. Can you please give it a try?

dolfim-ibm avatar Nov 26 '24 06:11 dolfim-ibm

Thank you for reviewing this pull request. I have fixed the import order according the the pre-commit linter.

I disagree with using delete_on_close=False (with delete=True and without the try/finally blocks):

With delete_on_close=False, temporary files will survive until the script terminates. On a script being used to convert a large number of pdf files, each containing potentially hundreds of pages, this would resulting thousand of temporary image files being deleted only on termination. Within containers with limited storage, this could be a huge disadvantage. It may also make the termination of the script appear to hang and possibly timeout in some scenarios, leading to the process being killed and the files leaking.

Let me know if you would still prefer to use delete_on_close=False and I will amend my pull request accordingly.

gaspardpetit avatar Dec 01 '24 16:12 gaspardpetit

I think we both want to achieve the same, but we interpret the tempfile library documentation in different ways 😉

Let me elaborate my understanding of it.

  • When the file is used within a with context. The argument delete=True is doing exactly what you propose with the finally:. At the end of the context, the file should be removed.
  • Some operating system (I think mostly Windows) have special handlers which would delete the file, when the .close() method is called on it, even if still within the with statement. This is what I think is causing the error you reported.

My understanding of the tempfile options is that using delete_on_close=False and delete=True (within a with: context) would achieve what we both want: 1) keep the file alive until the OCR is done, 2) delete it as soon it is no longer needed.

As posted initially, this is just an understanding of the documentation, and I don't have a way to test it on Windows. If you want to give it a try and see it is matching the expectations above, we could simplify the fix and achieve what we both want.

dolfim-ibm avatar Dec 02 '24 07:12 dolfim-ibm

Note: the CI is currently reporting

  1. one of your commit is missing the DCO sign-off. Can you please fix it with git rebase HEAD~15 --signoff?
  2. the diff currently looks weird. I'm wondering if something went wrong in merging with the latest main branch.

dolfim-ibm avatar Dec 02 '24 07:12 dolfim-ibm

Hi! Thank you for reviewing this.

You are right, I might have merged incorrectly - I will probably end up submitting again from a clean branch.

Just to confirm that I understand correctly - using delete_on_close will also bump the python requirements to 3.12 and I think the docling currently supports 3.9. Do we want that?

gaspardpetit avatar Dec 02 '24 13:12 gaspardpetit

using delete_on_close will also bump the python requirements to 3.12

Good catch. Sorry, I didn't notice it before. Then it is definitely a no-go.

Then I fully support your solution with delete=False.

dolfim-ibm avatar Dec 02 '24 13:12 dolfim-ibm

Brilliant, I'll resubmit as a clean branch in a couple of hours then, thank !

gaspardpetit avatar Dec 02 '24 13:12 gaspardpetit

Replaced by clean merge request https://github.com/DS4SD/docling/pull/496

gaspardpetit avatar Dec 03 '24 03:12 gaspardpetit