docling
docling copied to clipboard
fix: PermissionError when using tesseract_ocr_cli_model
- Make sure that the
tesseract_ocr_cli_model.pydoes not open the png image file twice (before it was opened once withtempfile.NamedTemporaryFileand again by passing the file name rather than the file object itself tohigh_res_image.save; - Ensure that
_run_tesseractis executed after the file is no longer open by python. This otherwise results in a "PermissionError: [Errno 13] Permission denied" error on Windows.
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates
This rule is failing.
When test data is updated, we require two reviewers
- [ ]
#approved-reviews-by >= 2
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
- [X]
title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?:
Thanks @gaspardpetit for this fix. Here a few notes
- Please review the test failures. I think you are missing the pre-commit styling, which could be fixed with
pre-commit run --all-files - I was reading a bit more about the issue in the python docs
- it would be better to use
delete_on_close=Falseand avoid the extratry/finally. Can you please give it a try?
- it would be better to use
Thank you for reviewing this pull request. I have fixed the import order according the the pre-commit linter.
I disagree with using delete_on_close=False (with delete=True and without the try/finally blocks):
With delete_on_close=False, temporary files will survive until the script terminates. On a script being used to convert a large number of pdf files, each containing potentially hundreds of pages, this would resulting thousand of temporary image files being deleted only on termination. Within containers with limited storage, this could be a huge disadvantage. It may also make the termination of the script appear to hang and possibly timeout in some scenarios, leading to the process being killed and the files leaking.
Let me know if you would still prefer to use delete_on_close=False and I will amend my pull request accordingly.
I think we both want to achieve the same, but we interpret the tempfile library documentation in different ways 😉
Let me elaborate my understanding of it.
- When the file is used within a
withcontext. The argumentdelete=Trueis doing exactly what you propose with thefinally:. At the end of the context, the file should be removed. - Some operating system (I think mostly Windows) have special handlers which would delete the file, when the
.close()method is called on it, even if still within the with statement. This is what I think is causing the error you reported.
My understanding of the tempfile options is that using delete_on_close=False and delete=True (within a with: context) would achieve what we both want: 1) keep the file alive until the OCR is done, 2) delete it as soon it is no longer needed.
As posted initially, this is just an understanding of the documentation, and I don't have a way to test it on Windows. If you want to give it a try and see it is matching the expectations above, we could simplify the fix and achieve what we both want.
Note: the CI is currently reporting
- one of your commit is missing the DCO sign-off. Can you please fix it with
git rebase HEAD~15 --signoff? - the diff currently looks weird. I'm wondering if something went wrong in merging with the latest main branch.
Hi! Thank you for reviewing this.
You are right, I might have merged incorrectly - I will probably end up submitting again from a clean branch.
Just to confirm that I understand correctly - using delete_on_close will also bump the python requirements to 3.12 and I think the docling currently supports 3.9. Do we want that?
using
delete_on_closewill also bump the python requirements to3.12
Good catch. Sorry, I didn't notice it before. Then it is definitely a no-go.
Then I fully support your solution with delete=False.
Brilliant, I'll resubmit as a clean branch in a couple of hours then, thank !
Replaced by clean merge request https://github.com/DS4SD/docling/pull/496