OCRmyPDF
Reduce memory usage for very large files (high page count and large file size)
ocrmypdf uses excessive temporary storage for files with very high page counts (hundreds of pages), enough that it might consume all available temporary storage on small devices, e.g. a 100 MB PDF produces >2 GB of intermediates.
In the worst case we need O(number of pages × number of intermediates per page) temporary storage. Currently we get major savings because some intermediates are just soft links to other intermediates.
Opportunities for savings:
- [x] ~~PDF page split is breadth first, producing one file per page before some files are needed. It would be reasonable to have a group of 25 splitter.~~ Implemented in v7.
- [ ] Intermediate temporary files could be deleted as soon as every task interested in them has consumed them. Each object produced by the pipeline could have a reference count: when a task produces an output, set the count to the number of downstream tasks that will read that file; whenever a consuming task finishes, decrement the count for each of its inputs and delete any file whose count reaches zero.
- [ ] Prioritize depth over breadth when a worker process is free to select a new task, if Ruffus or its replacement doesn't already do this. A depth-first topological ordering might get this for free.
Reference counting is probably the easiest way to get better performance here without complications.
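The reference-counting idea could be sketched roughly as follows. This is a hypothetical illustration, not OCRmyPDF's actual pipeline code; the class and method names (`IntermediateRefCounter`, `register`, `release`) are invented for the example.

```python
import os
from collections import defaultdict


class IntermediateRefCounter:
    """Delete an intermediate file once every downstream task
    that reads it has finished (hypothetical sketch)."""

    def __init__(self):
        self._refs = defaultdict(int)

    def register(self, path, consumers):
        # Called when a task produces `path`: record how many
        # downstream tasks will consume this file.
        self._refs[path] = consumers

    def release(self, path):
        # Called when a consuming task finishes with `path`.
        # Unlink the file as soon as no consumers remain.
        self._refs[path] -= 1
        if self._refs[path] <= 0:
            del self._refs[path]
            os.unlink(path)
```

With this scheme, peak storage is bounded by the files currently in flight rather than by the total number of intermediates.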
One option would be to hardlink the input files each time a task is run and delete them when the task is done. That delegates the reference counting to the file system. We could fall back to cp if ln does not work (e.g., when the temporary directory spans file systems).
```
/tmp/ocrmypdf-somefile/
    triage/
        origin
        origin.pdf
    repair/
        input.pdf           # hardlinked to origin.pdf
        input.repaired.pdf
```
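The hardlink-with-fallback step above might look like this minimal sketch; the helper name `link_or_copy` is an assumption, not an existing OCRmyPDF function:

```python
import os
import shutil


def link_or_copy(src, dst):
    """Hard link src to dst so the file system tracks the link
    count; fall back to copying if hard links are unsupported
    (e.g. the paths are on different file systems)."""
    try:
        os.link(src, dst)
    except OSError:
        shutil.copy2(src, dst)
```

When every task unlinks its own hardlinked inputs on completion, the underlying data is freed automatically once the link count drops to zero.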
Ruffus is not capable of a depth-first/greedy exploration of the pipeline DAG.
Improved for v7
Improved again for v8.4/v9
Why does OCRmyPDF create all the PDF files in one batch? And then it doesn't clean up its working directory until it's done.
I would expect it to extract pages as they are consumed by threads, and to delete everything besides the finalized page *.pdf as the page finishes processing.
As a result of this, if you have /tmp mounted on ram, you absolutely can OOM because of too many temporary files.
That would not be good for me. Since I'm using OCRmyPDF in an (at the moment private) plugin for Calibre with a proofreading step after the OCR process on scanned PDF documents, I depend on the existence of all temporary hOCR files until I delete them myself after proofreading.
@installgentoo The working directory is definitely cleaned up at the end of processing. However, some intermediate resources are not deleted as soon as they could be. This hasn't been implemented because it mainly affects people with low temporary storage who are processing a lot of files, and in most cases adding more storage is a good-enough workaround. As @bertholdm mentions, the introduction of plugins has made changes like this more complex.