
Reduce memory usage for very large files (high page count and large file size)

Open jbarlow83 opened this issue 7 years ago • 7 comments

ocrmypdf uses excessive temporary storage for files with very high page counts (hundreds of pages), enough that it might consume everything available on small devices; e.g. a 100 MB PDF produces >2 GB of intermediates.

In the worst case we need O(number of pages × number of intermediates per page) of storage. Currently we get major savings because some intermediates are just soft links to other intermediates.

Opportunities for savings:

  • [x] ~~PDF page split is breadth first, producing one file per page long before most of those files are needed. It would be reasonable to split in groups of, say, 25 pages.~~ Implemented in v7.
  • [ ] Intermediate temporary files could be deleted once every task interested in them has consumed them. Each object produced by the pipeline could carry a reference count. When a task produces an output, we ask how many downstream tasks want to see that file and set its reference count accordingly; whenever a task finishes, the reference count of each of its inputs is decreased, and a file is deleted when its count reaches zero (see the sketch after this list).
  • [ ] Prioritize depth over breadth when a worker process is free to select a new task, if ruffus or its replacement doesn't already do this. A depth-first topological ordering might get this for free.
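
As a rough illustration of the reference-counting idea above (a hypothetical sketch only, not OCRmyPDF code; `TempFileRegistry` and the file names are invented):

```python
import os
import tempfile
from collections import defaultdict


class TempFileRegistry:
    """Track how many downstream tasks still need each intermediate file."""

    def __init__(self):
        self._refcounts = defaultdict(int)

    def register(self, path, consumers):
        # Number of downstream tasks that will read this file.
        self._refcounts[path] = consumers

    def release(self, path):
        # A consumer is done with the file; delete it once nobody needs it.
        self._refcounts[path] -= 1
        if self._refcounts[path] <= 0:
            os.unlink(path)
            del self._refcounts[path]


# Example: one intermediate consumed by two downstream tasks.
fd, page_image = tempfile.mkstemp(suffix=".png")
os.close(fd)

registry = TempFileRegistry()
registry.register(page_image, consumers=2)
registry.release(page_image)   # first consumer (e.g. OCR) done; file kept
registry.release(page_image)   # second consumer done; file deleted here
assert not os.path.exists(page_image)
```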

jbarlow83 avatar Dec 09 '16 22:12 jbarlow83

Reference counting is probably the easiest way to improve this without complications.

One option would be to hard-link the input files each time a task runs and delete the links when the task is done. That delegates reference counting to the file system. We could fall back to cp if ln does not work (sketch after the layout below).

/tmp/ocrmypdf-somefile/
    triage/
         origin
         origin.pdf
    repair/
         input.pdf  # hardlinked to origin.pdf
         input.repaired.pdf
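
A minimal sketch of that hard-link-with-copy-fallback idea, assuming the hypothetical layout above (illustrative only, not OCRmyPDF's implementation):

```python
import os
import shutil


def stage_input(src, dst):
    """Expose src to a task as dst, preferring a hard link over a copy."""
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    try:
        os.link(src, dst)       # refcounting delegated to the filesystem
    except OSError:
        shutil.copy2(src, dst)  # fallback when ln fails (e.g. cross-device)


def finish_task(dst):
    """Drop the task's link; the data survives while other links remain."""
    os.unlink(dst)


# Usage against the hypothetical layout above:
# stage_input("/tmp/ocrmypdf-somefile/triage/origin.pdf",
#             "/tmp/ocrmypdf-somefile/repair/input.pdf")
# ... repair task runs ...
# finish_task("/tmp/ocrmypdf-somefile/repair/input.pdf")
```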

jbarlow83 avatar Mar 01 '17 20:03 jbarlow83

Ruffus is not capable of a depth-first/greedy exploration of the pipeline DAG.
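
For what it's worth, here is a toy sketch of the depth-first ordering discussed above (not based on OCRmyPDF or Ruffus; the step names are invented): each worker finishes a whole page's chain of steps before taking the next page, so only a bounded number of pages have intermediates on disk at once.

```python
from concurrent.futures import ProcessPoolExecutor

STEPS = ["rasterize", "ocr", "render"]  # invented per-page steps


def run_step(step, page):
    # Placeholder for real work; the previous step's intermediate for this
    # page could be deleted here as soon as this step has consumed it.
    return f"page {page}: {step}"


def process_page(page):
    # Depth first: run every step for this page before starting another page.
    for step in STEPS:
        run_step(step, page)
    return page


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        for page in pool.map(process_page, range(1, 101)):
            pass  # this page is fully done; its temp files are no longer needed
```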

jbarlow83 avatar Mar 23 '18 18:03 jbarlow83

Improved for v7

jbarlow83 avatar Jun 23 '18 09:06 jbarlow83

Improved again for v8.4/v9

jbarlow83 avatar Jun 04 '19 09:06 jbarlow83

Why does OCRmyPDF create all the PDF files in one batch? And why doesn't it clean up its working directory until it's done?

I would expect it to extract pages as they are consumed by threads, and to delete everything besides the finalized page *.pdf as the page finishes processing.

As a result, if you have /tmp mounted in RAM (tmpfs), you absolutely can run out of memory because of too many temporary files.

installgentoo avatar May 31 '23 15:05 installgentoo

That would not be good for me. I'm using OCRmyPDF in a (for now private) plugin for Calibre with a proofreading step after OCR of scanned PDF documents, so I depend on all the temporary hOCR files existing until I delete them myself after proofreading.

bertholdm avatar May 31 '23 17:05 bertholdm

@installgentoo The working directory is definitely cleaned up at the end of processing. However, some intermediate resources are not deleted as soon as they could be. This hasn't been implemented because it mainly affects people with little temporary storage who are processing a lot of files; in most cases adding more storage is a good-enough workaround. As @bertholdm mentions, the introduction of plugins has made changes like this more complex.

jbarlow83 avatar Jun 01 '23 07:06 jbarlow83