
METS Server-based page parallelism for `ocrd process`

Open kba opened this issue 2 years ago • 1 comments

BTW, we could also provide this per-page parallelism recipe in core via Python. For the user, it could then look like this:

ocrd process --jobs 4 --timeout 2m --on-error=empty

Originally posted by @bertsky in https://github.com/OCR-D/ocrd-demo-mets-server/pull/3#discussion_r1422845725

kba, Dec 13 '23 13:12

To elaborate:

  • [ ] add an option `--jobs` to `ocrd process` that splits the workspace into per-page pipelines, synchronised via the METS Server and managed by Python's built-in multiprocessing facilities.
    → could also offer additional options (e.g. splitting into chunks instead of pages)
  • [ ] add an option `--timeout` that applies to the lowest substep (normally a whole-workspace call of a single processor; in the parallel case, a single-page call of a single processor)
    → for now merely a stopgap; later to be implemented in `Processor.process_page` and `Processor.process_workspace` once we have the new processor API
  • [ ] add an option `--on-error` with several choices (`raise`, `ignore`, `skip`, `empty`)
    → for now merely a stopgap; later to be implemented in `Processor.process_page` and `Processor.process_workspace` once we have the new processor API, including error handling
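A rough sketch of how the three options could interact, using only the standard library. This is not the OCR-D implementation: `run_pages`, `process_page`, the page IDs and the `pool_factory` parameter are hypothetical names, and since the list above does not say what `ignore`, `skip` and `empty` should do to the output, the sketch merely records which policy applied to a failed page.

```python
import multiprocessing
from multiprocessing.pool import ThreadPool

# Policy names taken from the proposal; their exact semantics are left open there.
ON_ERROR_POLICIES = ("raise", "ignore", "skip", "empty")


def run_pages(page_ids, process_page, jobs=4, timeout=120.0,
              on_error="raise", pool_factory=multiprocessing.Pool):
    """Call process_page(page_id) for every page in a pool of `jobs` workers.

    Returns a dict mapping each page ID to ("ok", result) on success
    or (on_error, exception) on failure or timeout.
    """
    if on_error not in ON_ERROR_POLICIES:
        raise ValueError(f"unknown on-error policy: {on_error}")
    outcomes = {}
    # Leaving the with-block terminates the pool, which also stops
    # worker processes that are still busy with a timed-out page.
    with pool_factory(jobs) as pool:
        pending = {page_id: pool.apply_async(process_page, (page_id,))
                   for page_id in page_ids}
        for page_id, async_result in pending.items():
            try:
                # The timeout counts from this call, not from the start of
                # the job, so it is only an approximate per-page limit.
                outcomes[page_id] = ("ok", async_result.get(timeout=timeout))
            except Exception as exc:  # includes multiprocessing.TimeoutError
                if on_error == "raise":
                    raise
                outcomes[page_id] = (on_error, exc)
    return outcomes


if __name__ == "__main__":
    # ThreadPool offers the same interface as Pool and keeps the demo simple.
    print(run_pages(["PHYS_0001", "PHYS_0002"], str.lower,
                    jobs=2, pool_factory=ThreadPool))
```

With the default `multiprocessing.Pool`, `process_page` has to be a picklable module-level function. In a real pipeline it would presumably run the processor on one page against the shared METS Server rather than return a value.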

bertsky, Dec 13 '23 13:12

This has been superseded by the v3.0 API changes:

ocrd process --jobs 4 --timeout 2m --on-error=empty

… became …

OCRD_MAX_PARALLEL_PAGES=4 OCRD_PROCESSING_PAGE_TIMEOUT=120 OCRD_MISSING_OUTPUT=COPY ocrd process ...
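For scripts that were drafted against the proposed options, the translation could be sketched as follows. Only the mapping shown above is covered (`--jobs 4`, `--timeout 2m`, `--on-error=empty`). The helper names `parse_duration` and `v3_environment` are hypothetical, and reading the timeout as seconds is inferred from `2m` having become `120`.

```python
import os
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert a duration such as '2m', '90s' or '120' into whole seconds."""
    match = re.fullmatch(r"(\d+)([smh]?)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    # A bare number is treated as seconds.
    return int(value) * _UNIT_SECONDS[unit or "s"]


def v3_environment(jobs, timeout, base_env=None):
    """Build an environment for `ocrd process` from the old-style option values."""
    env = dict(os.environ if base_env is None else base_env)
    env["OCRD_MAX_PARALLEL_PAGES"] = str(jobs)
    env["OCRD_PROCESSING_PAGE_TIMEOUT"] = str(parse_duration(timeout))
    # Only the equivalent of --on-error=empty is given in this thread.
    env["OCRD_MISSING_OUTPUT"] = "COPY"
    return env
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when launching `ocrd process` from Python.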

bertsky, Jul 03 '25 23:07