METS Server based page parallelism for `ocrd process`
BTW, we could also provide this per-page parallelism recipe in core via Python. For the user, it could then look like
ocrd process --jobs 4 --timeout 2m --on-error=empty
Originally posted by @bertsky in https://github.com/OCR-D/ocrd-demo-mets-server/pull/3#discussion_r1422845725
To elaborate:
- [ ] add an option `--jobs` to `ocrd process` which would split the workspace into per-page pipelines synchronised via METS server and managed by Python's builtin `multiprocessing` facilities.
  → could also offer additional options (splitting up into chunks instead of pages...)
- [ ] add another option `--timeout`, applicable to the lowest substep (i.e. whole-workspace single-processor call normally, single-page single-processor call in parallel case)
  → now merely as a stopgap, later to be implemented in `Processor.process_page` and `Processor.process_workspace` when we have the new processor API
- [ ] add another option `--on-error` offering various options (raise, ignore, skip, empty)
  → now merely as a stopgap, later to be implemented in `Processor.process_page` and `Processor.process_workspace` when we have the new processor API including error handling
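To make the three proposed options more concrete, here is a rough sketch of how a per-page fan-out with a timeout and an error policy could fit together. It is not OCR-D code: `process_page`, `run_pages` and the page IDs are hypothetical, the METS server synchronisation is left out entirely, and only the `raise`, `skip` and `empty` policies are modelled. For a self-contained demo it uses the thread-backed `ThreadPool` from `multiprocessing.pool`; a real implementation would presumably swap in a process-based `multiprocessing.Pool`.

```python
import time
from multiprocessing.pool import ThreadPool


def process_page(page_id):
    """Hypothetical stand-in for running the whole pipeline on one page."""
    if page_id == "PHYS_0003":
        raise RuntimeError("simulated processor failure")
    if page_id == "PHYS_0004":
        time.sleep(0.5)  # simulated hang, longer than the demo timeout
    return f"result for {page_id}"


def run_pages(page_ids, jobs=4, timeout=0.2, on_error="skip"):
    """Run process_page on up to `jobs` pages at once.

    `timeout` bounds how long we wait for each page's result (seconds);
    it does not kill the worker. `on_error` decides what happens when a
    page fails or times out: "raise", "skip" or "empty".
    """
    results = {}
    with ThreadPool(jobs) as pool:
        # Submit every page up front so the workers run concurrently.
        pending = {p: pool.apply_async(process_page, (p,)) for p in page_ids}
        for page_id, task in pending.items():
            try:
                results[page_id] = task.get(timeout=timeout)
            except Exception:
                # Covers both processor errors and multiprocessing.TimeoutError.
                if on_error == "raise":
                    raise
                if on_error == "empty":
                    results[page_id] = ""  # placeholder instead of real output
                # "skip": leave the page out of the results
    return results
```

Calling `run_pages(["PHYS_0001", "PHYS_0002", "PHYS_0003", "PHYS_0004"], on_error="empty")` would return real results for the first two pages and empty placeholders for the failing and the hanging page.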
This has been superseded by the v3.0 API changes:
ocrd process --jobs 4 --timeout 2m --on-error=empty
… became …
OCRD_MAX_PARALLEL_PAGES=4 OCRD_PROCESSING_PAGE_TIMEOUT=120 OCRD_MISSING_OUTPUT=COPY ocrd process ...
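When `ocrd process` is launched from a Python script rather than a shell, the same settings can be passed through the environment. The following is a minimal sketch: the helper names `parse_duration` and `v3_environment` are made up for illustration, and the only mapping assumed is the one shown above (`--on-error=empty` corresponding to `OCRD_MISSING_OUTPUT=COPY`).

```python
import os


def parse_duration(text):
    """Convert strings like '2m', '90s' or '120' into whole seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    if text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def v3_environment(jobs, timeout, base_env=None):
    """Build an environment dict carrying the three variables from the example."""
    env = dict(os.environ if base_env is None else base_env)
    env["OCRD_MAX_PARALLEL_PAGES"] = str(jobs)
    env["OCRD_PROCESSING_PAGE_TIMEOUT"] = str(parse_duration(timeout))
    env["OCRD_MISSING_OUTPUT"] = "COPY"  # counterpart of --on-error=empty above
    return env
```

The resulting dict can be handed to `subprocess.run(..., env=v3_environment(4, "2m"))` together with the actual `ocrd process` command line.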