Workflow server
Implementation of the workflow server.
ocrd --help
Commands:
bashlib Work with bash library
log Logging
ocrd-tool Work with ocrd-tool.json JSON_FILE
validate All the validation in one CLI
process Run processor CLIs in a series of tasks
workflow Process a series of tasks
workspace Working with workspace
zip Bag/Spill/Validate OCRD-ZIP bags
ocrd workflow --help
Usage: ocrd workflow [OPTIONS] COMMAND [ARGS]...
Process a series of tasks
Options:
--help Show this message and exit.
Commands:
client Have the workflow server run commands
process Run processor CLIs in a series of tasks (alias for ``ocrd process``)
server Start server for a series of tasks to run processor CLIs or APIs...
ocrd workflow server --help
Usage: ocrd workflow server [OPTIONS] TASKS...
Start server for a series of tasks to run processor CLIs or APIs on
workspaces
Parse the given tasks and try to instantiate all Pythonic processors among
them with the given parameters. Open a web server that listens on the
given host and port for GET requests named ``process`` with the following
(URL-encoded) arguments:
mets (string): Path name (relative to the server's CWD,
or absolute) of the workspace to process
page_id (string): Comma-separated list of page IDs to process
overwrite (bool): Remove output pages/images if they already exist
The server will handle each request by running the tasks on the given
workspace. Pythonic processors will be run via API (on those same
instances). Non-Pythonic processors (or those not directly accessible in
the current venv) will be run via CLI normally, instantiating each time.
Also, between each contiguous chain of Pythonic tasks in the overall
series, no METS de/serialization will be performed.
Stop the server by sending SIGINT (e.g. via ctrl+c on the terminal), or
sending a GET request named ``shutdown``.
Options:
-l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
Log level
-h, --host TEXT host name/IP to listen at
-p, --port INTEGER RANGE TCP port to listen at
--help Show this message and exit.
ocrd workflow client --help
Usage: ocrd workflow client [OPTIONS] COMMAND [ARGS]...
Have the workflow server run commands
Options:
-h, --host TEXT host name/IP to listen at
-p, --port INTEGER RANGE TCP port to listen at
--help Show this message and exit.
Commands:
list-tasks Have the workflow server print the configured task sequence
process Have the workflow server process another workspace
shutdown Have the workflow server shutdown gracefully
ocrd workflow client process --help
Usage: ocrd workflow client process [OPTIONS]
Have the workflow server process another workspace
Options:
-m, --mets TEXT METS to process
-g, --page-id TEXT ID(s) of the pages to process
--overwrite Remove output pages/images if they already exist
--help Show this message and exit.
Example:
ocrd workflow server -p 5000 \
  'olena-binarize -P impl sauvola-ms-split -I OCR-D-IMG -O OCR-D-IMG-BINPAGE-sauvola' \
  'anybaseocr-crop -I OCR-D-IMG-BINPAGE-sauvola -O OCR-D-IMG-BINPAGE-sauvola-CROP' \
  'cis-ocropy-denoise -P noise_maxsize 3.0 -P level-of-operation page -I OCR-D-IMG-BINPAGE-sauvola-CROP -O OCR-D-IMG-BINPAGE-sauvola-CROP-DEN' \
  'tesserocr-deskew -P operation_level page -I OCR-D-IMG-BINPAGE-sauvola-CROP-DEN -O OCR-D-IMG-BINPAGE-sauvola-CROP-DEN-DESK-tesseract' \
  'cis-ocropy-deskew -P level-of-operation page -I OCR-D-IMG-BINPAGE-sauvola-CROP-DEN-DESK-tesseract -O OCR-D-IMG-BINPAGE-sauvola-CROP-DEN-DESK-tesseract-DESK-ocropy' \
  'tesserocr-recognize -P segmentation_level region -P model deu -I OCR-D-IMG-BINPAGE-sauvola-CROP-DEN-DESK-tesseract-DESK-ocropy -O OCR-D-IMG-BINPAGE-sauvola-CROP-DEN-DESK-tesseract-DESK-ocropy-OCR-tesseract-deu'
curl -G -d mets=../kant_aufklaerung_1784/data/mets.xml -d overwrite=True -d page_id=PHYS_0017,PHYS_0020 http://127.0.0.1:5000/process
curl -G -d mets=../blumenbach_anatomie_1805/mets.xml http://127.0.0.1:5000/process
curl -G http://127.0.0.1:5000/shutdown
# equivalently:
ocrd workflow client -p 5000 process -m ../kant_aufklaerung_1784/data/mets.xml --overwrite -g PHYS_0017,PHYS_0020
ocrd workflow client process -m ../blumenbach_anatomie_1805/mets.xml
ocrd workflow client shutdown
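The same GET protocol can of course be driven from Python as well; here is a hedged sketch using the requests library, with host, port and paths taken from the curl example above:

# same /process request as the curl example above, via the requests library
import requests

resp = requests.get('http://127.0.0.1:5000/process', params={
    'mets': '../kant_aufklaerung_1784/data/mets.xml',
    'page_id': 'PHYS_0017,PHYS_0020',
    'overwrite': 'True',
})
print(resp.status_code, resp.text)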
Please note that (as already explained here) this only increases efficiency for Python processors in the current venv, so bashlib processors or sub-venv processors will still be run via CLI. So you might want to group them differently in ocrd_all, or cascade workflows across sub-venvs.
(EDITED to reflect ocrd workflow client addition)
And here's a recipe for doing workspace parallelization with the new workflow server and GNU parallel:
N=4
for ((i=0;i<N;i++)); do
ocrd workflow server -p $((5000+i)) \
"cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN" \
"anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
"skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
"skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
"tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
"tesserocr-segment-region -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG" \
"segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true" \
"tesserocr-deskew -I OCR-D-SEG-REPAIR -O OCR-D-SEG-REG-DESKEW" \
"cis-ocropy-clip -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-REG-DESKEW-CLIP" \
"tesserocr-segment-line -I OCR-D-SEG-REG-DESKEW-CLIP -O OCR-D-SEG-LINE" \
"cis-ocropy-clip -I OCR-D-SEG-LINE -O OCR-D-SEG-LINE-CLIP -P level-of-operation line" \
"cis-ocropy-dewarp -I OCR-D-SEG-LINE-CLIP -O OCR-D-SEG-LINE-RESEG-DEWARP" \
"tesserocr-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P textequiv_level glyph -P overwrite_words true -P model GT4HistOCR_50000000}"
done
find . -name mets.xml | parallel -j$N curl -G -d mets={} http://127.0.0.1:\$((5000+{%}))/process
# equivalently:
find . -name mets.xml | parallel -j$N ocrd workflow client -p \$((5000+{%})) process -m {}
(EDITED to reflect ocrd workflow client addition)
Please note that (as already explained here) this only increases efficiency for Python processors in the current venv, so bashlib processors or sub-venv processors will still be run via CLI. So you might want to group them differently in ocrd_all, or cascade workflows across sub-venvs.
So here are examples for both options:
- grouping sub-venvs in ocrd_all to include ocrd_calamari with all other needed modules in one venv:

cd ocrd_all
. venv/local/sub-venv/headless-tf1/bin/activate
make MAKELEVEL=1 OCRD_MODULES="core ocrd_cis ocrd_anybaseocr ocrd_wrap ocrd_tesserocr tesserocr tesseract ocrd_calamari" all
ocrd workflow server \
  "cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
  "skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
  "cis-ocropy-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-SEG-REG -O OCR-D-SEG-REG-DESKEW" \
  "cis-ocropy-dewarp -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-LINE-RESEG-DEWARP" \
  "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint /path/to/models/\*.ckpt.json"

- cascade workflows across sub-venvs (combined with workspace parallelization):

N=4
for ((i=0;i<N;i++)); do
  . venv/bin/activate
  ocrd workflow server -p $((5000+i)) \
    "cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN"
  . venv/local/sub-venv/headless-tf1/bin/activate
  ocrd workflow server -p $((6000+i)) \
    "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP"
  . venv/bin/activate
  ocrd workflow server -p $((7000+i)) \
    "skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
    "skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
    "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
    "cis-ocropy-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG -P level-of-operation page" \
    "tesserocr-deskew -I OCR-D-SEG-REG -O OCR-D-SEG-REG-DESKEW" \
    "cis-ocropy-dewarp -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-LINE-RESEG-DEWARP"
  . venv/local/sub-venv/headless-tf1/bin/activate
  ocrd workflow server -p $((8000+i)) \
    "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint /path/to/models/\*.ckpt.json"
  . venv/bin/activate
done
find . -name mets.xml | parallel -j$N \
  ocrd workflow client -p \$((5000+{%})) process -m {} "&&" \
  ocrd workflow client -p \$((6000+{%})) process -m {} "&&" \
  ocrd workflow client -p \$((7000+{%})) process -m {} "&&" \
  ocrd workflow client -p \$((8000+{%})) process -m {}
b4a8bcb0: Gosh, I managed to forget that one essential, distinctive change: triggering the actual instantiation!
(Must have slipped through somewhere on the way from the proof-of-concept standalone to this integrated formulation.)
I wonder if Flask (especially its server component, which is branded as a development server and has a debug option) is efficient enough. IIUC we actually need multi-threading in the server if we also want to spawn multiple processes for the worker (i.e. workspace parallelism). So maybe we should use a gevent or uwsgi server instead. (Of course, memory or GPU resources would need to be allocated to them carefully...)
But a multi-threaded server would require explicit session management in GPU-enabled processors (so startup and processing are in the same context). The latter is necessary anyway, though: Otherwise, when a workflow has multiple Tensorflow processors, they would steal each other's graphs.
Another thing that still bothers me is failover capacity. The workflow server should restart if any of its instances crash.
Another potential performance gain might come from transferring the workspace to fast storage during processing, like a RAM disk. So whenever a /process request arrives, clone the workspace to /tmp, run there, and then write back (overwriting). Under Dockerization (simply defining a service via make server PORT=N and EXPOSE N), you would then probably run with --mount type=tmpfs,destination=/tmp.
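A rough sketch of that clone-run-write-back cycle (hypothetical helper; assumes the workspace fits into the tmpfs and Python 3.8+ for dirs_exist_ok):

# hypothetical sketch of the clone/run/write-back idea around a /process request
import shutil
import tempfile
from pathlib import Path

def process_on_ramdisk(workspace_dir, run_tasks):
    """Copy the workspace to tmpfs-backed /tmp, process there, copy back."""
    src = Path(workspace_dir).resolve()
    with tempfile.TemporaryDirectory(dir='/tmp') as tmp:
        work = Path(tmp) / src.name
        shutil.copytree(src, work)                      # clone to fast storage
        run_tasks(str(work / 'mets.xml'))               # run the task sequence there
        shutil.copytree(work, src, dirs_exist_ok=True)  # write back (overwriting)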
On the other hand, that machinery could as well be handled outside the workflow server – so the workspace would already be on fast storage. But for the Docker option, that would be more complicated (data shared with the outside would still need to be slow, so a second service would need to take care of moving workspaces to and fro).
@kba, note ccb369a is necessary for core in general since resmgr wants to be able to identify the startup CWD from the workspace. The other two recent commits add proper handling and passing of errors over the network.
Another thing that still bothers me is failover capacity. The workflow server should restart if any of its instances crash.
That's already covered under normal circumstances: run_api catches everything, logs it, and returns the exception. run_tasks gets that, and re-raises. cli.workflow.process catches everything, logs it, and sends the appropriate response. So the processor instances are still alive. Actual crashes like segfaults or pipe signals would drag down the server as well. So failover would be necessary above that anyway – for example in Docker.
But a multi-threaded server would require explicit session management in GPU-enabled processors (so startup and processing are in the same context). The latter is necessary anyway, though: Otherwise, when a workflow has multiple Tensorflow processors, they would steal each other's graphs.
I just confirmed this by testing: If I start a workflow with multiple TF processors, the last one will steal the others' session. You have to explicitly create sessions, store them on the instance, and reset them prior to processing. (This means changes to all existing TF processors we have. See here and here for examples.)
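A minimal sketch of that pattern in TF1 terms (class and method names are made up here, not the actual processor code): each preloaded processor keeps its own graph and session and re-enters them before every run, so a later-instantiated processor cannot hijack the default graph.

# sketch only: per-instance graph/session isolation for preloaded TF1 processors
import tensorflow as tf  # TF1-style API (tf.compat.v1 under TF2)

class SessionIsolatedProcessor:
    def __init__(self):
        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True  # cooperative GPU RAM allocation
        self.graph = tf.Graph()                 # private graph for this instance
        self.session = tf.Session(graph=self.graph, config=config)
        with self.graph.as_default(), self.session.as_default():
            # load the model into *this* graph (placeholder for real loading code)
            self.model = tf.constant('model weights would be built here')

    def process(self):
        # reset to this instance's graph/session prior to each processing run
        with self.graph.as_default(), self.session.as_default():
            return self.session.run(self.model)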
I wonder if Flask (especially its server component, which is branded as a development server and has a debug option) is efficient enough. IIUC we actually need multi-threading in the server if we also want to spawn multiple processes for the worker (i.e. workspace parallelism). So maybe we should use a gevent or uwsgi server instead. (Of course, memory or GPU resources would need to be allocated to them carefully...)
Since Python's GIL prevents actual thread-level parallelism on shared resources (like processor instances), we'd have to do multi-processing anyway. I think I'll incorporate uwsgi (which does preforking) to achieve workspace parallelism. The server will have to do parse_tasks and instantiate in a dedicated function with Flask's decorator @app.before_first_request IIUC.
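A minimal sketch of that wiring, not the core implementation: the two helpers below are hypothetical stand-ins for core's actual task parsing and processor instantiation, and the preloading happens once per forked uwsgi worker.

from flask import Flask, request, jsonify

app = Flask(__name__)
INSTANCES = []  # preloaded processor instances, one list per uwsgi worker

def parse_tasks():
    # stand-in: would parse the task sequence given on the command line
    return []

def instantiate(task):
    # stand-in: would construct the Pythonic processor for this task
    return task

@app.before_first_request
def preload():
    # runs once per forked worker, before it serves its first request
    INSTANCES.extend(instantiate(t) for t in parse_tasks())

@app.route('/process')
def process():
    mets = request.args.get('mets')
    page_id = request.args.get('page_id', '')
    overwrite = request.args.get('overwrite', 'False') == 'True'
    # here the preloaded instances would be run on the given workspace
    return jsonify(mets=mets, page_id=page_id, overwrite=overwrite, tasks=len(INSTANCES))

Started e.g. via uwsgi --http 127.0.0.1:5000 --processes 4 --module workflow_server:app (module name hypothetical), each forked worker would then build its own instances on its first request.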
I think I'll incorporate uwsgi (which does preforking) to achieve workspace parallelism. The server will have to do parse_tasks and instantiate in a dedicated function with Flask's decorator @app.before_first_request IIUC.
Done! (Works as expected, even with GPU-enabled processors sharing memory via growth.)
Updated help text for ocrd workflow server now reads:
Usage: ocrd workflow server [OPTIONS] TASKS...
Start server for a series of tasks to run processor CLIs or APIs on
workspaces
Parse the given tasks and try to instantiate all Pythonic processors among
them with the given parameters. Open a web server that listens on the
given ``host`` and ``port`` and queues requests into ``processes`` worker
processes for GET requests named ``/process`` with the following (URL-
encoded) arguments:
mets (string): Path name (relative to the server's CWD,
or absolute) of the workspace to process
page_id (string): Comma-separated list of page IDs to process
log_level (int): Override all logger levels during processing
overwrite (bool): Remove output pages/images if they already exist
The server will handle each request by running the tasks on the given
workspace. Pythonic processors will be run via API (on those same
instances). Non-Pythonic processors (or those not directly accessible in
the current venv) will be run via CLI normally, instantiating each time.
Also, between each contiguous chain of Pythonic tasks in the overall
series, no METS de/serialization will be performed.
If processing does not finish before ``timeout`` seconds per page, then
the request will fail and the respective worker be reloaded.
To see the server's workflow configuration, send a GET request named
``/list-tasks``.
Stop the server by sending SIGINT (e.g. via ctrl+c on the terminal), or
sending a GET request named ``/shutdown``.
Options:
-l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
Log level
-t, --timeout INTEGER maximum processing time (in sec per page)
before reloading worker (0 to disable)
-j, --processes INTEGER number of parallel workers to spawn
-h, --host TEXT host name/IP to listen at
-p, --port INTEGER RANGE TCP port to listen at
--help Show this message and exit.
I just confirmed this by testing: If I start a workflow with multiple TF processors, the last one will steal the others' session. You have to explicitly create sessions, store them on the instance, and reset them prior to processing. (This means changes to all existing TF processors we have. See here and here for examples.)
Let me recapitulate the whole issue of sharing GPUs (physically, esp. their RAM) and sharing Tensorflow sessions (logically, when run via API in the same preloading workflow server):
Normal computing resources like CPU, RAM and disk are shared naturally by the OS' task and I/O scheduler, the CPU's MMU and the disk driver's scheduler. Oversubscription can quickly become inefficient, but is easily avoidable by harmonizing the number of parallel jobs with the number of physical cores and the RAM available to them. Still, it can be worth risking transient oversubscription in exchange for a higher average resource utilization.
For GPUs however, it's not that simple: GPU RAM is not normally paged (because that would make it slow) or even swapped, hence it is an exclusive resource which may result in OOM errors, and when processes need to wait for shaders, the benefit of using the GPU over the CPU in the first place might vanish. Therefore, a runtime system / framework like OCR-D needs to take extra care of strictly preventing GPU oversubscription, as even transient oversubscription usually leads to OOM failures. However, it's rather hard to anticipate what GPU resources a certain workflow configuration will need (both on average and at peak).
In the OCR-D makefilization, I chose to deal with GPU-enabled workflow steps bluntly by marking them as such and synchronizing all GPU-enabled processor runs via a semaphore (discounting current runners against the number of physical GPUs). But exclusive GPU locks/semaphores stop working when the processors get preloaded into memory, and often multiple processors could actually share a single GPU – it depends on how they are programmed and what size the models (or even the input images!) are.
For Tensorflow, the normal mode of operation is to allocate all available GPU RAM on startup and thus use it exclusively throughout the process' lifetime. But there are more options:
- ConfigProto().gpu_options.allow_growth: allocates dynamically as needed (slower during the first run; "needed" can still mean more than strictly necessary; no de-allocation; can still yield OOM)
- ConfigProto().gpu_options.per_process_gpu_memory_fraction: allocates no more than the given fraction of physical memory (much slower during the first run and slightly slower when competing for memory; can still yield OOM)
- ConfigProto().gpu_options.experimental.use_unified_memory: enables paging of memory to CPU (much slower on the first run and slightly slower when competing for memory; no more OOM)
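Set in code, these options look roughly as follows (TF1 API; the values are arbitrary examples, and normally one would pick only one of the three strategies):

import tensorflow as tf  # TF1-style API (tf.compat.v1 under TF2)

config = tf.ConfigProto()
# grow the allocation on demand instead of grabbing all GPU RAM at startup:
config.gpu_options.allow_growth = True
# or: cap this process at a fixed fraction of the physical GPU RAM:
config.gpu_options.per_process_gpu_memory_fraction = 0.25
# or: allow paging GPU memory to CPU RAM (avoids OOM, at a speed cost):
config.gpu_options.experimental.use_unified_memory = True
session = tf.Session(config=config)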
So it now depends on how cleverly the processors are programmed, how many of them we want to run (sequentially in the same workflow, or in parallel) and how large the models/images are. I believe OCR-D needs to come up with conventions for advertising the number of GPU consumers and for writing processors that share resources cooperatively (without giving up too much performance). If we have enough runtime configuration facilities, then the user/admin can at least dimension and optimise by experiment.
Of course, additionally having processing servers for the individual processors would also help better control resource allocation: each such server could be statically configured for full utilization and then dynamically distribute the workload from all processors (across pages / workflows / workspaces) in parallel (over shaders/memory) and sequentially (over a queue). Plus preloading and isolation would not fall onto the workflow server's shoulders anymore.
One of the largest advantages of using the workflow server is thus probably not reduced overhead (when you have small workspaces/documents) or latency, but the ability to scale across machines: as a network service, it can easily be deployed across a Docker swarm, for example.
In 6263bb1 I have started implementing the processing server pattern. It is completely independent of the workflow server for now, so the latter still needs to be adapted to make proper use of the former. This could help overcome some of the problems with the current workflow server approach:
- not require the processors to be in the same venv (or even be Pythonic) anymore,
- not require monkey-patching Python classes anymore,
- not stick multiple tasks in the same process anymore (risking interference as in the Tensorflow case).
But the potential merits reach even further:
- run different parts of the workflow with different parallelism (slower tasks could be assigned more cores; GPU tasks could be assigned the exact number of GPU resources available while others get to see more CPU cores)
- orchestrating workflows incrementally and with signalling (progress meters) and timeouts
- encapsulating processor servers as Docker services, docker-composing (or even docker-swarming) workflows
(which means that ocrd/all won't need to be a fat container anymore, and individual processor images can be based on the required CUDA runtime; currently we can only make exactly one CUDA/TF version work at the same time)
Some of the design decisions which will have to be made now:
- set up processing servers in advance (externally/separately) or managed by workflow server
  - if external: task descriptions should contain the --server parameters
  - if managed: decide which ports to run them on, try setting them up, run a ring-back service and wait for startup notification, timeout and fall back to CLI, teardown managed PS PIDs/ports on WS termination
- what to do with non-Pythonic processors
  - integrate a simple but transparent / automatic socat-based server with means of the shell into bashlib
  - ignore them and always run as CLI
Hi @bertsky, after reading through the PR, this is my understanding about it. Please correct me if I'm wrong.
Understanding
Whenever we start a workflow server, we have to define a workflow for it. Since the workflow is defined at instantiation time, one server can run one workflow only. A workflow here is a series of commands to trigger OCR processors with the respective input and output. So, the syntax is like this:
ocrd workflow server -p <port> <workflow>
Advantage
The benefit of this approach is that all necessary processors are already loaded into memory and ready to be used. So, if we want to run 1 workflow multiple times, this approach will save us time from loading and unloading models. But if we want to run another workflow, then we have to start another server with ocrd workflow server.
Disadvantage
Imagine we have a use case where users have workflow descriptions (in whatever language we can parse) and want to execute them on our infrastructure. This ocrd workflow server is not an appropriate approach, since its advantage lies in the fact that the workflow is fixed (so that it can be loaded once and stay in memory "forever"), 1 workflow - 1 server.
@tdoan2010 your understanding of the workflow server is correct. This PR also implements a processing server, but you did not address that part.
To your interpretation:
Disadvantage
Imagine we have a use case where users have workflow descriptions (in whatever language we can parse) and want to execute them on our infrastructure. This ocrd workflow server is not an appropriate approach, since its advantage lies in the fact that the workflow is fixed (so that it can be loaded once and stay in memory "forever"), 1 workflow - 1 server.
The current implementation also parses workflow descriptions (in the syntax of ocrd process). There is no loss of generality here – future syntaxes can of course be integrated as well.
Further, I fail to see the logic of why this is inappropriate for dynamically defined workflows. You can always start up another workflow server at run time if necessary. In fact, such servers could be automatically managed (set up / torn down, swarmed) by a higher-level API service. And you don't lose anything by starting up all processors once and then processing. You only gain speed (starting with the second workspace you run on).
As argued in the original discussion memo cited at the top, overhead due to initialization is significant in OCR-D, especially with GPU processors, and the only way around this is workflow servers and/or processing servers.