core
Processor.resolve_resource: support on-demand download of URL values
With this in place, users can pass a URL directly as a parameter value:
ocrd-tesserocr-recognize -P model https://github.com/tesseract-ocr/tessdata_best/raw/main/bos.traineddata
The processor should then download the file on demand the first time it encounters the URL and register the URL in the user's resource_list.yml. Subsequent calls will use the cached download.
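The intended behaviour (download on first use, register the URL, reuse the cached file afterwards) can be sketched roughly as follows. This is a minimal illustration, not the actual Processor.resolve_resource code: the function names, the cache directory layout and the JSON registry (a stand-in for resource_list.yml) are all assumptions made for the example.

```python
import json
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def fetch_url(url: str, dest: Path) -> None:
    """Download url to dest (default fetcher; replaceable in tests)."""
    with urllib.request.urlopen(url) as response:
        dest.write_bytes(response.read())


def resolve_resource(value: str, cache_dir: Path, fetch=fetch_url) -> Path:
    """Return a local path for value, downloading it first if it is a URL.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https"):
        # Not a URL: treat the value as an ordinary local path.
        return Path(value)

    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: a JSON file stands in for the user's resource_list.yml.
    registry_file = cache_dir / "registry.json"
    registry = json.loads(registry_file.read_text()) if registry_file.exists() else {}

    cached = registry.get(value)
    if cached and Path(cached).exists():
        # Subsequent calls: reuse the cached download.
        return Path(cached)

    # First call: download, then register the URL for later lookups.
    # Assumes the URL path ends in a file name.
    target = cache_dir / Path(parsed.path).name
    fetch(value, target)
    registry[value] = str(target)
    registry_file.write_text(json.dumps(registry, indent=2))
    return target
```

Passing the fetcher as an argument keeps the caching logic testable without network access; a real implementation would presumably go through the existing resource manager instead.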
In practice though I cannot seem to find an example where this works:
ocrd_{tesserocr,cis-ocropy} have a different mechanism of model storage. It's still compatible with ocrd resmgr download in tesserocr's case but does not use the self.resolve_resource method this PR extends.
ocrd_calamari requires a directory of files, or an archive, which is too complex to do on demand in a generalized way IMHO.
ocrd-page-transform is a bashlib processor and won't support this.
So if anybody has a good idea on how to test and/or generalize this to make it available to all the processors, pls let me know.
In practice though I cannot seem to find an example where this works:
ocrd_{tesserocr,cis-ocropy} have a different mechanism of model storage. It's still compatible with ocrd resmgr download in tesserocr's case but does not use the self.resolve_resource method this PR extends
Yes. For Ocropy recognition, your https://github.com/cisocrgroup/ocrd_cis/pull/83 is long overdue.
And for Tesseract, I believe your https://github.com/OCR-D/ocrd_tesserocr/pull/176 could be rewritten so that, instead of overriding the constructor (for the list_resources and show_resource cases), one would directly override module_dir. Then, assuming core gets a mechanism to ensure that list_all_resources and resolve_resource abide by Processor.ocrd_tool['resource_locations'], only the files from the module directory will automagically survive.
ocrd_calamari requires a directory of files, or an archive which is too complex to do on demand in a generalized way IMHO
Yes, GitHub directory downloads would be hard to implement. But IMO we can assume that model deployment for Calamari, eynollah and sbb_binarize will involve release archives in the future.
ocrd-page-transform is a bashlib processor and won't support this.
With the recent changes to bashlib, this should work out of the box, though. (What happens is that it delegates to ocrd.cli.ocrd_tool's list-resources and show-resource, which in turn add the ocrd-tool.json directory, but also do the usual lookup under the other resource locations. Since we do not have an ocrd__resolve_resource builtin in bashlib yet, I delegate to ocrd__list_resources. But to ensure maximum interoperability, I just committed https://github.com/bertsky/workflow-configuration/commit/9f68fe83debb14a7948a0db026cb1794523c686b.)
A Pythonic test scenario would be running (and thereby downloading) blla.mlmodel under https://github.com/OCR-D/ocrd_kraken/pull/33, or one of the ocrd_detectron2 config/model combinations.