Processor.resolve_resource: support on-demand download of URL values

Open kba opened this issue 3 years ago • 1 comments

With this in place, users can pass a URL directly as a parameter value:

ocrd-tesserocr-recognize -P model https://github.com/tesseract-ocr/tessdata_best/raw/main/bos.traineddata

and the processor should download the file on demand the first time it encounters the URL, registering it in the user's resource_list.yml. Subsequent calls will use the cached download.
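The download-once-then-cache behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code of this PR: the function name, the `fetch` parameter and the JSON registry file (standing in for resource_list.yml) are all hypothetical.

```python
import json
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def http_fetch(url):
    """Default fetcher: return the body of `url` as bytes."""
    with urlopen(url) as response:
        return response.read()


def resolve_resource(value, cache_dir, fetch=http_fetch):
    """Return a local path for `value`, downloading it first if it is a URL.

    Hypothetical sketch: the registry is a JSON file standing in for the
    user's resource_list.yml.
    """
    if urlparse(value).scheme not in ("http", "https"):
        return Path(value)  # plain name or path: nothing to download

    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    registry_file = cache_dir / "resource_list.json"
    registry = json.loads(registry_file.read_text()) if registry_file.exists() else {}

    if value in registry and Path(registry[value]).exists():
        return Path(registry[value])  # cached by an earlier call

    # First encounter: download, store and register the URL.
    # Note: two URLs with the same file name would overwrite each other here.
    name = Path(urlparse(value).path).name or "resource"
    target = cache_dir / name
    target.write_bytes(fetch(value))
    registry[value] = str(target)
    registry_file.write_text(json.dumps(registry, indent=2))
    return target
```

Passing the fetcher as a parameter keeps the caching logic testable without network access.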

In practice though I cannot seem to find an example where this works:

  • ocrd_{tesserocr,cis-ocropy} have a different mechanism of model storage. It's still compatible with ocrd resmgr download in tesserocr's case but does not use the self.resolve_resource method this PR extends
  • ocrd_calamari requires a directory of files, or an archive which is too complex to do on demand in a generalized way IMHO
  • ocrd-page-transform is a bashlib processor and won't support this.

So if anybody has a good idea on how to test and/or generalize this to make it available to all the processors, pls let me know.

kba, Feb 14 '22 09:02

> In practice though I cannot seem to find an example where this works:
>
> • ocrd_{tesserocr,cis-ocropy} have a different mechanism of model storage. It's still compatible with ocrd resmgr download in tesserocr's case but does not use the self.resolve_resource method this PR extends

Yes. For Ocropy recognition, your https://github.com/cisocrgroup/ocrd_cis/pull/83 is long overdue.

And for Tesseract, I believe your https://github.com/OCR-D/ocrd_tesserocr/pull/176 could be rewritten such that instead of overriding the constructor (for the list_resources and show_resource cases), one would directly override module_dir, so (assuming core will have a mechanism of ensuring that list_all_resources and resolve_resource abide by its Processor.ocrd_tool['resource_locations']) everything will automagically make only the files from the module directory survive.
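The idea of overriding module_dir rather than the constructor might look roughly like this. It is a minimal sketch with a hypothetical stand-in base class, not the real ocrd Processor; the use of the TESSDATA_PREFIX environment variable as the model directory is likewise an assumption made for illustration.

```python
import os
from pathlib import Path


class Processor:
    """Hypothetical stand-in for a processor base class (not the real ocrd API)."""

    @property
    def module_dir(self):
        # Assumed default: some directory shipped with the module.
        return Path.cwd()

    def list_all_resources(self):
        # Only files found under module_dir are reported.
        return sorted(p.name for p in self.module_dir.iterdir() if p.is_file())

    def resolve_resource(self, name):
        # Lookup is restricted to module_dir as well.
        candidate = self.module_dir / name
        if candidate.is_file():
            return candidate
        raise FileNotFoundError(name)


class TesseractLikeProcessor(Processor):
    """Overrides only module_dir; the constructor is inherited unchanged."""

    @property
    def module_dir(self):
        # Assumption: the model directory is taken from TESSDATA_PREFIX.
        return Path(os.environ["TESSDATA_PREFIX"])
```

Because both lookups go through module_dir, the subclass needs no special handling for the list and show cases.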

> • ocrd_calamari requires a directory of files, or an archive which is too complex to do on demand in a generalized way IMHO

Yes, GitHub directory downloads would be hard to implement. But IMO we can assume that model deployment for Calamari, eynollah and sbb_binarize will involve release archives in the future.

> • ocrd-page-transform is a bashlib processor and won't support this.

With the recent changes to bashlib, this should work out of the box, though. (What happens is that it delegates to ocrd.cli.ocrd_tool's list-resources and show-resource, which in turn add the ocrd-tool.json directory, but also do the usual lookup under the other resource locations. Since we do not have an ocrd__resolve_resource builtin in bashlib yet, I delegate to ocrd__list_resources. But to ensure maximum interoperability, I just committed https://github.com/bertsky/workflow-configuration/commit/9f68fe83debb14a7948a0db026cb1794523c686b.)

A Pythonic test scenario would be running a processor that downloads blla.mlmodel on demand under https://github.com/OCR-D/ocrd_kraken/pull/33, or doing the same with one of the ocrd_detectron2 config/model combinations.

bertsky, Feb 14 '22 13:02