`ocrd resmgr download '*'` weird behavior
When running ocrd resmgr download '*' in the latest ocrd_all Docker image, only some models are installed.
E.g. the ocrd-tesserocr-recognize models are missing entirely, whereas ocrd resmgr download ocrd-tesserocr-recognize '*' works as expected.
So something is wrong with iterating over the processors in the wildcard case.
What does the resmgr log say?
Nothing interesting: it only logs what it is downloading, not what it is supposed to be downloading or how it decided which processors should be included. I'll add such a log statement when debugging.
Here is a snippet from my sbatch script that downloads all models:
singularity exec --bind "${SCRATCH_OCRD_MODELS_BASE}:/usr/local/share" "${SIF_PATH}" ocrd resmgr download '*'
singularity exec --bind "${SCRATCH_OCRD_MODELS_BASE}:/usr/local/share" "${SIF_PATH}" ocrd resmgr download ocrd-tesserocr-recognize '*'
In the scratch storage of the HPC environment
${SCRATCH_OCRD_MODELS_BASE} = /scratch1/users/mmustaf/ocrd_models
gwdu101:127 16:11:22 /scratch1/users/mmustaf/ocrd_models > du -ha
512 ./tessdata/configs/digits
512 ./tessdata/configs/box.train
512 ./tessdata/configs/unlv
512 ./tessdata/configs/hocr
512 ./tessdata/configs/pdf
512 ./tessdata/configs/ambigs.train
512 ./tessdata/configs/kannada
512 ./tessdata/configs/get.images
512 ./tessdata/configs/makebox
512 ./tessdata/configs/alto
512 ./tessdata/configs/linebox
512 ./tessdata/configs/api_config
512 ./tessdata/configs/bigram
512 ./tessdata/configs/bazaar
512 ./tessdata/configs/txt
512 ./tessdata/configs/lstmbox
512 ./tessdata/configs/tsv
512 ./tessdata/configs/logfile
512 ./tessdata/configs/box.train.stderr
512 ./tessdata/configs/quiet
512 ./tessdata/configs/wordstrbox
512 ./tessdata/configs/lstm.train
512 ./tessdata/configs/rebox
512 ./tessdata/configs/Makefile.am
512 ./tessdata/configs/inter
512 ./tessdata/configs/strokewidth
512 ./tessdata/configs/lstmdebug
14K ./tessdata/configs
2,2M ./tessdata/equ.traineddata
1,1M ./tessdata/Fraktur_GT4HistOCR.traineddata
11M ./tessdata/Fraktur.traineddata
4,2M ./tessdata/ONB.traineddata
4,0M ./tessdata/eng.traineddata
11M ./tessdata/osd.traineddata
6,2M ./tessdata/frk.traineddata
1,5M ./tessdata/deu.traineddata
3,3M ./tessdata/frak2021.traineddata
86M ./tessdata/Latin.traineddata
128M ./tessdata
80M ./ocrd-resources/ocrd-cis-ocropy-recognize/en-default.pyrnn.gz
17M ./ocrd-resources/ocrd-cis-ocropy-recognize/LatinHist.pyrnn.gz
42M ./ocrd-resources/ocrd-cis-ocropy-recognize/fraktur.pyrnn.gz
2,9M ./ocrd-resources/ocrd-cis-ocropy-recognize/fraktur-jze.pyrnn.gz
141M ./ocrd-resources/ocrd-cis-ocropy-recognize
18M ./ocrd-resources/ocrd-calamari-recognize/zpd-fraktur19/4.ckpt.h5
29K ./ocrd-resources/ocrd-calamari-recognize/zpd-fraktur19/1.ckpt.json
18M ./ocrd-resources/ocrd-calamari-recognize/zpd-fraktur19/0.ckpt.h5
29K ./ocrd-resources/ocrd-calamari-recognize/zpd-fraktur19/4.ckpt.json
29K ./ocrd-resources/ocrd-calamari-recognize/zpd-fraktur19/3.ckpt.json
29K ./ocrd-resources/ocrd-calamari-recognize/zpd-fraktur19/2.ckpt.json
18M ./ocrd-resources/ocrd-calamari-recognize/zpd-fraktur19/1.ckpt.h5
29K ./ocrd-resources/ocrd-calamari-recognize/zpd-fraktur19/0.ckpt.json
18M ./ocrd-resources/ocrd-calamari-recognize/zpd-fraktur19/2.ckpt.h5
18M ./ocrd-resources/ocrd-calamari-recognize/zpd-fraktur19/3.ckpt.h5
89M ./ocrd-resources/ocrd-calamari-recognize/zpd-fraktur19
19M ./ocrd-resources/ocrd-calamari-recognize/zpd-latin-script-hist-3/4.ckpt.h5
47K ./ocrd-resources/ocrd-calamari-recognize/zpd-latin-script-hist-3/1.ckpt.json
19M ./ocrd-resources/ocrd-calamari-recognize/zpd-latin-script-hist-3/0.ckpt.h5
47K ./ocrd-resources/ocrd-calamari-recognize/zpd-latin-script-hist-3/4.ckpt.json
47K ./ocrd-resources/ocrd-calamari-recognize/zpd-latin-script-hist-3/3.ckpt.json
47K ./ocrd-resources/ocrd-calamari-recognize/zpd-latin-script-hist-3/2.ckpt.json
19M ./ocrd-resources/ocrd-calamari-recognize/zpd-latin-script-hist-3/1.ckpt.h5
47K ./ocrd-resources/ocrd-calamari-recognize/zpd-latin-script-hist-3/0.ckpt.json
19M ./ocrd-resources/ocrd-calamari-recognize/zpd-latin-script-hist-3/2.ckpt.h5
19M ./ocrd-resources/ocrd-calamari-recognize/zpd-latin-script-hist-3/3.ckpt.h5
92M ./ocrd-resources/ocrd-calamari-recognize/zpd-latin-script-hist-3
19M ./ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0/2.ckpt.h5
24K ./ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0/1.ckpt.json
24K ./ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0/2.ckpt.json
24K ./ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0/4.ckpt.json
19M ./ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0/1.ckpt.h5
19M ./ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0/4.ckpt.h5
19M ./ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0/3.ckpt.h5
24K ./ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0/0.ckpt.json
24K ./ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0/3.ckpt.json
19M ./ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0/0.ckpt.h5
92M ./ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0
272M ./ocrd-resources/ocrd-calamari-recognize
147M ./ocrd-resources/ocrd-sbb-binarize/default-2021-03-09/model_bin_sbb_ens.h5
147M ./ocrd-resources/ocrd-sbb-binarize/default-2021-03-09
147M ./ocrd-resources/ocrd-sbb-binarize
560M ./ocrd-resources
687M .
For comparison, here are the models downloaded with an older version of ocrd/all:maximum (not sure which one exactly; the latest one in January), when the ocrd-tesserocr-recognize models were still located under the ocrd-resources folder:
docker run --rm -v "/home/cloud/ocrd_models/:/usr/local/share/ocrd-resources" -- ocrd/all:maximum ocrd resmgr download '*'
In the Operandi live VM:
cloud@operandi-live:~/ocrd_models$ du -ha
2,8M ./ocrd-kraken-recognize/en_best.mlmodel
2,9M ./ocrd-kraken-recognize
438M ./ocrd-sbb-textline-detector/default/model_page_mixed_best.h5
438M ./ocrd-sbb-textline-detector/default/model_textline_new.h5
438M ./ocrd-sbb-textline-detector/default/model_strukturerkennung.h5
1,3G ./ocrd-sbb-textline-detector/default
1,3G ./ocrd-sbb-textline-detector
4,0K ./ocrd-anybaseocr-dewarp/latest_net_G.pth
8,0K ./ocrd-anybaseocr-dewarp
1,5M ./ocrd-tesserocr-recognize/deu.traineddata
2,2M ./ocrd-tesserocr-recognize/equ.traineddata
11M ./ocrd-tesserocr-recognize/Fraktur.traineddata
4,0M ./ocrd-tesserocr-recognize/eng.traineddata
6,2M ./ocrd-tesserocr-recognize/frk.traineddata
11M ./ocrd-tesserocr-recognize/osd.traineddata
3,3M ./ocrd-tesserocr-recognize/frak2021.traineddata
1,1M ./ocrd-tesserocr-recognize/Fraktur_GT4HistOCR.traineddata
4,2M ./ocrd-tesserocr-recognize/ONB.traineddata
4,0K ./ocrd-tesserocr-recognize/configs/get.images
4,0K ./ocrd-tesserocr-recognize/configs/lstmdebug
4,0K ./ocrd-tesserocr-recognize/configs/box.train
4,0K ./ocrd-tesserocr-recognize/configs/Makefile.am
4,0K ./ocrd-tesserocr-recognize/configs/lstmbox
4,0K ./ocrd-tesserocr-recognize/configs/api_config
4,0K ./ocrd-tesserocr-recognize/configs/kannada
4,0K ./ocrd-tesserocr-recognize/configs/wordstrbox
4,0K ./ocrd-tesserocr-recognize/configs/bazaar
4,0K ./ocrd-tesserocr-recognize/configs/box.train.stderr
4,0K ./ocrd-tesserocr-recognize/configs/strokewidth
4,0K ./ocrd-tesserocr-recognize/configs/txt
4,0K ./ocrd-tesserocr-recognize/configs/linebox
4,0K ./ocrd-tesserocr-recognize/configs/unlv
4,0K ./ocrd-tesserocr-recognize/configs/lstm.train
4,0K ./ocrd-tesserocr-recognize/configs/hocr
4,0K ./ocrd-tesserocr-recognize/configs/digits
4,0K ./ocrd-tesserocr-recognize/configs/logfile
4,0K ./ocrd-tesserocr-recognize/configs/inter
4,0K ./ocrd-tesserocr-recognize/configs/pdf
4,0K ./ocrd-tesserocr-recognize/configs/bigram
4,0K ./ocrd-tesserocr-recognize/configs/quiet
4,0K ./ocrd-tesserocr-recognize/configs/alto
4,0K ./ocrd-tesserocr-recognize/configs/tsv
4,0K ./ocrd-tesserocr-recognize/configs/makebox
4,0K ./ocrd-tesserocr-recognize/configs/rebox
4,0K ./ocrd-tesserocr-recognize/configs/ambigs.train
112K ./ocrd-tesserocr-recognize/configs
86M ./ocrd-tesserocr-recognize/Latin.traineddata
128M ./ocrd-tesserocr-recognize
4,9M ./ocrd-kraken-segment/blla.mlmodel
4,9M ./ocrd-kraken-segment
438M ./ocrd-sbb-binarize/default/model_bin3.h5
438M ./ocrd-sbb-binarize/default/model_bin2.h5
438M ./ocrd-sbb-binarize/default/model_bin1.h5
438M ./ocrd-sbb-binarize/default/model_bin4.h5
1,8G ./ocrd-sbb-binarize/default
147M ./ocrd-sbb-binarize/default-2021-03-09/model_bin_sbb_ens.h5
147M ./ocrd-sbb-binarize/default-2021-03-09
1,9G ./ocrd-sbb-binarize
4,0K ./ocrd-anybaseocr-tiseg/seg_model/assets
4,1M ./ocrd-anybaseocr-tiseg/seg_model/saved_model.pb
63M ./ocrd-anybaseocr-tiseg/seg_model/variables/variables.data-00001-of-00002
100K ./ocrd-anybaseocr-tiseg/seg_model/variables/variables.data-00000-of-00002
20K ./ocrd-anybaseocr-tiseg/seg_model/variables/variables.index
63M ./ocrd-anybaseocr-tiseg/seg_model/variables
67M ./ocrd-anybaseocr-tiseg/seg_model
67M ./ocrd-anybaseocr-tiseg
2,9M ./ocrd-cis-ocropy-recognize/fraktur-jze.pyrnn.gz
17M ./ocrd-cis-ocropy-recognize/LatinHist.pyrnn.gz
42M ./ocrd-cis-ocropy-recognize/fraktur.pyrnn.gz
80M ./ocrd-cis-ocropy-recognize/en-default.pyrnn.gz
141M ./ocrd-cis-ocropy-recognize
18M ./ocrd-calamari-recognize/zpd-fraktur19/0.ckpt.h5
32K ./ocrd-calamari-recognize/zpd-fraktur19/3.ckpt.json
18M ./ocrd-calamari-recognize/zpd-fraktur19/3.ckpt.h5
18M ./ocrd-calamari-recognize/zpd-fraktur19/1.ckpt.h5
32K ./ocrd-calamari-recognize/zpd-fraktur19/1.ckpt.json
32K ./ocrd-calamari-recognize/zpd-fraktur19/0.ckpt.json
18M ./ocrd-calamari-recognize/zpd-fraktur19/4.ckpt.h5
32K ./ocrd-calamari-recognize/zpd-fraktur19/2.ckpt.json
18M ./ocrd-calamari-recognize/zpd-fraktur19/2.ckpt.h5
32K ./ocrd-calamari-recognize/zpd-fraktur19/4.ckpt.json
89M ./ocrd-calamari-recognize/zpd-fraktur19
19M ./ocrd-calamari-recognize/qurator-gt4histocr-1.0/0.ckpt.h5
24K ./ocrd-calamari-recognize/qurator-gt4histocr-1.0/3.ckpt.json
19M ./ocrd-calamari-recognize/qurator-gt4histocr-1.0/3.ckpt.h5
19M ./ocrd-calamari-recognize/qurator-gt4histocr-1.0/1.ckpt.h5
24K ./ocrd-calamari-recognize/qurator-gt4histocr-1.0/1.ckpt.json
24K ./ocrd-calamari-recognize/qurator-gt4histocr-1.0/0.ckpt.json
19M ./ocrd-calamari-recognize/qurator-gt4histocr-1.0/4.ckpt.h5
24K ./ocrd-calamari-recognize/qurator-gt4histocr-1.0/2.ckpt.json
19M ./ocrd-calamari-recognize/qurator-gt4histocr-1.0/2.ckpt.h5
24K ./ocrd-calamari-recognize/qurator-gt4histocr-1.0/4.ckpt.json
92M ./ocrd-calamari-recognize/qurator-gt4histocr-1.0
19M ./ocrd-calamari-recognize/zpd-latin-script-hist-3/0.ckpt.h5
48K ./ocrd-calamari-recognize/zpd-latin-script-hist-3/3.ckpt.json
19M ./ocrd-calamari-recognize/zpd-latin-script-hist-3/3.ckpt.h5
19M ./ocrd-calamari-recognize/zpd-latin-script-hist-3/1.ckpt.h5
48K ./ocrd-calamari-recognize/zpd-latin-script-hist-3/1.ckpt.json
48K ./ocrd-calamari-recognize/zpd-latin-script-hist-3/0.ckpt.json
19M ./ocrd-calamari-recognize/zpd-latin-script-hist-3/4.ckpt.h5
48K ./ocrd-calamari-recognize/zpd-latin-script-hist-3/2.ckpt.json
19M ./ocrd-calamari-recognize/zpd-latin-script-hist-3/2.ckpt.h5
48K ./ocrd-calamari-recognize/zpd-latin-script-hist-3/4.ckpt.json
92M ./ocrd-calamari-recognize/zpd-latin-script-hist-3
272M ./ocrd-calamari-recognize
4,0K ./ocrd-anybaseocr-block-segmentation/block_segmentation_weights.h5
8,0K ./ocrd-anybaseocr-block-segmentation
28M ./ocrd-typegroups-classifier/densenet121.tgc
28M ./ocrd-typegroups-classifier
147M ./ocrd-eynollah-segment/default/model_tables_ens_mixed_new_2.h5
147M ./ocrd-eynollah-segment/default/model_textline_newspapers.h5
147M ./ocrd-eynollah-segment/default/model_main_covid19_lr5-5_scale_1_1_great.h5
147M ./ocrd-eynollah-segment/default/model_page_mixed_best.h5
127M ./ocrd-eynollah-segment/default/model_enhancement.h5
147M ./ocrd-eynollah-segment/default/model_bin_sbb_ens.h5
147M ./ocrd-eynollah-segment/default/model_3up_new_good_no_augmentation.h5
99M ./ocrd-eynollah-segment/default/model_scale_classifier.h5
147M ./ocrd-eynollah-segment/default/model_no_patches_class0_30eopch.h5
147M ./ocrd-eynollah-segment/default/model_main_home_corona3_rot.h5
147M ./ocrd-eynollah-segment/default/model_ensemble_s.h5
1,6G ./ocrd-eynollah-segment/default
1,6G ./ocrd-eynollah-segment
4,0K ./ocrd-anybaseocr-layout-analysis/structure_analysis/assets
14M ./ocrd-anybaseocr-layout-analysis/structure_analysis/saved_model.pb
29M ./ocrd-anybaseocr-layout-analysis/structure_analysis/variables/variables.data-00001-of-00002
248K ./ocrd-anybaseocr-layout-analysis/structure_analysis/variables/variables.data-00000-of-00002
44K ./ocrd-anybaseocr-layout-analysis/structure_analysis/variables/variables.index
30M ./ocrd-anybaseocr-layout-analysis/structure_analysis/variables
43M ./ocrd-anybaseocr-layout-analysis/structure_analysis
4,0K ./ocrd-anybaseocr-layout-analysis/mapping_densenet.pickle
43M ./ocrd-anybaseocr-layout-analysis
5,4G .
Far fewer models are downloaded than before: the total size is now just 687 MB, where it used to be around 5.4 GB. The models of some processors (e.g. ocrd-kraken-recognize, ocrd-eynollah-segment, ocrd-typegroups-classifier) are now missing entirely.
It's clear the reason for this is that ResourceManager.list_available only returns database results – it does not look up all ocrd- executables in PATH. (For comparison, ResourceManager.list_installed returns database results and all resource location paths with ocrd- prefix, which is somewhat better, but still misses out on processors' module locations, as in ocrd_tesserocr.) The database then is simply the distributed resource_list.yml plus any user resources.yml. At no time do we guarantee that the latter gets filled from PATH dynamically!
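To make the missing PATH lookup concrete, here is a minimal sketch of how one could enumerate all ocrd- executables. This is not what ResourceManager does today; the function name find_ocrd_executables is made up for illustration and only uses the Python standard library.

```python
import os


def find_ocrd_executables(path=None):
    """Return the sorted names of executable files called ocrd-* on PATH.

    `path` is a PATH-style string; it defaults to the PATH environment
    variable. This helper is hypothetical and not part of OCR-D core.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    found = set()
    for directory in path.split(os.pathsep):
        # Skip empty PATH entries and entries that are not directories
        if not directory or not os.path.isdir(directory):
            continue
        with os.scandir(directory) as entries:
            for entry in entries:
                if (entry.name.startswith("ocrd-")
                        and entry.is_file()
                        and os.access(entry.path, os.X_OK)):
                    found.add(entry.name)
    return sorted(found)


if __name__ == "__main__":
    for name in find_ocrd_executables():
        print(name)
```

Comparing this list with the processors that resmgr actually iterates over would show directly which ones the wildcard case skips.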
I cannot find when exactly this broke, but this change looks somewhat fishy.
Since we never know when the user installs (additional) processor modules, and the database files can be out of date (as is currently the case with the distributed resource_list.yml which still contains sbb-textline-detector), IMO the correct behaviour would be:
- list-available *: unless short-circuited with ocrd-all-tool.json, and unless dynamic=False, look up all ocrd- executables in PATH via --dump-json, add their resource specs to the user database, and then output all known resources
- list-installed *: unless short-circuited with ocrd-all-tool.json, and unless dynamic=False, look up all ocrd- executables in PATH via --dump-json, add their resource specs to the user database, and then look up all known resource locations
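The dynamic part of this proposal could look roughly like the following sketch. It assumes that each processor's --dump-json output is a JSON object with an optional resources list whose entries have a name key, and it models the user database as a plain dict. Both are simplifications; dump_tool_json and collect_resources are hypothetical names, not existing core API.

```python
import json
import subprocess


def dump_tool_json(executable):
    """Run `<executable> --dump-json` and return the parsed JSON object."""
    result = subprocess.run(
        [executable, "--dump-json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def collect_resources(executables, database=None, dump=dump_tool_json):
    """Merge the resource specs of each processor into `database`.

    `database` maps executable name -> list of resource spec dicts
    (an assumed, simplified stand-in for the user resources.yml).
    Entries already in the database win over dynamically found ones.
    """
    database = {} if database is None else database
    for executable in executables:
        try:
            tool = dump(executable)
        except (OSError, subprocess.CalledProcessError, json.JSONDecodeError):
            continue  # skip processors that cannot report their tool JSON
        known = {spec.get("name") for spec in database.get(executable, [])}
        for spec in tool.get("resources", []):
            if spec.get("name") not in known:
                database.setdefault(executable, []).append(spec)
                known.add(spec.get("name"))
    return database
```

The dump parameter is only there so that the lookup can be replaced, e.g. by a query into ocrd-all-tool.json or by a stub in tests, without spawning one subprocess per processor.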
Speaking of short-circuiting with ocrd-all-tool.json: we do not have a dedicated issue for that, but since it's probably tied to the solution here, anyway: The idea would be to have a lookup mechanism like for ocrd_logging.conf (i.e. system location, XDG-based user location, CWD) as an opt-in for ocrd-all-tool.json. If that file can be found, then replace all dynamic lookups with queries into the list of all tools and their resources. (Of course, relying on that file creates new problems like keeping ocrd-all-tool.json up to date if you install more tools, but let's first concentrate on the substantial performance gains that this will yield.)
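A rough sketch of such an opt-in lookup follows. The concrete paths (an ocrd subdirectory under the XDG config directory, /etc as the system location) and the search order (CWD first, then user, then system) are assumptions made for illustration only; find_all_tool_json is a hypothetical name.

```python
import os
from pathlib import Path

FILENAME = "ocrd-all-tool.json"


def find_all_tool_json(cwd=None, env=None, system_dir="/etc"):
    """Return the first existing ocrd-all-tool.json, or None if there is none.

    Search order (assumed): current directory, XDG-based user config
    directory, system directory. Returning None means: stay dynamic.
    """
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)
    # XDG_CONFIG_HOME falls back to ~/.config when unset or empty
    xdg_config = Path(env.get("XDG_CONFIG_HOME") or Path.home() / ".config")
    candidates = [
        cwd / FILENAME,
        xdg_config / "ocrd" / FILENAME,  # hypothetical user location
        Path(system_dir) / FILENAME,     # hypothetical system location
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

If the function returns a path, all dynamic lookups would be replaced by queries into that file; if it returns None, nothing changes for the user.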
I've opened a separate issue for the ocrd-all-tool.json aspect in https://github.com/OCR-D/core/issues/1059