ocrd_anybaseocr
ocrd-anybaseocr-tiseg not applying default wiring
The --help of ocrd-anybaseocr-tiseg states a default wiring of ['OCR-D-IMG-CROP'] -> ['OCR-D-SEG-TISEG'].
root@38fa7aad0b43:/data/ocrd_workspace# ocrd-anybaseocr-tiseg --help
Using TensorFlow backend.
Usage: ocrd-anybaseocr-tiseg [OPTIONS]
separate text and non-text part with anyBaseOCR
Options:
-V, --version Show version
-l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
Log level
-J, --dump-json Dump tool description as JSON and exit
-p, --parameter TEXT Parameters, either JSON string or path
JSON file
-g, --page-id TEXT ID(s) of the pages to process
-O, --output-file-grp TEXT File group(s) used as output.
-I, --input-file-grp TEXT File group(s) used as input.
-w, --working-dir TEXT Working Directory
-m, --mets TEXT METS to process
-h, --help This help message
Parameters:
"operation_level" [string - page] PAGE XML hierarchy level to operate
on Possible values: ["page", "region", "line"]
Default Wiring:
['OCR-D-IMG-CROP'] -> ['OCR-D-SEG-TISEG']
The workspace contains a file group named OCR-D-IMG-CROP, and a corresponding folder exists.
root@38fa7aad0b43:/data/ocrd_workspace# ls -1
OCR-D-BINPAGE
OCR-D-CROP
OCR-D-DESKEW
OCR-D-IMG
OCR-D-IMG-BIN
OCR-D-IMG-CROP
OCR-D-IMG-DESKEW
mets.xml
I would expect that running ocrd-anybaseocr-tiseg without explicit -I and -O arguments would default to using OCR-D-IMG-CROP as input and OCR-D-SEG-TISEG as output. However, the program fails with the following error because it uses the non-existent file groups INPUT and OUTPUT instead.
root@38fa7aad0b43:/data/ocrd_workspace# ocrd-anybaseocr-tiseg -m mets.xml
Using TensorFlow backend.
09:22:34.382 INFO ocrd.workspace_validator - input_file_grp=['INPUT'] output_file_grp=['OUTPUT']
Traceback (most recent call last):
File "/usr/bin/ocrd-anybaseocr-tiseg", line 8, in <module>
sys.exit(ocrd_anybaseocr_tiseg())
File "/usr/lib/python3.6/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/usr/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/ocrd_anybaseocr/cli/cli.py", line 37, in ocrd_anybaseocr_tiseg
return ocrd_cli_wrap_processor(OcrdAnybaseocrTiseg, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/ocrd/decorators.py", line 53, in ocrd_cli_wrap_processor
raise Exception("Invalid input/output file grps:\n\t%s" % '\n\t'.join(report.errors))
Exception: Invalid input/output file grps:
Input fileGrp[@USE='INPUT'] not in METS!
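The last line of the error refers to the USE attribute of the fileGrp elements in mets.xml. To check which file groups a METS file actually declares, something like the following can help. It is a minimal sketch using only the Python standard library; the helper name list_file_groups is made up for illustration, and SAMPLE is a toy document, not the workspace above.

```python
import xml.etree.ElementTree as ET

# METS namespace used by mets:fileGrp elements
METS_NS = {"mets": "http://www.loc.gov/METS/"}


def list_file_groups(mets_xml):
    """Return the USE attribute of every mets:fileGrp in a METS XML string."""
    root = ET.fromstring(mets_xml)
    return [grp.get("USE") for grp in root.iterfind(".//mets:fileGrp", METS_NS)]


# Toy METS document for illustration only
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG"/>
    <mets:fileGrp USE="OCR-D-IMG-CROP"/>
  </mets:fileSec>
</mets:mets>"""

groups = list_file_groups(SAMPLE)
print(groups)             # file groups declared in the sample
print("INPUT" in groups)  # the fallback name is not among them
```

For a real workspace, the string could be read from mets.xml before calling the helper.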
From what I can tell, this is due to class OcrdAnybaseocrTiseg(Processor) not overriding input_file_grp and output_file_grp in __init__, along the lines of:
kwargs['input_file_grp'] = 'OCR-D-IMG-CROP'
kwargs['output_file_grp'] = 'OCR-D-SEG-TISEG'
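The idea could be sketched roughly as follows. The Processor class here is a hypothetical stand-in reduced to the two file group arguments, not the real ocrd base class, and TisegWithDefaultWiring is an illustrative name. The sketch uses setdefault rather than plain assignment so that explicitly passed file groups are not overwritten. Whether this would be enough in practice depends on how the CLI wrapper passes its own INPUT/OUTPUT defaults, which this sketch does not model.

```python
class Processor:
    """Hypothetical stand-in for the ocrd base class (assumption, simplified)."""

    def __init__(self, *, input_file_grp="INPUT", output_file_grp="OUTPUT"):
        self.input_file_grp = input_file_grp
        self.output_file_grp = output_file_grp


class TisegWithDefaultWiring(Processor):
    """Illustrative subclass, not the actual OcrdAnybaseocrTiseg."""

    def __init__(self, **kwargs):
        # Fill in the default wiring only where the caller gave no file group
        kwargs.setdefault("input_file_grp", "OCR-D-IMG-CROP")
        kwargs.setdefault("output_file_grp", "OCR-D-SEG-TISEG")
        super().__init__(**kwargs)


default = TisegWithDefaultWiring()
explicit = TisegWithDefaultWiring(input_file_grp="OCR-D-IMG")
print(default.input_file_grp, default.output_file_grp)
print(explicit.input_file_grp, explicit.output_file_grp)
```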
You are right, this should work as you expect. (At least as long as we keep describing it as default wiring.) But this has not been implemented yet in ocrd (the base package), cf. https://github.com/OCR-D/core/issues/274.
You have to call with explicit input and output file groups for now.
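For scripted runs, the explicit call could be assembled like this. The sketch only uses the -m, -I and -O options listed in the --help output above; the helper name tiseg_command is made up for illustration, and the file group names are the ones from the stated default wiring.

```python
import shlex


def tiseg_command(mets="mets.xml",
                  input_grp="OCR-D-IMG-CROP",
                  output_grp="OCR-D-SEG-TISEG"):
    """Build the argument list for ocrd-anybaseocr-tiseg with explicit file groups."""
    return ["ocrd-anybaseocr-tiseg", "-m", mets, "-I", input_grp, "-O", output_grp]


argv = tiseg_command()
# Print the equivalent shell command line
print(" ".join(shlex.quote(arg) for arg in argv))
# To execute it from the workspace directory, one could pass argv to
# subprocess.run(argv, check=True)
```

Keeping the arguments as a list avoids shell quoting issues if a file group name or METS path ever contains spaces.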