
ocrd-anybaseocr-tiseg not applying default wiring

Open sepastian opened this issue 5 years ago • 1 comments

The --help of ocrd-anybaseocr-tiseg states a default wiring of ['OCR-D-IMG-CROP'] -> ['OCR-D-SEG-TISEG'].

root@38fa7aad0b43:/data/ocrd_workspace# ocrd-anybaseocr-tiseg --help
Using TensorFlow backend.

Usage: ocrd-anybaseocr-tiseg [OPTIONS]
  
  separate text and non-text part with anyBaseOCR

Options:
  -V, --version                   Show version
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  -J, --dump-json                 Dump tool description as JSON and exit
  -p, --parameter TEXT            Parameters, either JSON string or path
                                  JSON file
  -g, --page-id TEXT              ID(s) of the pages to process
  -O, --output-file-grp TEXT      File group(s) used as output.
  -I, --input-file-grp TEXT       File group(s) used as input.
  -w, --working-dir TEXT          Working Directory
  -m, --mets TEXT                 METS to process
  -h, --help                      This help message

Parameters:
  "operation_level" [string - page] PAGE XML hierarchy level to operate
      on Possible values: ["page", "region", "line"]

Default Wiring:
  ['OCR-D-IMG-CROP'] -> ['OCR-D-SEG-TISEG']

The workspace contains a file group named OCR-D-IMG-CROP, and a corresponding folder exists.

root@38fa7aad0b43:/data/ocrd_workspace# ls -1
OCR-D-BINPAGE
OCR-D-CROP
OCR-D-DESKEW
OCR-D-IMG
OCR-D-IMG-BIN
OCR-D-IMG-CROP
OCR-D-IMG-DESKEW
mets.xml

I would expect that running ocrd-anybaseocr-tiseg without the -I and -O options would default to using OCR-D-IMG-CROP as input and OCR-D-SEG-TISEG as output. However, the program fails with the following error because it uses the non-existent file groups INPUT and OUTPUT as input and output.

root@38fa7aad0b43:/data/ocrd_workspace# ocrd-anybaseocr-tiseg -m mets.xml 
Using TensorFlow backend.
09:22:34.382 INFO ocrd.workspace_validator - input_file_grp=['INPUT'] output_file_grp=['OUTPUT']
Traceback (most recent call last):
  File "/usr/bin/ocrd-anybaseocr-tiseg", line 8, in <module>
    sys.exit(ocrd_anybaseocr_tiseg())
  File "/usr/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd_anybaseocr/cli/cli.py", line 37, in ocrd_anybaseocr_tiseg
    return ocrd_cli_wrap_processor(OcrdAnybaseocrTiseg, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd/decorators.py", line 53, in ocrd_cli_wrap_processor
    raise Exception("Invalid input/output file grps:\n\t%s" % '\n\t'.join(report.errors))
Exception: Invalid input/output file grps:
        Input fileGrp[@USE='INPUT'] not in METS!

From what I can tell, this is due to class OcrdAnybaseocrTiseg(Processor) not overriding input_file_grp and output_file_grp in __init__, along the lines of:

kwargs['input_file_grp'] = 'OCR-D-IMG-CROP'
kwargs['output_file_grp'] = 'OCR-D-SEG-TISEG'
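
To make the idea concrete, here is a minimal sketch. It uses a stand-in Processor class, not the real ocrd base class, whose constructor signature is assumed here rather than verified. The keyword names input_file_grp and output_file_grp are taken from the log line above, and the placeholder values INPUT and OUTPUT are treated as "not set". The traceback shows the exception being raised inside ocrd_cli_wrap_processor, so whether a constructor override alone would take effect is not established here.

```python
class Processor:
    """Stand-in for the ocrd base class (hypothetical, heavily simplified)."""

    def __init__(self, *, input_file_grp="INPUT", output_file_grp="OUTPUT"):
        self.input_file_grp = input_file_grp
        self.output_file_grp = output_file_grp


class OcrdAnybaseocrTiseg(Processor):
    # Default wiring as printed by --help
    DEFAULT_INPUT = "OCR-D-IMG-CROP"
    DEFAULT_OUTPUT = "OCR-D-SEG-TISEG"

    def __init__(self, **kwargs):
        # Assumption: a missing value or the placeholder means "use the default wiring"
        if kwargs.get("input_file_grp", "INPUT") == "INPUT":
            kwargs["input_file_grp"] = self.DEFAULT_INPUT
        if kwargs.get("output_file_grp", "OUTPUT") == "OUTPUT":
            kwargs["output_file_grp"] = self.DEFAULT_OUTPUT
        super().__init__(**kwargs)


if __name__ == "__main__":
    processor = OcrdAnybaseocrTiseg()
    print(processor.input_file_grp, "->", processor.output_file_grp)
```

Explicitly passed file groups are left untouched in this sketch, so only the placeholder case changes behavior.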

sepastian avatar Mar 13 '20 09:03 sepastian

You are right, this should work as you expect. (At least as long as we keep describing it as default wiring.) But this has not been implemented yet in ocrd (the base package), cf. https://github.com/OCR-D/core/issues/274.

You have to call with explicit input and output file groups for now.
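
As a sketch of that workaround, the small helper below builds the command line with the file groups spelled out. The options -m, -I and -O are the ones listed in the --help output above. The function name tiseg_command and its defaults are only illustrative.

```python
def tiseg_command(mets="mets.xml",
                  input_grp="OCR-D-IMG-CROP",
                  output_grp="OCR-D-SEG-TISEG"):
    """Build the argv for ocrd-anybaseocr-tiseg with explicit file groups.

    tiseg_command is a hypothetical helper, not part of ocrd_anybaseocr.
    """
    return ["ocrd-anybaseocr-tiseg", "-m", mets, "-I", input_grp, "-O", output_grp]


if __name__ == "__main__":
    # Print the command so it can be copied into a shell
    print(" ".join(tiseg_command()))
```

The returned list can be handed to subprocess.run from inside the workspace directory, or the printed line can be pasted into the shell as is.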

bertsky avatar Apr 06 '20 10:04 bertsky