spec icon indicating copy to clipboard operation
spec copied to clipboard

ocrd-tool schema: be less restrictive on input/ouptut_filegrp

Open kba opened this issue 5 years ago • 3 comments

Designing the ocrd-sanitize processor makes it obvious that the input_file_grp/output_file_grp are not only obsolete but an anti-pattern:

  • The semantics are unintuitive or rather nonsensical, examples of how file groups MIGHT be named (but never are in practice) do not help
  • There are use cases where the input is not strictly mets:files from a mets:fileGrp. Since the core implementation depends on a single input file group (Processor.input_files property) we cannot just drop it though (hence the ["PLACEHOLDER"] default)
  • The naming restriction on top of input_file_grp/output_file_grp makes it worse

kba avatar Jul 25 '20 12:07 kba

Are you referring to OCR-D/core#274 by any chance?

That issue is obsolete, based on a confused understanding of the semantics of output_file_grp

I was mostly thinking of

    @property
    def input_files(self):
        """
        List the input files
        """
        return self.workspace.mets.find_files(fileGrp=self.input_file_grp, pageId=self.page_id)

in BaseProcessor. But thinking about it some more, if a processor does not set input_file_grp, it just should not use self.input_files. And the placeholder value in the ocrd-tool.json is unrelated to self.input_file_grp (the same mistaken interpretation as https://github.com/OCR-D/core/issues/274). So even the placeholders can go IMHO.

kba avatar Jul 27 '20 10:07 kba

It seems this went circle, but towards the general question of relaxing input/output fileGrp conventions: I would be fine with any form of relaxation that will still allows us to maintain direct ability to do a round trip DFG-Viewer-METS --> OCR-D -->DFG-Viewer-METS. E.g. this could be just one possible input setting/template for fileGrps?

cneud avatar Feb 04 '21 21:02 cneud

It seems this went circle, but towards the general question of relaxing input/output fileGrp conventions: I would be fine with any form of relaxation that will still allows us to maintain direct ability to do a round trip DFG-Viewer-METS --> OCR-D -->DFG-Viewer-METS. E.g. this could be just one possible input setting/template for fileGrps?

This PR is not about the relaxation of our fileGrp (METS) conventions, though. It is about the interpretation of the fileGrp section in the ocrd-tool.json, namely becoming a mere example of how the user could wire that processor in a workflow, and their defaults in case the developer did not specify any.

We've discussed templating fileGrp names for likely or simple workflows for a while, but agreed against this IIRC because this would almost always be impractical anyway, quickly clash, and raise more misconception and irritation on the user side than provide actual value.

But maybe I misunderstood what you meant by getting that roundtrip.

bertsky avatar Feb 05 '21 13:02 bertsky