schema_salad icon indicating copy to clipboard operation
schema_salad copied to clipboard

metaschema predicates pollute salad document namespace

Open mr-c opened this issue 3 years ago • 8 comments

Alternative title: CWL enum symbols get fully expanded instead of being left as pure strings

Source issue: https://github.com/common-workflow-lab/cwljava/issues/70

For example, CommandInputEnumSchema https://github.com/common-workflow-language/cwl-v1.2/blob/a0f2d38e37ff51721fdeaf993bb2ab474b17246b/CommandLineTool.yml#L355 extends InputEnumSchema https://github.com/common-workflow-language/cwl-v1.2/blob/a0f2d38e37ff51721fdeaf993bb2ab474b17246b/Process.yml#L689 which extends sld:EnumSchemahttps://github.com/common-workflow-language/cwl-v1.2/blob/a0f2d38e37ff51721fdeaf993bb2ab474b17246b/salad/schema_salad/metaschema/metaschema_base.yml#L135 which defines the symbols field as follows:

      type: string[]
      jsonldPredicate:
        _id: "sld:symbols"
        _type: "@id"
        identity: true
      doc: "Defines the set of valid sy

The _type: "@id" and identity: true might be fine for the metaschema of Schema Salad, but is inappropriate for a CWL Enum definition, where we expect plain strings.

cwltool ignores this, but codegen readers of CWL (cwljava, cwl-utils) cannot.

(all of this is also true for the Output and non-Command versions of EnumSchema in CWL)

I can't find a way to override this with a specializeFrom/To nor with overlapping field definitions, nor with re-rooting the CWL schema to not derive from sld:EnumSchema as the metaschema predicates collide with the CWL schema predicates

schema_salad.exceptions.SchemaException: Predicate collision on symbols, '{'@id': 'https://w3id.org/cwl/salad#symbols', '@type': '@id', 'identity': True}' != 'https://w3id.org/cwl/cwl#InputEnumSchema/symbols'

(even if I use jsonldPredicate: { _id: "sld:symbols" } ):

schema_salad.exceptions.SchemaException: Predicate collision on symbols, '{'@id': 'https://w3id.org/cwl/salad#symbols', '@type': '@id', 'identity': True}' != '{'@id': 'https://w3id.org/cwl/salad#symbols'}'

Fix options:

  1. Hack the schema (fixes the codegen, but introduces big problems down the line): Re-root the CWL schema to not extend sld:EnumSchema and rename symbols from the metaschema to be something else (and update all other schema salad documents to use this new name in their definitions)
  2. Hack the codegen (manually change the type of the symbols field post generation): would unblock https://github.com/common-workflow-lab/cwljava/issues/70 but it not sustainable
  3. Scope the predicates situation so that schema-salad documents themselves can use any predicate name from the metaschema, and then apply fix 1 without renaming the existing use of symbols in the metaschema. I like this, but I don't know how to implement it yet
  4. Do nothing. Consumers of the CWL codegen libraries will be forced to subtract out the ID of the parent object to get the true values of CWL type: enums.

mr-c avatar Jan 14 '22 11:01 mr-c

Another option that might work: remove https://github.com/common-workflow-language/cwl-v1.2/blob/a0f2d38e37ff51721fdeaf993bb2ab474b17246b/Process.yml#L15 and copy in the salad metaschema parts we need, making the one change to the symbols definition

mr-c avatar Jan 14 '22 13:01 mr-c

The TypeScript version has the same issue: https://runkit.com/zimmera/61e180e8b299c2000851e088 (output expandable at bottom)

ZimmerA avatar Jan 14 '22 13:01 ZimmerA

@ZimmerA Thanks. Yeah, it is a schema issue; see https://github.com/common-workflow-language/cwl-v1.2/pull/145 for my proposed fix

mr-c avatar Jan 14 '22 15:01 mr-c

Enums are not intended to be pure strings, they're intended to be qualified URIs, but part of the vocabulary so you're able to use the shortened version. It sounds like it is behaving as designed, but the design is surprising.

tetron avatar Jan 14 '22 15:01 tetron

Enums are not intended to be pure strings, they're intended to be qualified URIs, but part of the vocabulary so you're able to use the shortened version. It sounds like it is behaving as designed, but the design is surprising.

I agree for Schema Salad, but not for CWL.

CWL uses enums to create command line interfaces

https://www.commonwl.org/v1.2/CommandLineTool.html#CommandLineBinding doesn't say that we only use the short versions for the command line (but in fact cwltool does so)

So I think in practice the concepts are different.

mr-c avatar Jan 14 '22 15:01 mr-c

@tetron https://github.com/common-workflow-language/cwltool/pull/1591 shows that there is no cwltool unit test nor CWL conformance test that relies on enum symbols being qualified URIs; therefore I argue that they are factually not so and the schema proposal in https://github.com/common-workflow-language/cwl-v1.2/pull/145 should be accepted for CWL v1.2.1 to confirm that.

mr-c avatar Jan 15 '22 14:01 mr-c

Related: https://github.com/common-workflow-language/schema_salad/issues/132

mr-c avatar Jan 19 '22 11:01 mr-c

Ok, I refreshed my recollection of what cwltool does.

cwltool has been operating under the premise that symbols are expanded to URIs for a long time, e.g.

$ cwltool --print-pre blah.cwl
INFO /home/peter/work/cwltool/venv3/bin/cwltool 3.1.20220103201022
INFO Resolved 'blah.cwl' to 'file:///home/peter/work/tmp/enum/blah.cwl'
{
    "arguments": [
        {
            "prefix": "species",
            "valueFrom": "$(inputs.first.species)"
        },
        {
            "prefix": "ncbi_build",
            "valueFrom": "$(inputs.first.ncbi_build)"
        }
    ],
    "baseCommand": "echo",
    "class": "CommandLineTool",
    "cwlVersion": "v1.2",
    "id": "file:///home/peter/work/tmp/enum/blah.cwl",
    "inputs": [
        {
            "id": "file:///home/peter/work/tmp/enum/blah.cwl#first",
            "type": "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params"
        }
    ],
    "outputs": [
        {
            "id": "file:///home/peter/work/tmp/enum/blah.cwl#result",
            "outputBinding": {
                "glob": "8e100d950d274f5732d81f3ac0bf3712b420fa07"
            },
            "type": "File"
        }
    ],
    "requirements": [
        {
            "class": "SchemaDefRequirement",
            "types": [
                {
                    "fields": [
                        {
                            "name": "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/ncbi_build",
                            "type": [
                                "null",
                                {
                                    "symbols": [
                                        "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/ncbi_build/GRCh37",
                                        "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/ncbi_build/GRCh38",
                                        "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/ncbi_build/GRCm38"
                                    ],
                                    "type": "enum"
                                }
                            ]
                        },
                        {
                            "name": "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/species",
                            "type": [
                                "null",
                                {
                                    "symbols": [
                                        "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/species/homo_sapiens",
                                        "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/species/mus_musculus"
                                    ],
                                    "type": "enum"
                                }
                            ]
                        }
                    ],
                    "name": "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params",
                    "type": "record"
                }
            ]
        }
    ],
    "stdout": "8e100d950d274f5732d81f3ac0bf3712b420fa07"
}

What happens is that for the purposes of validation, it gets the short name of the URI. How to get the short name is described in the schema salad spec:

https://www.commonwl.org/v1.2/SchemaSalad.html#Short_names

cwl-utils / cwljava should provide a conforming "shortname" function if it doesn't already.

The python_codegen itself also assumes enum symbols are URIs, and calls safe_name (which computes the short name).

                return self.declare_type(
                    TypeDef(
                        self.safe_name(type_declaration["name"]) + "Loader",
                        '_EnumLoader(("{}",))'.format(
                            '", "'.join(
                                self.safe_name(sym)
                                for sym in type_declaration["symbols"]
                            )
                        ),
                    )
                )

From my perspective, the correct thing to do is 4

Do nothing. Consumers of the CWL codegen libraries will be forced to subtract out the ID of the parent object to get the true values of CWL type: enums.

I feel that the proposed solution (making enum symbols unparsed strings) is effectively redefining the semantics of the enum data model which is inappropriate for a 1.2.1 version. It would be better to ensure existing validation and Command Line behavior is properly specified (i.e. derive the short name from the URI and check the string against that) for the purposes of checking the source document.

I agree that https://www.commonwl.org/v1.2/CommandLineTool.html#CommandLineBinding doesn't explicitly describe behavior for enums. However, in practice the current behavior is that it accepts short name as a string, and then the command line behavior is identical as for a string.

tetron avatar Jan 19 '22 21:01 tetron