
pattern validation needs to be added to runtime module

Open hsolbrig opened this issue 4 years ago • 5 comments

The ability to specify regular expressions was added in https://github.com/biolink/biolinkml/issues/127. That change, however, only provides the information necessary to emit patterns as part of the JSON Schema. (Question: did this bit actually get added to the JSON Schema generator?) This leaves us with a few outstanding issues:

  1. Since we're not actually using the expressions, they don't currently have to conform to any syntax (!!!)
  2. Assuming that they are being emitted by the JSON Schema generator, we will need to decide whether LinkML conforms to the JSON Schema RE specification, to Python RE, or to something else. Whatever the decision, we will need to create validators and conversion code between Python and JSON Schema REs.
  3. RE validation needs to be added to the linkml-runtime, both in the generated Python and in the loaders
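As a rough illustration of points 1 and 3, here is a minimal sketch that assumes Python's `re` is the reference engine (which is exactly the open decision in point 2). The function names `check_pattern_syntax` and `value_matches` are hypothetical and are not part of linkml-runtime.

```python
import re
from typing import Optional


def check_pattern_syntax(pattern: str) -> Optional[str]:
    """Return None if `pattern` compiles under Python's `re`, else the error text."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


def value_matches(pattern: str, value: str) -> bool:
    """Unanchored search, which is how JSON Schema's `pattern` keyword behaves.

    Whether LinkML should use search or full-match semantics is an assumption here.
    """
    return re.search(pattern, value) is not None
```

A schema loader could call `check_pattern_syntax` once per slot when the schema is read, and the generated classes or data loaders could call `value_matches` per value.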

hsolbrig avatar Apr 23 '21 16:04 hsolbrig

I thought this was in the jsonschemagen but I imagined it. Would be easy to add. json schema supports the "pattern" keyword

We also have to specify the regex grammar.

Our targets include

  • python
  • json schema
  • OWL

JSON-Schema supports a subset of javascript/ECMA regexes, described here:

https://json-schema.org/understanding-json-schema/reference/regular_expressions.html#example

OWL uses xsd:patterns on datatype restrictions:

https://www.w3.org/TR/xmlschema11-2/#rf-pattern

Not sure about Python but there is a re-compatible library that supports POSIX: https://pypi.org/project/regex/

I think we should formally support the intersection of these different formalisms
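One way to approximate "the intersection" without writing a full grammar is a lint that flags constructs known to be unsupported by at least one target. The sketch below is a heuristic under stated assumptions: the list is deliberately incomplete, it can report false positives on escaped characters, and the name `non_portable_constructs` is hypothetical.

```python
import re
from typing import List

# Heuristic, non-exhaustive checks for constructs that at least one of
# Python `re`, ECMA-style regexes, or xsd:pattern does not accept.
NON_PORTABLE = {
    "lookaround": re.compile(r"\(\?<?[=!]"),          # (?=...) (?!...) (?<=...) (?<!...)
    "python named group": re.compile(r"\(\?P[<=]"),   # (?P<name>...) and (?P=name)
    "backreference": re.compile(r"\\[1-9]"),          # \1 .. \9
    "inline flag": re.compile(r"\(\?[aiLmsux]+[):-]"),  # (?i) or (?i:...)
}


def non_portable_constructs(pattern: str) -> List[str]:
    """Return the names of flagged constructs found in `pattern`, in table order."""
    return [name for name, rx in NON_PORTABLE.items() if rx.search(pattern)]
```

Anchoring is a separate difference this lint does not cover: JSON Schema patterns are unanchored, whereas an xsd:pattern must match the whole value.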

cmungall avatar Apr 23 '21 17:04 cmungall

I recollect doing some work in PyShEx to translate between whatever form of RE ShEx supports (I'll have to ping Eric P about what that is) and Python.

I'd propose that in the near(er) term we think about punting and state that the regex must be an intersection but not try to enforce it programmatically unless (until) problems arise

hsolbrig avatar Apr 23 '21 17:04 hsolbrig

Note: @wdduncan is going to add this to jsonschemagen https://github.com/linkml/linkml/issues/193 so it would be good to decide

cmungall avatar Jun 18 '21 19:06 cmungall

While working on LinkML DataHarmonizer integration, I found that the LinkML MIxS specification (https://github.com/GenomicsStandardsConsortium/mixs-source/tree/main/model/schema ) contains a lot of slot definition patterns that seem to refer to some parsing system with a few tricks beyond regular expressions. Is a parser for them available via the LinkML system? The rough list below suggests that some kind of named pattern replacement is at work. Or is it more likely that the MIxS pattern specifications are simply written informally and must be revised to work as regular expressions?

"{text}"
"{text};{text}"
"{rank name}:{text}"
"{termLabel} {[termID]}" e.g. "Growth chamber [CO_715:0000189]"
"{text}|{termLabel} {[termID]}"
"{boolean};{timestamp}"
"{boolean}"
"{boolean};{boolean}"
"{text};{integer}/[year|month|week|day|hour]"
"{boolean};{text}"
"{boolean};{float} {unit}"
"{boolean};[adverse event|non-compliance|lost to follow up|other-specify]"
"{duration}"
"{float} - {float} {unit}"
"{integer} - {integer} {unit}"
"{float} {unit};{float} {unit}" e.g. "5 days;-20 degree Celsius"
"{text};FWD:{dna};REV:{dna};initial denaturation:degrees_minutes;denaturation:degrees_minutes;annealing:degrees_minutes;elongation:degrees_minutes;final elongation:degrees_minutes; total cycles"
"{text};{text};{timestamp}" e.g. "ALPHA 1427;Baker Hughes;2008-01-23"
"{float} {unit};{Rn/start_time/end_time/duration};{duration}"
"{text};{text};{timestamp}" e.g. "ACCENT 1125;DOW;2010-11-17"

"{float} {unit};{Rn/start_time/end_time/duration}" e.g. "25 degree Celsius;R2/2018-05-11T14:30/2018-05-11T19:30/P1H30M"
"{float} {unit};{Rn/start_time/end_time/duration}" e.g. "25 gram per cubic meter;R2/2018-05-11T14:30/2018-05-11T19:30/P1H30M"

"{PMID}|{DOI}|{URL}|{text}"
"{text}|{PMID}|{DOI}|{URL}" e.g. "http://himedialabs.com/TD/PT158.pdf"

"{termLabel} {[termID]} or [husk|other artificial liquid medium|other artificial solid medium|peat moss|perlite|pumice|sand|soil|vermiculite|water]" e.g. "hydroponic plant culture media [EO:0007067]"

"[clean catch|catheter]"
"[horizontal:castrator|horizontal:directly transmitted|horizontal:micropredator|horizontal:parasitoid|horizontal:trophically transmitted|horizontal:vector transmitted|vertical]"

ddooley avatar Dec 05 '21 07:12 ddooley

@cmungall @hsolbrig , Mark and I discussed this and he made clear that the above MIxS validation expressions were not designed for any kind of parsing or translation into regular expressions. He's making regular expression equivalents for at least some of the above expressions.

So that leaves one idea on the table: allow each LinkML schema to reference a dictionary of regex patterns by name, so that the patterns don't have to be hard-coded so directly. In other words, create a dictionary such as regex_library = { 'email': '^[^@\s]+@[^@\s.]+.[^@.\s]+$', 'text': '\S+( \S+)*', 'decimal': '', ...}

Then allow each pattern to be searched and replaced for regex content before it is submitted to the regex engine for validation:

python: '{email}'.format(**regex_library), which yields '^[^@\s]+@[^@\s.]+.[^@.\s]+$'
javascript: '{email}' transformed to ${regex_library.email}, which yields the same.

Any other regex content that doesn't match a dictionary item is quietly passed through to regex engine as is.
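A minimal sketch of that substitution with pass-through follows. It swaps `str.format` for `re.sub`, because `str.format` raises an exception on unknown names and on ordinary regex quantifiers such as `\d{3}`. The two library entries are copied from the comment above; `REGEX_LIBRARY` and `expand_pattern` are hypothetical names, not an existing LinkML API.

```python
import re

# Entries copied verbatim from the proposal; the library itself is hypothetical.
REGEX_LIBRARY = {
    "email": r"^[^@\s]+@[^@\s.]+.[^@.\s]+$",
    "text": r"\S+( \S+)*",
}

# A placeholder must start with a letter or underscore, so quantifiers
# like {3} or {2,3} are never mistaken for a named pattern.
_PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_ ]*)\}")


def expand_pattern(pattern: str, library: dict = REGEX_LIBRARY) -> str:
    """Replace {name} with its library regex; leave anything unknown as is."""

    def substitute(match: "re.Match") -> str:
        name = match.group(1)
        # Unknown names are passed through untouched.
        return library.get(name, match.group(0))

    return _PLACEHOLDER.sub(substitute, pattern)
```

For example, `expand_pattern("{text};{text}")` produces a regex that accepts a value like "ALPHA 1427;Baker Hughes". Library entries that carry their own `^` and `$` anchors, like `email` above, would not compose well inside a larger pattern, so a real design would probably want to store fragments unanchored.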

Thoughts?

ddooley avatar Dec 08 '21 17:12 ddooley

Splitting this into multiple issues

  • #1696
  • #1695
  • Structured patterns #176

cmungall avatar Nov 06 '23 15:11 cmungall