pattern validation needs to be added to runtime module
The ability to specify regular expressions was added in https://github.com/biolink/biolinkml/issues/127. This, however, only provides the information necessary to emit patterns as part of the JSON Schema. (Question: did this actually get added to the JSON Schema generator?) This leaves us with a couple of outstanding issues:
- Since we're not actually using the expressions, they don't currently have to conform to any syntax (!!!)
- Assuming that they are being emitted by the JSON Schema generator, we will need to decide whether the RE specification that LinkML conforms to is JSON Schema RE, Python RE, or something else. No matter what the decision is, we will need to create validators and conversion code between Python and JSON Schema REs.
- RE validation needs to be added to the linkml-runtime -- both in the generated python and in the loaders
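
To make the last two points concrete, here is a minimal sketch of what a runtime check might look like, assuming Python `re` syntax is the chosen grammar. The function names `compile_pattern` and `check_value` are hypothetical and are not part of linkml-runtime.

```python
import re
from typing import Optional


def compile_pattern(pattern: str) -> "re.Pattern[str]":
    """Compile a slot pattern; raise ValueError if it is not valid Python RE syntax."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"Invalid pattern {pattern!r}: {exc}") from exc


def check_value(value: str, pattern: Optional[str]) -> bool:
    """Return True if value satisfies pattern, or if the slot has no pattern."""
    if pattern is None:
        return True
    # Assumption: unanchored "search" semantics, which is how JSON Schema
    # treats "pattern". A different decision would use fullmatch() instead.
    return compile_pattern(pattern).search(value) is not None
```

Compiling the pattern once at schema load time would also cover the first bullet, since a syntactically invalid expression would be rejected before any data is validated.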
I thought this was in jsonschemagen, but I imagined it. It would be easy to add: JSON Schema supports the "pattern" keyword.
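
As a rough illustration of how small that addition could be, the sketch below copies a slot's pattern into a JSON Schema property. This is not the actual jsonschemagen code; `slot_to_json_schema` and the slot dictionary layout are assumptions made for the example.

```python
import json


def slot_to_json_schema(slot: dict) -> dict:
    """Build a JSON Schema property for a string slot (hypothetical helper)."""
    prop = {"type": "string"}
    # Only emit "pattern" when the slot actually declares one.
    if slot.get("pattern"):
        prop["pattern"] = slot["pattern"]
    return prop


schema = {
    "type": "object",
    "properties": {
        "phone": slot_to_json_schema({"name": "phone", "pattern": r"^\d{3}-\d{4}$"}),
    },
}
print(json.dumps(schema, indent=2))
```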
We also have to specify the regex grammar.
Our targets include
- python
- json schema
- OWL
JSON-Schema supports a subset of javascript/ECMA regexes, described here:
https://json-schema.org/understanding-json-schema/reference/regular_expressions.html#example
OWL uses xsd:patterns on datatype restrictions:
https://www.w3.org/TR/xmlschema11-2/#rf-pattern
Not sure about Python but there is a re-compatible library that supports POSIX: https://pypi.org/project/regex/
I think we should formally support the intersection of these different formalisms
I recollect doing some work in PyShEx to translate between whatever form of RE ShEx supports (I'll have to ping Eric P about what that is) and Python.
I'd propose that in the near(er) term we think about punting and state that the regex must be an intersection but not try to enforce it programmatically unless (until) problems arise
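
If enforcement is ever needed, a lint-style check could start very small. The sketch below flags a few constructs that Python accepts but that XSD patterns (and in some cases ECMAScript) do not. The list is a hypothetical, non-exhaustive starting point, and the detection is a text heuristic rather than a real regex parser, so it can produce false positives (for example on an escaped backslash followed by a digit).

```python
import re

# Hypothetical, non-exhaustive list of constructs outside the common subset.
SUSPECT_CONSTRUCTS = {
    "named group (?P<...>)": re.compile(r"\(\?P<"),
    "lookaround": re.compile(r"\(\?<?[=!]"),
    "backreference": re.compile(r"\\[1-9]"),
    "word boundary": re.compile(r"\\[bB]"),
    "inline flag": re.compile(r"\(\?[aiLmsux]+\)"),
}


def portability_warnings(pattern: str) -> list:
    """Return the names of suspect constructs found in pattern (heuristic)."""
    return [name for name, rx in SUSPECT_CONSTRUCTS.items() if rx.search(pattern)]
```

One caveat this does not catch: XSD patterns are implicitly anchored to the whole value, while JSON Schema and Python's `re.search` are not, so `^` and `$` would need their own decision.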
Note: @wdduncan is going to add this to jsonschemagen https://github.com/linkml/linkml/issues/193 so it would be good to decide
While working on LinkML DataHarmonizer integration, I am encountering many slot definition patterns in the LinkML MIxS specification (https://github.com/GenomicsStandardsConsortium/mixs-source/tree/main/model/schema) that seem to refer to some parsing system with a few tricks beyond regular expressions. Is a parser for them available in LinkML? Here is the rough list, which suggests that some kind of named pattern replacement is at work. Or is it more likely that the MIxS pattern specifications were written informally and must be revised to work as regular expressions?
- "{text}"
- "{text};{text}"
- "{rank name}:{text}"
- "{termLabel} {[termID]}" e.g. "Growth chamber [CO_715:0000189]"
- "{text}|{termLabel} {[termID]}"
- "{boolean};{timestamp}"
- "{boolean}"
- "{boolean};{boolean}"
- "{text};{integer}/[year|month|week|day|hour]"
- "{boolean};{text}"
- "{boolean};{float} {unit}"
- "{boolean};[adverse event|non-compliance|lost to follow up|other-specify]"
- "{duration}"
- "{float} - {float} {unit}"
- "{integer} - {integer} {unit}"
- "{float} {unit};{float} {unit}" e.g. "5 days;-20 degree Celsius"
- "{text};FWD:{dna};REV:{dna};initial denaturation:degrees_minutes;denaturation:degrees_minutes;annealing:degrees_minutes;elongation:degrees_minutes;final elongation:degrees_minutes; total cycles"
- "{text};{text};{timestamp}" e.g. "ALPHA 1427;Baker Hughes;2008-01-23"
- "{float} {unit};{Rn/start_time/end_time/duration};{duration}"
- "{text};{text};{timestamp}" e.g. "ACCENT 1125;DOW;2010-11-17"
- "{float} {unit};{Rn/start_time/end_time/duration}" e.g. "25 degree Celsius;R2/2018-05-11T14:30/2018-05-11T19:30/P1H30M"
- "{float} {unit};{Rn/start_time/end_time/duration}" e.g. "25 gram per cubic meter;R2/2018-05-11T14:30/2018-05-11T19:30/P1H30M"
- "{PMID}|{DOI}|{URL}|{text}"
- "{text}|{PMID}|{DOI}|{URL}" e.g. "http://himedialabs.com/TD/PT158.pdf"
- "{termLabel} {[termID]} or [husk|other artificial liquid medium|other artificial solid medium|peat moss|perlite|pumice|sand|soil|vermiculite|water]" e.g. "hydroponic plant culture media [EO:0007067]"
- "[clean catch|catheter]"
- "[horizontal:castrator|horizontal:directly transmitted|horizontal:micropredator|horizontal:parasitoid|horizontal:trophically transmitted|horizontal:vector transmitted|vertical]"
@cmungall @hsolbrig, Mark and I discussed this, and he made clear that the above MIxS validation expressions were not designed to be parsed or translated into regular expressions. He's making regular expression equivalents for at least some of the above expressions.
So that leaves one idea on the table: allow each LinkML schema to reference a dictionary of regex patterns by name, so that the patterns don't have to be hard-coded so directly. In other words, create a dictionary regex_library = { 'email':'^[^@\s]+@[^@\s.]+\.[^@.\s]+$', 'text': '\S+( \S+)*' , 'decimal': '' , ...}
And then allow named references in a pattern to be searched and replaced with regex content before the pattern is submitted to the regex engine for validation:
python:
'{email}'.format(**regex_library) which yields '^[^@\s]+@[^@\s.]+\.[^@.\s]+$'
javascript:
'{email}' is transformed to ${regex_library.email}, which yields the same result.
Any other regex content that doesn't match a dictionary item is quietly passed through to the regex engine as is.
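
A minimal sketch of that pass-through behaviour, under some assumptions: `expand_pattern` is a hypothetical name, the library holds only the two entries with known content, and the dot in the email pattern is assumed to be meant as an escaped literal. Plain `str.format` would not quite work here, because it raises on unknown names and on regex quantifiers such as `{3}`, so the sketch substitutes only `{name}` placeholders that are found in the library.

```python
import re

# Example library; the escaped dot in "email" is an assumption about the intent.
regex_library = {
    "email": r"^[^@\s]+@[^@\s.]+\.[^@.\s]+$",
    "text": r"\S+( \S+)*",
}

# Placeholders must start with a letter or underscore, so quantifiers
# like {3} or {2,5} are never treated as names.
_PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_pattern(pattern: str, library: dict) -> str:
    """Replace {name} with library[name]; leave everything else untouched."""
    def substitute(match):
        # Unknown names are passed through unchanged.
        return library.get(match.group(1), match.group(0))
    return _PLACEHOLDER.sub(substitute, pattern)
```

For example, `expand_pattern("{text};{text}", regex_library)` gives a regex that fully matches "ALPHA 1427;Baker Hughes". Library entries that carry their own `^` and `$` anchors, like the email one, would not compose well inside a larger pattern, so anchoring might be better applied after expansion.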
Thoughts?
Splitting this into multiple issues
- #1696
- #1695
- Structured patterns #176