schema icon indicating copy to clipboard operation
schema copied to clipboard

Update EDTF regex to be more lenient, or use custom format instead

Open bdarcus opened this issue 2 years ago • 24 comments

While working on #420, I realize the EDTF regex is too strict. In that case, I noticed it doesn't allow date-times, but I'm sure there are others, and my regex skills are too limited to feel comfortable tackling it on that PR.

See also https://github.com/citation-style-language/documentation/issues/145


See https://github.com/citation-style-language/schema/pull/420#issuecomment-1145082063

Regexes cannot realistically test for valid edtf dates across the range of possible dates/date ranges/date combos expressable in edtf, so it'd either be too lenient or too restrictive.

So really the question is what, more lenient, regex will work here?

bdarcus avatar Jun 02 '22 15:06 bdarcus

@retorquere - sorry, I should have moved the thread here earlier.

Personally, I'd add edtf validation as custom format attributes, and provide a reference implementation using ajv + EDTF.js. If I'm reading https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.section.7.2.3 right, processors that don't know about these custom properties would validate presence but not format of the property.

This makes sense probably. In any case, a regex doesn't seem practical.

Here's what it would like in the schema, except ajv will throw an error if it doesn't understand the format. So maybe best to just defer this.

    "edtf-string": {
      "title": "EDTF date string",
      "description": "CSL input supports EDTF, levels 0 and 1.",
      "type":"string",
      "format": "edtf/level_0+1"
    }

It's seems like this just shouldn't go in the schema, and validation can be recommended independently somehow.

Any suggestions @inukshuk?

bdarcus avatar Jun 03 '22 12:06 bdarcus

@retorquere - sorry, I should have moved the thread here earlier.

Personally, I'd add edtf validation as custom format attributes, and provide a reference implementation using ajv + EDTF.js. If I'm reading https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.section.7.2.3 right, processors that don't know about these custom properties would validate presence but not format of the property.

This makes sense probably. In any case, a regex doesn't seem practical.

Here's what it would like in the schema, except ajv will throw an error if it doesn't understand the format. So maybe best to just defer this.

Then it looks like it's out of spec:

6.5. Extending JSON Schema

Additional schema keywords and schema vocabularies MAY be defined by any entity. Save for explicit agreement, schema authors SHALL NOT expect these additional keywords and vocabularies to be supported by implementations that do not explicitly document such support. Implementations SHOULD ignore keywords they do not support.

retorquere avatar Jun 05 '22 07:06 retorquere

ajv is looks to be out of spec by default, but it can be configured to be compliant:

const Ajv = require("ajv")
const ajv = new Ajv({ strict: false})

const schema = {
  type: "object",
  properties: {
    foo: {type: "integer"},
    bar: {type: "string"},
    baz: {other: true}
  },
  required: ['baz', "foo"],
  additionalProperties: false
}

const data = {foo: 1, bar: "abc", baz: 5}
const valid = ajv.validate(schema, data)
if (!valid) console.log(ajv.errors)

so I'd say it'd be a good thing to ship a schema with this extension, certainly if we provide a reference implementation that does do validation.

retorquere avatar Jun 05 '22 07:06 retorquere

A problem I've noticed in my dabbling with the JSON schema ecosystem is it's not very consistent. So it's not always enough to do things according to spec.

Some questions we'll want to answer:

  1. Beyond ajv, what the other primary validation tools? How do they handle "format"?
  2. How do IDE's handle this?

On 1, at least https://www.jsonschemavalidator.net/ ignores the "format" property.

In any case, I'll leave this issue open and we can figure out WDT as we get closer to a release (including agreement among developers about 1.1 in general).

I agree, though, we ideally want this:

... it'd be a good thing to ship a schema with this extension, certainly if we provide a reference implementation that does do validation.

... which would also mean adding back that "format" property, once we settle what it's value should be, etc..

bdarcus avatar Jun 05 '22 11:06 bdarcus

for ajv, this would work. I could look at other json-schema implementations. This implementation could be shipped as a module that adds the validator to ajv by simply requiring the module, it's a common way to ship ajv extensions.

const Ajv = require("ajv")
const ajv = new Ajv({ strict: false})
const { parse } = require('edtf')

const schema = {
  type: "object",
  properties: {
    foo: {type: "integer"},
    bar: {type: "string"},
    baz: {edtf: [0, 1]}
  },
  required: ['baz', "foo"],
  additionalProperties: false
}

function edtfValidator(levels, date) {
  edtfValidator.errors = []

  if (typeof date !== 'string') {
    edtfValidator.errors.push('not a valid EDTF date')
    return false
  }

  try {
    const parsed = parse(date)

    if (!levels.includes(parsed.level)) {
      edtfValidator.errors.push(`expected EDTF level ${levels.map(l => `${l}`).join(',')}, got ${parsed.level}`)
      return false
    }
  }
  catch (err) {
    edtfValidator.errors.push('not a valid EDTF date')
    return false
  }

  return true
}
ajv.addKeyword({
  keyword: 'edtf',
  validate: edtfValidator,
})

const data = {foo: 1, bar: "abc", baz: '2016-XX'}
const valid = ajv.validate(schema, data)
if (!valid) console.log(ajv.errors)

retorquere avatar Jun 05 '22 17:06 retorquere

This implementation could be shipped as a module that adds the validator to ajv by simply requiring the module, it's a common way to ship ajv extensions.

What would that look like?

Care to submit a PR?

bdarcus avatar Jun 05 '22 18:06 bdarcus

#422

I was wrong on the way to require it in for ajv, but I've included tests which show how to apply the formats for 4 json-schema validators. There are a ton on npmjs, I can add more implementations, but these 4 seemed to be the fastest so they should be popular.

retorquere avatar Jun 05 '22 22:06 retorquere

alternately this could be spun off as it's own npm module, or offered to @inukshuk for inclusion in EDTF.js

retorquere avatar Jun 05 '22 22:06 retorquere

Many thanks for the PR!

alternately this could be spun off as it's own npm module, or offered to @inukshuk for inclusion in EDTF.js

Yes, I was wondering about this earlier. What do you say @inukshuk?

I definitely think it would be more widely useful beyond CSL.

bdarcus avatar Jun 05 '22 23:06 bdarcus

BTW, I just tested the "format" property with this supposedly fast and compliant go validator, and it correctly ignored it.

https://github.com/santhosh-tekuri/jsonschema

bdarcus avatar Jun 06 '22 00:06 bdarcus

This looks nice! Where you thinking of bundling a validate function with edtf.js or something beyond that?

As for the validation, the levels 0-2 are strictly in order; in the few cases where grammar rules of different levels can both produce a given string, the parser will always report the lower level. That is to say, I don't think it's necessary to define level ranges like 0+1, just specifying the highest level should work as well. Similarly, if you set the parser to a certain level, e.g. edtf.parse('...', { level: 1 }) it will also accept level 0 strings.

Finally, there was the old issue of season intervals. Those are currently not supported by the spec, but edtf.js supports them, specifically because @retorquere and others rightly argued that they are important for bibliographic use. Season intervals are defined at level 3 so we'd have to make up our mind how to handle those during validation.

inukshuk avatar Jun 06 '22 08:06 inukshuk

This looks nice! Where you thinking of bundling a validate function with edtf.js or something beyond that?

I think that would be the best option, really. There's currently 4 supported json-schema libraries that could mostly transfer to edtf.js as-is, and adding new ones looks to be trivial.

As for the validation, the levels 0-2 are strictly in order; in the few cases where grammar rules of different levels can both produce a given string, the parser will always report the lower level. That is to say, I don't think it's necessary to define level ranges like 0+1, just specifying the highest level should work as well. Similarly, if you set the parser to a certain level, e.g. edtf.parse('...', { level: 1 }) it will also accept level 0 strings.

I've updated the implementation to support edtf/[0,1,2,3] and edtf/[0,1]+3.

I'd be happy to submit a PR on EDTF.js. The test script (or something like it) on this repo could perhaps remain to show how to use it for this schema.

Finally, there was the old issue of season intervals. Those are currently not supported by the spec, but edtf.js supports them, specifically because @retorquere and others rightly argued that they are important for bibliographic use. Season intervals are defined at level 3 so we'd have to make up our mind how to handle those during validation.

It's for this reason I've included the +3 syntax. Parsing using {level: X} would be convenient, but given the description above, {level: 3} seems different from testing for the case of detected level [0,1,3].

retorquere avatar Jun 06 '22 09:06 retorquere

Oh, on re-reading your comment: something further, I was thinking to move the ready-to-run schema format functions now in the PR to EDTF.js

retorquere avatar Jun 06 '22 09:06 retorquere

Sounds good, let's do that!

inukshuk avatar Jun 06 '22 09:06 inukshuk

Is there an easy way to say "level 1 with season-range support, but not level 2"? Right now I'm walking the parse results to test for level detected, but I'm very likely re-doing work already done in EDTF.js.

retorquere avatar Jun 06 '22 09:06 retorquere

In terms of efficiency, that's not a problem. EDTF.js uses an Earley parser which can produce multiple valid derivations for the same string. So what the parse method currently does is pretty much the same thing as your approach: discarding derivations which don't fit the level requirement.

Since 'Level 3' doesn't actually exist I wonder if we shouldn't use a different syntax to say that we want to either include extensions or just season intervals in particular? I.e., the validation itself would stay the same for the time being, just that we use a different format to express 'edtf level 1 + season intervals'.

inukshuk avatar Jun 06 '22 11:06 inukshuk

Since 'Level 3' doesn't actually exist I wonder if we shouldn't use a different syntax to say that we want to either include extensions or just season intervals in particular? I.e., the validation itself would stay the same for the time being, just that we use a different format to express 'edtf level 1 + season intervals'.

Just "edtf/level-0+level-1+seasons"?

So +-delimited flags, which can either be full levels, or feature components of those levels?

PS - seasons are definitely important for this use case.

bdarcus avatar Jun 06 '22 11:06 bdarcus

Technically possible, but it'd mean implementing all format permutations. I reckon 0, 0+season-ranges, 1, 1+season-ranges, 2, 2+season-ranges would suffice for json schema formats, and for the parser, something like { level: 1, seasonRanges: true} should do.

retorquere avatar Jun 06 '22 12:06 retorquere

The EDTF spec leaves some room when describing compliance. E.g., "Level 1 is supported, and in addition the following features of level 2 are supported (list features)." so I think this is definitely preferable over using the non-existent level 3.

However, I'd definitely go with "edtf/level-1+seasons-intervals" because level 1 includes level 0 already, there's no reason to declare it explicitly.

Perhaps we can use something shorter than "season-intervals"; just "seasons" might be misleading, because I might want to specify something like "edtf/level-0+seasons" (since seasons is a level 1 feature).

inukshuk avatar Jun 06 '22 13:06 inukshuk

This way of encoding it in the format string is not the best way though. The implementations I know support type string+a fixed number of explicit formats, or just a new keyword, and then something like edtf: { level:1, seasonInterval: true} rather than a string to parse would be better. The benefit of type string + format is that parsers that don't know about the format will still check stringness and will not error out.

retorquere avatar Jun 06 '22 15:06 retorquere

Does that mean one can use an object for the format property?

bdarcus avatar Jun 06 '22 15:06 bdarcus

Oh yes, passing it as an extra option would be much preferable.

inukshuk avatar Jun 06 '22 15:06 inukshuk

I don't think string+format allows for format to be an object. We'd be looking at a custom keyword, which has varying support in validator libraries.

retorquere avatar Jun 06 '22 15:06 retorquere

https://github.com/retorquere/json-schema-edtf

retorquere avatar Jun 08 '22 21:06 retorquere