✨ Proposal: `format` registry

Open gregsdennis opened this issue 1 year ago • 20 comments

Describe the inspiration for your proposal

Open API already has defined several formats that are arguably useful outside of the context of their spec, but I don't think we should add them to our spec.

Additionally, we receive a lot of requests to add new formats to the spec. A registry could be a good middle ground.

Describe the proposal

We should support a registry for different formats.

We don't have any precedent for creating a registry, so I'm open to ideas on exactly how to do that. I think a file in GH that ends up being pushed to the website, where it could then be referenced by the spec, would be an okay idea.

Describe alternatives you've considered

No response

Additional context

There is a concern about adding additional burden for implementations around having to support all of the formats. I'm open to ideas here as well.

Really just opening the floor for exploration.

gregsdennis avatar Oct 28 '24 04:10 gregsdennis

I like the idea, but my main concern is that if most of these formats are optional, very few will actually implement them, rendering them mostly unusable for any interoperable use case.

I would almost prefer a more limited set of formats defined by us and mandatory for everybody. I feel that forcing everybody to do the same thing, even if imperfect, can be better than suggesting that everybody do a lot of different things.

jviotti avatar Oct 28 '24 20:10 jviotti

Personally, I think a combination of the two.

I think we should enforce some "near-primitives" that make interoperability hard if we don't (specifically the date- and time-related formats: they are ubiquitous and notoriously hard to interoperate on without a solid spec).

Then there are the "arguables" that I think should be required because we use them in the metaschema - the URI/IRI family.

Then there are the things I think should be in a registry. (u)int32,64,128, uuid, email etc. etc.

mwadams avatar Oct 29 '24 08:10 mwadams

Open API already has a format registry. Ideally we should be involved with that registry, but I don't want to have a competing registry. In the past, they've expressed willingness to donate ownership of the registry to us if we want it. Co-ownership of the registry would be a good outcome as well.

jdesrosiers avatar Oct 31 '24 05:10 jdesrosiers

The discussions have been to transfer that registry here to be in a centralized location closer to the source. I should have been more explicit.

gregsdennis avatar Oct 31 '24 06:10 gregsdennis

Great. Then I'm 100% on board.

jdesrosiers avatar Oct 31 '24 18:10 jdesrosiers

100% in favour.

We could include data on what implementations (of JSON Schema or OpenAPI, etc) are known to implement the formats, as well as information about their intended use and their origin, when known/relevant.

For example, for the formats transferred from OpenAPI, there is a specific list of formats that are defined by their specification (see https://spec.openapis.org/oas/v3.1.1#data-type-format), so we can identify that document as the location of the canonical definitions for these formats.

I would also suggest we set out in advance what the criteria are for a format being added to this registry. I don't want people sending PRs for all the formats they can think of off the top of their heads; it should be formats that are already in widespread use, preferably by more than one application/implementation. We will also want explicit descriptions of the JSON data type(s) that the format applies to, references to any underlying standards (e.g. RFC documents), or other clearly-defined descriptions of the syntax.

We should also create a parallel set of directories containing tests -- whose syntax we can bikeshed, but something similar to our JSON Schema Test Suite as a starting point would be good (instead of "schema", just use the format name, along with a list of data/description/valid tuples). Any submission to the format registry should be accompanied by a decent corpus of passing and failing tests in this format.
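One possible shape for such a test file is sketched below in Python, loosely modeled on the JSON Schema Test Suite. The file layout, the `run_format_tests` helper, the sample cases, and the toy `is_uuid` checker are all hypothetical; none of this is an agreed design.

```python
import re

# Hypothetical test file for one format: the format name takes the place of
# "schema", followed by description/data/valid cases.
UUID_TESTS = {
    "format": "uuid",
    "tests": [
        {"description": "lowercase uuid",
         "data": "2eb8aa08-aa98-11ea-b4aa-73b441d16380", "valid": True},
        {"description": "missing one group",
         "data": "2eb8aa08-aa98-11ea-73b441d16380", "valid": False},
        {"description": "non-strings are ignored",
         "data": 12, "valid": True},
    ],
}

UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

def is_uuid(value):
    """Toy checker: non-strings pass, since the format only applies to strings."""
    if not isinstance(value, str):
        return True
    return UUID_RE.fullmatch(value) is not None

def run_format_tests(test_file, checker):
    """Return the descriptions of cases where the checker disagrees with 'valid'."""
    return [
        case["description"]
        for case in test_file["tests"]
        if checker(case["data"]) != case["valid"]
    ]

failures = run_format_tests(UUID_TESTS, is_uuid)
```

An implementation would plug its own checker into `run_format_tests` and treat a non-empty result as a failing run.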

karenetheridge avatar Nov 07 '24 22:11 karenetheridge

Then there are the things I think should be in a registry. (u)int32,64,128, uuid, email etc. etc.

I would not handle the int ones explicitly, because someone will always need something more and tools will have to adapt. In my opinion it would be better to allow specifying the total number of bits, mantissa bits and fraction bits, if one really needs this. For "float" types, there would be exponent bits instead of fraction bits, and perhaps additionally the base of the exponent, defaulting to 10. However, JSON transmits numbers as characters, and this is only about an internal representation. So a range plus multipleOf is already sufficient for integers. Only floating-point values would need resolution information.
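To illustrate the point about ranges, here is a minimal Python sketch that derives `minimum`/`maximum` constraints from a bit width. The `integer_schema` helper is hypothetical and covers only plain two's-complement integers, not the fixed-point or floating-point cases mentioned above.

```python
def integer_schema(bits, signed=True):
    """Build a JSON Schema fragment restricting integers to a given bit width."""
    if signed:
        minimum, maximum = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    else:
        minimum, maximum = 0, 2 ** bits - 1
    return {"type": "integer", "minimum": minimum, "maximum": maximum}

int32 = integer_schema(32)
uint64 = integer_schema(64, signed=False)
```

Under these assumptions, any width a tool needs can be expressed with existing keywords, without a dedicated format name per width.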

torsknod-the-caridian avatar Nov 08 '24 06:11 torsknod-the-caridian

@torsknod-the-caridian thanks for the thoughts. Can you edit that so it doesn't all look like a block-quote, please?

gregsdennis avatar Nov 08 '24 07:11 gregsdennis

Sounds like we're mostly in favor. Let's discuss what we need to get it done.

At a high level, I expect we'll need:

  • A location to publish the list of registered formats
    • I do like @karenetheridge's suggestions for meta-data for each entry (defining authority, etc)
  • A source file in this repo that generates the publication page
  • A section in the Validation spec defining the registry, its purpose, and implementation requirements

Anything else?

gregsdennis avatar Nov 08 '24 23:11 gregsdennis

What about this?

  • Testing and validation

abhayymishraa avatar Jan 05 '25 06:01 abhayymishraa

I'd like to move forward with this next. (Much of this is just reiterating @karenetheridge's comment.)

This is going to require multiple pieces all working together.

Registry

The data set we need to collect is:

  • the format "key", e.g. date-time
  • where it is defined (do we need to state the defining body, or will a link suffice?)
  • applicable JSON types (defined by the spec, but would be good to have here)
  • examples
  • implementations that support it

A simple approach wins out in my mind: a single JSON file that is an array of objects, where each object holds the above data.

[
  {
    "format": "date-time",
    "definition": "https://json-schema.org/specifications/core/1/2025#format-date-time",
    "types": ["string"],
    "examples": [],
    "supportedBy": []
  },
  // ...
]
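A file like this is easy to sanity-check mechanically. The following Python sketch assumes the five keys shown above are all required and that `types` may only contain the standard JSON Schema type names; the `check_registry` helper and those rules are assumptions, not part of the proposal.

```python
import json

REQUIRED_KEYS = {"format", "definition", "types", "examples", "supportedBy"}
JSON_TYPES = {"null", "boolean", "object", "array", "number", "integer", "string"}

def check_registry(text):
    """Return a list of problems found in a registry document (empty if none)."""
    problems = []
    seen = set()
    for index, entry in enumerate(json.loads(text)):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {index}: missing {sorted(missing)}")
        unknown = set(entry.get("types", [])) - JSON_TYPES
        if unknown:
            problems.append(f"entry {index}: unknown types {sorted(unknown)}")
        key = entry.get("format")
        if key in seen:
            problems.append(f"entry {index}: duplicate format {key!r}")
        seen.add(key)
    return problems

REGISTRY = """
[
  {
    "format": "date-time",
    "definition": "https://json-schema.org/specifications/core/1/2025#format-date-time",
    "types": ["string"],
    "examples": [],
    "supportedBy": []
  }
]
"""
problems = check_registry(REGISTRY)
```

Because the input must be strict JSON, a missing comma or a `// ...` placeholder would make `json.loads` raise before any of these checks run.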

Tests

I think the tests could just go into the Test Suite. I figure maybe a new formats folder with a file for each format. The file doesn't need to be the same format as the rest of the suite IMO.

Could use @Julian's opinion here.

Submitting a New Format

We should have an issue template in this repo that requires submitters to provide all the info we need.

  • the format key
  • where it's defined
  • summary of the format itself
  • examples of common usage (require 3?)
    • alternatively, the "where it's defined" needs to be in common usage, like OpenAPI
  • implementation support

I don't think we need to require supporting implementations for the format to be added to the list. Adding to the list should be the catalyst for getting support. (Seeking opinions here)

To ensure that we have test coverage, we could also have a Github action for new format PRs that checks to see if there's a test suite PR linked.
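The core of such a check could be as small as the Python sketch below. It assumes that a linked test PR shows up in the PR description either as a full URL or as `owner/repo#number` shorthand; the `linked_test_prs` helper is hypothetical, and how it would be wired into an actual workflow is left open.

```python
import re

# Assumption: the test PR is referenced by URL or by "owner/repo#123" shorthand.
TEST_SUITE_PR = re.compile(
    r"(?:https://github\.com/)?json-schema-org/JSON-Schema-Test-Suite(?:/pull/|#)([0-9]+)"
)

def linked_test_prs(pr_body):
    """Return the test suite PR numbers mentioned in a PR description."""
    return [int(number) for number in TEST_SUITE_PR.findall(pr_body)]

def has_linked_tests(pr_body):
    """True when the description links at least one test suite PR."""
    return bool(linked_test_prs(pr_body))
```

A workflow step would pass the PR description to `has_linked_tests` and fail the check when it returns False.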

gregsdennis avatar Jan 10 '25 21:01 gregsdennis

I think the tests could just go into the Test Suite. I figure maybe a new formats folder with a file for each format. The file doesn't need to be the same format as the rest of the suite IMO.

Seems reasonable to me too!

Julian avatar Jan 13 '25 13:01 Julian

In writing the description of the PR ☝️, I was wondering if we should require externally-defined formats to be registered with a namespace. For example, we could require that the OpenAPI int32 format be something like oas3.1.1:int32.

And writing that out just now I realize that would break OAS... forgive me, it's late here.

Anyway, the primary benefit I thought it would have is that it would allow multiple bodies to register similar formats under the same name or even for the same body to register multiple versions of the same format. Maybe this is more a question for the vocab proposal, whatever that ends up being. Maybe an evolution of the vocab proposal could allow for the namespace to be optional.
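To make the optional-namespace idea concrete, here is a small Python sketch. It assumes a colon separates the namespace from the name, as in the `oas3.1.1:int32` example; the `split_format` helper is hypothetical.

```python
def split_format(value):
    """Split 'namespace:name' into (namespace, name); namespace is None if absent."""
    namespace, separator, name = value.rpartition(":")
    return (namespace if separator else None, name)

namespaced = split_format("oas3.1.1:int32")
bare = split_format("int32")
```

With the namespace optional, an un-prefixed `int32` still parses, so existing schemas would keep working while two bodies could register the same name under different prefixes.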

gregsdennis avatar Jan 15 '25 09:01 gregsdennis

I've been made aware in the OpenAPI Slack that, although we have discussed transferring their registry here, that needs to be discussed with their TSC. I'll add it to their agenda, though I may not be able to join myself.

gregsdennis avatar Jan 18 '25 02:01 gregsdennis

Transferring the OpenAPI registry here was officially discussed ☝ and the decision was that they'll continue to host their registry, but they see no reason we can't also include their formats in ours. Eventually, they may see fit to deprecate theirs in favor of ours, but for now it'll live in both places.

I think this is fine. I'll update the PR.

gregsdennis avatar Jan 31 '25 07:01 gregsdennis

I have some questions and concerns, largely similar to @jviotti's original comment.

Is the format registry versioned? How does it relate to the JSON Schema spec version?

Say I have a validator that is 100% spec-compliant with JSON Schema version X, including supporting all formats defined in the registry (at the time the validator was written). And I also want to validate a value against a particular version X schema - that should work, right?

But what if the schema is using a format that was only recently defined in the format registry, and the validator hasn't been updated yet? Presumably the validator must reject the schema because it contains an unsupported format - but that seems like a major interoperability headache for users, because a fully compliant but slightly-outdated validator may reject a fully compliant schema. IMO it also kinda breaks the whole point of spec versioning.

And of course a similar issue exists around the fact that format registry support is optional. A validator supporting none of the formats from the registry is technically compliant, but will still reject many valid schemas. Is that something consumers will just have to deal with?

GREsau avatar May 27 '25 17:05 GREsau

As a separate matter, have you considered including a JSON schema in the registry for formats where possible?

e.g. the entry for char might look like

"char": {
  "description": "A single character",
  "types": ["string"],
  "examples": ["a"],
  "deprecated": false,
  "supportedBy": [],
  "schema": {
    "minLength": 1,
    "maxLength": 1
  }
},

The idea being:

  1. It would resolve any potential ambiguities in the spec wording
  2. Implementors could use that to support many new formats automatically - i.e. if they encounter a format that they don't explicitly support, but it does have a schema defined in the registry, then they can fall back to evaluating values against that defined schema. This could work either by fetching the format registry at runtime, or (probably more likely) have some sort of automated job that fetches the registry and imports/processes it. Either way, many new formats could be supported without having to implement code changes

I think this would need to be optional for formats though, since some (e.g. html) would be too complex to be accurately defined as a JSON Schema. Hence why I used the phrase "many new formats" and not "all new formats"!
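The fallback described in point 2 could look roughly like this Python sketch. The registry excerpt, the `check_format` function, and the toy evaluator (which understands only `minLength` and `maxLength`) are hypothetical; a real implementation would hand the schema to its own validator.

```python
# Hypothetical local copy of the registry, keyed by format name as above.
REGISTRY = {
    "char": {"types": ["string"], "schema": {"minLength": 1, "maxLength": 1}},
    "html": {"types": ["string"]},  # too complex for a schema, so none is given
}

NATIVE_CHECKERS = {}  # formats the implementation supports directly in code

def check_toy_schema(schema, value):
    """Toy evaluator: understands only minLength and maxLength on strings."""
    if not isinstance(value, str):
        return True
    if len(value) < schema.get("minLength", 0):
        return False
    if "maxLength" in schema and len(value) > schema["maxLength"]:
        return False
    return True

def check_format(name, value):
    """Return True or False, or None when the format cannot be evaluated."""
    if name in NATIVE_CHECKERS:
        return NATIVE_CHECKERS[name](value)
    fallback = REGISTRY.get(name, {}).get("schema")
    if fallback is not None:
        return check_toy_schema(fallback, value)
    return None
```

Returning None for formats with neither native support nor a schema keeps "unknown" distinct from "invalid".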

GREsau avatar May 27 '25 17:05 GREsau

The registry formats are not versioned within the registry, per se, because the registry doesn't define the formats. The registry is merely a central place for formats to be listed. They are defined by other specifications, which (one could assume) are versioned somehow. And the registry should point to those definitions.

The current requirement in our Validation spec is:

Implementations SHOULD support the formats listed in this registry as if they were defined by this document.

The "SHOULD" is a strong recommendation, but implementations could choose not to support them. It's expected that implementations that don't support the more commonly-used formats will receive pressure to do so from their users, or they won't be used.


Having a schema for formats is a really good idea, especially to enable automatic support. However, it's not going to work for all formats, such as email (there's not a definitive (and performant) regex that works 100%).

One thing to be careful of, though, is expecting implementations to make a network call to search the registry. That's not something that we want to promote. Instead they should support a mechanism that allows their user to register formats, and that could include a format schema.
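A user-facing registration mechanism of that kind might look like the following Python sketch. The `FormatRegistry` class and `UnknownFormat` exception are hypothetical names, and the checker here is a plain callable; accepting a format schema instead would be a variation on the same idea.

```python
class UnknownFormat(Exception):
    """Raised when no checker has been registered for a format."""

class FormatRegistry:
    """Holds user-registered format checkers; never touches the network."""

    def __init__(self):
        self._checkers = {}

    def register(self, name, checker):
        """Associate a format name with a callable returning True or False."""
        self._checkers[name] = checker

    def check(self, name, value):
        """Run the registered checker, or raise UnknownFormat if there is none."""
        try:
            checker = self._checkers[name]
        except KeyError:
            raise UnknownFormat(name) from None
        return checker(value)

formats = FormatRegistry()
# The user opts in to "char": any string of exactly one character is valid.
formats.register("char", lambda v: not isinstance(v, str) or len(v) == 1)
```

Raising on an unregistered format is one choice among several; an implementation could equally ignore or warn, depending on how it treats unknown formats.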

gregsdennis avatar May 27 '25 20:05 gregsdennis

But what if the schema is using a format that was only recently defined in the format registry, and the validator hasn't been updated yet?

I don't think the expectation is that all implementations will support every format. The OpenAPI registry already has around 50 entries and I wouldn't be surprised if that grows to hundreds over time. It's fairly unrealistic to expect implementations to support them all. I expect implementations will provide some subset and a plugin mechanism for whatever they don't support. Users will need to configure support for the formats they need.

but that seems like a major interoperability headache for users

Yep. format has always been an interoperability headache. The registry helps make sure a format in one schema means the same thing as a format in another schema, but it doesn't mean every validator will support it.

As a separate matter, have you considered including a JSON schema in the registry for formats where possible?

That's an interesting thought. As I thought about using the registry to pull schemas from, it occurred to me, isn't that just a different kind of referencing system specifically for formats? In that case, we might as well use our existing referencing system and publish a kind of standard library of schemas that people could use instead of using a format. The vast majority of the formats already listed can be validated with just a schema, so maybe we limit ourselves to only using format for things like dates that can't be validated with JSON Schema alone. That might even make the need for a registry so small, we might as well just put whatever we need directly in the spec.

Considering formats that can't be fully validated by a JSON Schema, I don't think I would encourage implementations to use a schema in the registry as a fallback, because you could only partially validate the value. You can validate that 2025-02-29 is structured as a valid date, but not that 2025 isn't a leap year and so shouldn't have a February 29. Allowing for partial validation creates more of an interoperability headache than just saying "I don't know" and requiring the implementation to be configured to support the format.
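The gap between structural and full validation is easy to show in Python. In this sketch the regular expression stands in for what a schema `pattern` could check, while the standard library's `date.fromisoformat` stands in for a real calendar check; the function names are made up for illustration.

```python
import re
from datetime import date

# Roughly what a schema "pattern" could express: four digits, two, two.
DATE_SHAPE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def looks_like_date(value):
    """Structural check only: the value has the YYYY-MM-DD shape."""
    return DATE_SHAPE.fullmatch(value) is not None

def is_real_date(value):
    """Full check: the shape is right and the day exists in the calendar."""
    if not looks_like_date(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True

partial = looks_like_date("2025-02-29")
full = is_real_date("2025-02-29")
```

Here `partial` is True while `full` is False, which is exactly the disagreement two validators would have if one used only the schema fallback.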

jdesrosiers avatar May 28 '25 20:05 jdesrosiers

As I thought about using the registry to pull schemas from, it occurred to me, isn't that just a different kind of referencing system specifically for formats? In that case, we might as well use our existing referencing system and publish a kind of standard library of schemas that people could use instead of using a format.

Sure, we could publish a directory of definitions, e.g.:

$id: https://json-schema.org/definitions/uint64
$schema: ...
description: ...
anyOf:
  - type: integer
    minimum: 0
    maximum: 18446744073709551615
  - type: string
    # full regex omitted for readability
    pattern: '^([1-9][0-9]{1,18}|1([0-7][0-9]{18}|....])$'

..and then, of course, someone just does: "$ref": "https://json-schema.org/definitions/uint64" in their schema.

karenetheridge avatar Jun 08 '25 23:06 karenetheridge