
Serializing JSON or YAML literal in YAML-LD

Open gkellogg opened this issue 3 years ago • 25 comments

The YAML examples in the JSON-LD 1.1 spec (e.g., https://github.com/w3c/json-ld-syntax/blob/main/yaml/JSON-Literal-compacted.yaml) do not preserve the JSON serialization of a JSON literal.

Example 062: JSON Literal-compacted
---
"@context":
  "@version": 1.1
  e:
    "@id": http://example.com/vocab/json
    "@type": "@json"
e:
- 56.0
- d: true
  '10': 
  '1': []

It should instead be the following:

Example 062: JSON Literal-compacted
---
"@context":
  "@version": 1.1
  e:
    "@id": http://example.com/vocab/json
    "@type": "@json"
e: [56.0,{"d":true,"10":null,"1":[]}]

But a simple YAML.dump of the parsed JSON does not take this into consideration. The spec should describe the requirements for serializing JSON literals in YAML-LD.

gkellogg avatar Jul 02 '22 15:07 gkellogg

One stated goal is to be able to use something like YAML.dump of the parsed JSON/YAML, which will likely not allow defining how data is serialized in these cases. This should probably be at most a SHOULD requirement, and maybe best left to an extended profile. Implementing it requires tagging the object that is the root of the JSON Literal and writing a custom emitter to serialize it as JSON. That is a significantly more involved serialization strategy, particularly given the need to interpret the in-scope local context to know whether a map entry value should be treated as a JSON Literal.

The YAML examples cited above are generated essentially by YAML.dump(JSON.load(src)), where there is no notion of a local context.

gkellogg avatar Jul 02 '22 22:07 gkellogg

It seems to me that the two YAML snippets above serialize to the same JSON (and this is confirmed by a quick test on https://www.convertjson.com/yaml-to-json.htm), so I don't understand where the issue is. :thinking:
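
That check can also be reproduced locally. A minimal sketch in Ruby, assuming only the standard-library `yaml` (Psych) and `json` modules; the variable names are illustrative and not taken from any spec:

```ruby
require 'json'
require 'yaml'

# Block-style YAML, as a generic dumper would emit it.
block_yaml = <<~'DOC'
  e:
  - 56.0
  - d: true
    '10':
    '1': []
DOC

# Flow-style YAML that keeps the JSON serialization of the literal.
flow_yaml = <<~'DOC'
  e: [56.0,{"d":true,"10":null,"1":[]}]
DOC

block_data = YAML.safe_load(block_yaml)
flow_data  = YAML.safe_load(flow_yaml)

# Both surface forms load to the same in-memory structure.
puts block_data == flow_data
puts JSON.generate(flow_data)
```

Both forms load to the same structure, so the difference between them is purely one of surface syntax.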

pchampin avatar Jul 03 '22 15:07 pchampin

It’s probably a more philosophical question: Must a JSON Literal necessarily have the form of JSON?

gkellogg avatar Jul 03 '22 16:07 gkellogg

It's also a pragmatic question:

  1. When converting to RDF, a @json literal should be treated as opaque and left alone, see https://w3c.github.io/json-ld-syntax/#the-rdf-json-datatype. I have more examples of such needs:
    • GraphDB-Lucene connectors are defined using JSON, eg see https://graphdb.ontotext.com/documentation/10.0/lucene-graphdb-connector.html#using-the-create-command (we're not necessarily proud of this, eg you can't use prefixes when defining RDF prop paths to be indexed, which sucks)
    • GeoSPARQL asWKT and asGML are opaque strings (with appropriate datatype)
  2. What should a reader expect when seeing @type:@json or "..."^^rdf:JSON? If they expect JSON but find YAML, they may be unable to process it.
  3. I think we also need to declare @type:@yaml and "..."^^rdf:YAML

VladimirAlexiev avatar Jul 04 '22 08:07 VladimirAlexiev

I only just learned about JSON Literals... I think it is a very complex feature if you see it as a literal, because even JSON parsers will not treat it as you might expect.

For example, a JSON Literal with duplicate keys will not be treated as literal by generic JSON parsers:

{
  "@context": {
    "@version": 1.1,
    "e": {
      "@id": "http://example.com/vocab/json",
      "@type": "@json"
    }
  },
  "e": {
    "a": "ciao",
    "a": 1
  }
}

will result in an entry with the last (or the first, it's actually implementation dependent) removed. How does JSON-LD handle these cases?

{
  "@context": { ... },
  "e": {
    "a": 1
  }
}
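
The effect can be observed, and the duplicates detected, before the map is built. A rough sketch in Ruby, assuming the JSON text is read with the standard-library YAML parser (Psych), whose node tree still holds both entries; `duplicate_keys` is a hypothetical helper written for this illustration:

```ruby
require 'yaml'

# Collect mapping keys that occur more than once by walking Psych's node
# tree, i.e. the representation that exists before a Hash is constructed.
def duplicate_keys(node, found = [])
  if node.is_a?(Psych::Nodes::Mapping)
    names = node.children.each_slice(2).map { |key, _value| key }
                .grep(Psych::Nodes::Scalar).map(&:value)
    found.concat(names.select { |name| names.count(name) > 1 }.uniq)
  end
  # Scalars may carry no children list, hence Array(...).
  Array(node.children).each { |child| duplicate_keys(child, found) }
  found
end

src = '{"e": {"a": "ciao", "a": 1}}'

loaded = YAML.safe_load(src)               # only one "a" entry survives
dupes  = duplicate_keys(YAML.parse(src))   # the tree still shows both

puts loaded.inspect
puts dupes.inspect
```

Which of the two values survives in `loaded` depends on the parser, which is exactly the interoperability concern raised here.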

ioggstream avatar Jul 04 '22 10:07 ioggstream

@ioggstream https://w3c.github.io/json-ld-syntax/#the-rdf-json-datatype says "The lexical space is the set of UNICODE strings which conform to the JSON Grammar". Hopefully that includes only valid JSON representations, i.e. no duplicate keys.

This is not an optional feature. It's part of the JSON-LD spec, so it must be supported in YAML-LD.

I provided a real-world use case for it: GraphDB connectors for Lucene, SOLR, Elastic (https://graphdb.ontotext.com/documentation/10.0/connectors.html#full-text-search-and-aggregation-connectors)

VladimirAlexiev avatar Jul 04 '22 14:07 VladimirAlexiev

The JSON-LD Literal definition is written to allow a variation in representation. The JCS C14N considerations only come into play when describing the representation within RDF Triples. Similar to rdf:XMLLiteral, its original intent is to allow for some portion of an XML document to be referenced as a literal across different encodings (also rdf:RDFA).

The JSON-LD spec says non-normatively that values of @json (or properties with "@type": "@json" ) are treated as JSON Literals. IMO, YAML-LD is free to innovate here. As there is a simple transformation from any YAML to JSON, a value of @json could still have a more general YAML format, as long as the result can be transformed into the value space (involving JCS). That said, a SHOULD statement on using the JSON sub-set of YAML seems reasonable, and allows for implementations that cannot reasonably conform to this.

gkellogg avatar Jul 04 '22 17:07 gkellogg

@VladimirAlexiev said:

  1. When converting to RDF, a @json literal should be treated as opaque and left alone, see https://w3c.github.io/json-ld-syntax/#the-rdf-json-datatype. I have more examples of such needs:

When converting to RDF triples, a given serialization may have different ways of representing that. The JSON-LD from RDF algorithm describes the mechanism to use when transforming a triple containing an RDF Literal into JSON-LD.

  2. What should a reader expect when seeing @type:@json or "..."^^rdf:JSON. If they expect JSON but find YAML, they may be unable to process it.

Two different things. A JSON-LD processor may see JSON-LD with an explicit value of type rdf:JSON, where the value is a JCS encoded string, which would not automatically be turned into the internal @json value object representation.

  3. I think we also need to declare @type:@yaml and "..."^^rdf:YAML

I think we need to demonstrate a need here. The rdf:JSON literal was not established lightly. What evidence is there for the use of YAML literals in the wild?

gkellogg avatar Jul 04 '22 17:07 gkellogg

@VladimirAlexiev Afaik JSON grammar allows duplicate keys. You need JCS to forbid duplicate keys

@gkellogg

A SHOULD statement on using the JSON sub-set of YAML seems reasonable, and allows for implementations that cannot reasonably conform to this.

What do you mean by "JSON subset"? If you mean something like the "internal representation", then it's feasible. Otherwise I think that we can only check that the representation graph maps to the expected JSON literal when serialised in JSON.

ioggstream avatar Jul 04 '22 20:07 ioggstream

@VladimirAlexiev Afaik JSON grammar allows duplicate keys.

No, I believe this has been addressed by RFC8259:

The names within an object SHOULD be unique.

Not a MUST, but that is because of concerns over backwards compatibility. The interpretation when duplicate keys are present is unspecified, as different implementations do different things.

Also JCS / RFC8785 prohibits objects from having duplicate keys:

JSON objects MUST NOT exhibit duplicate property names.

gkellogg avatar Jul 04 '22 20:07 gkellogg

@gkellogg

treated as JSON Literals ...

Does JSON-LD use JCS or JSON? What happens in the case of the JSON literal I wrote above ? https://github.com/json-ld/yaml-ld/issues/36#issuecomment-1173637884

ioggstream avatar Jul 04 '22 23:07 ioggstream

With regard to JSON Literals, the spec uses JCS. IIRC, the spec is silent on duplicate keys and, as noted in the RFCs, implementations may have different behaviors. This is at least a SHOULD. But, for the specific case of JSON Literals, duplicate keys would violate the requirements of JCS.

gkellogg avatar Jul 05 '22 05:07 gkellogg

What do you mean by "JSON subset"? If you mean something like the "internal representation", then it's feasible. Otherwise I think that we can only check that the representation graph maps to the expected JSON literal when serialised in JSON.

What I meant by "JSON subset" is the subset of YAML which is, effectively JSON. I.e., the arrays, objects and native values that both YAML and JSON share. Perhaps there is another term for this.

The JSON-LD Internal Representation of a JSON Object is, however, an Infra map, which is defined specifically to have unique key/value pairs. All JSON-LD algorithms operate by transforming the JSON surface syntax into the internal representation, which will end up eliminating duplicate keys, in any case.

gkellogg avatar Jul 05 '22 06:07 gkellogg

JSON Literals, the spec uses JCS

iiuc:

"JSON subset" is the subset of YAML which is, effectively JSON .. Infra map ...

Infra map: ordered sequence of key/value pairs. Keys are unique. Keys are strings.

YAML: unordered sequence of key/value pairs. Keys are unique. Keys can be arbitrary nodes.
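
To make the key difference concrete, here is a small sketch in Ruby, assuming the standard-library Psych loader. The sample document and variable names are invented for illustration:

```ruby
require 'yaml'

# A YAML mapping whose keys are an integer, a boolean, and a sequence.
yaml_text = <<~'DOC'
  1: one
  true: flag
  ? [a, b]
  : pair
DOC

doc = YAML.safe_load(yaml_text)

# None of these keys is a string, so the mapping is not a JSON object as is.
key_classes = doc.keys.map(&:class)

# An Infra-style map would need every key turned into a string first.
stringified = doc.transform_keys(&:to_s)

puts key_classes.inspect
puts stringified.keys.inspect
```

Any YAML-to-JSON conversion has to decide what to do with such keys, since a string-keyed map cannot hold them unchanged.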

About ordering

JSON libraries do not usually preserve ordering. I suspect that it is in general not a problem since iiuc

  1. a JSON-LD parser receiving a JSON Literal will c14n it and sort JSON objects keywords
  2. @type: @json stores the JSON-LD Internal Representation and not the verbatim JSON text

About YAML-LD

If JSON Literals are about the internal representation (the serialization always happens via JCS), then I think we do not need a @type: @yaml, because the data model is always the JSON one and serialization happens via JCS.

We only need @yaml if we decide to extend the JSON-LD data model.

WDYT?

ioggstream avatar Jul 05 '22 08:07 ioggstream

This issue was discussed on the Aug 03 meeting.

gkellogg avatar Aug 03 '22 17:08 gkellogg

@ioggstream -- Please edit your https://github.com/json-ld/yaml-ld/issues/36#issuecomment-1174751556 and wrap code fences (either single or triple backticks) around all @terms that aren't meant to link to GitHub users (e.g., `@yaml`, `@type`, `@JSON`), because the users behind those handles probably aren't interested in our discussions and don't need alerts on every comment made here...

TallTed avatar Aug 05 '22 14:08 TallTed

@ioggstream -- Please edit your #36 (comment) and wrap code fences (either single or triple backticks) around all @terms that aren't meant to link to GitHub users (e.g., `@yaml`, `@type`, `@JSON`), because the users behind those handles probably aren't interested in our discussions and don't need alerts on every comment made here...

I took care of it.

gkellogg avatar Aug 05 '22 17:08 gkellogg

I propose closing this saying that YAML-LD has no specific encoding requirements for @json value objects as long as round-tripping YAML to JSON reproduces an equivalent structure.
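
One way to read "round-tripping reproduces an equivalent structure" is sketched below in Ruby, using only the standard-library `yaml` and `json` modules. `json_round_trips?` is a hypothetical helper name, and this is an interpretation of the proposal, not spec text:

```ruby
require 'json'
require 'yaml'

# True when the YAML text loads to a structure that survives a trip
# through JSON unchanged. Aliases are expanded on load.
def json_round_trips?(yaml_text)
  data = YAML.safe_load(yaml_text, aliases: true)
  JSON.parse(JSON.generate(data)) == data
rescue JSON::GeneratorError
  # Values JSON cannot express (e.g. NaN) do not round-trip.
  false
end

block_style = "e:\n- 56.0\n- d: true\n  '10':\n  '1': []\n"
flow_style  = "e: [56.0,{\"d\":true,\"10\":null,\"1\":[]}]\n"
with_alias  = "base: &b {x: 1}\ncopy: *b\n"
int_key     = "1: one\n"

puts json_round_trips?(block_style)
puts json_round_trips?(flow_style)
puts json_round_trips?(with_alias)
puts json_round_trips?(int_key)
```

Under this reading, block style, flow style, and anchors all pass, while a mapping with a non-string key does not.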

gkellogg avatar Aug 06 '22 20:08 gkellogg

@gkellogg can you please check if this way of using @json in YAML is consistent with the above words?

https://github.com/ioggstream/draft-polli-restapi-ld-keywords/pull/3/files

ioggstream avatar Aug 09 '22 14:08 ioggstream

@gkellogg can you please check if this way of using @json in YAML is consistent with the above words?

https://github.com/ioggstream/draft-polli-restapi-ld-keywords/pull/3/files

Yes, that seems reasonable.

gkellogg avatar Aug 10 '22 00:08 gkellogg

@gkellogg

I think we also need to declare @type:@yaml and "..."^^rdf:YAML

I think we need to demonstrate a need here. The rdf:JSON literal was not established lightly. What evidence is there for the use of YAML literals in the wild?

Uh, wouldn't YAML-LD provide thousands of such examples?

I think we need to consider JSON and YAML literals completely independently of whether or not they have any relation to LD (just like rdf:XMLLiteral is not RDF XML).

  • @json should be true JSON, not YAML that is compatible with JSON
  • So we need @yaml to be able to capture YAML literals that are not JSON (eg use block style), or not even compatible with JSON (eg use anchors & refs)
  • I think that in defining a YAML-based format, it will be a gaping omission not to allow transport of YAML literals!

Let me try to adapt our first example https://graphdb.ontotext.com/documentation/10.0/lucene-graphdb-connector.html#using-the-create-command from Turtle+JSON to YAML-LD+YAML:

'@context': 
  luc: http://www.ontotext.com/connectors/lucene#
  luc-index: http://www.ontotext.com/connectors/lucene/instance#
  ex: http://www.ontotext.com/example/wine#
  rdfs: http://www.w3.org/2000/01/rdf-schema#
luc-index:my_index:
  luc:createConnector: !yaml
    types: [ex:Wine]
    fields:
      - fieldName: grape
        propertyChain: [ex:madeFromGrape, rdfs:label]
      - fieldName: sugar
        propertyChain: [ex:hasSugar]
        analyzed: false
        multivalued: false
      - fieldName: year
        propertyChain: [ex:hasYear]
        analyzed: false

I think you'll agree that's much nicer than the original.

So it's not a question of whether we need it, but how exactly to handle it:

  • !yaml means "don't try to convert the rest to RDF, leave it as YAML"
  • But it better still be parsed as YAML (eg to check the syntax)
  • How about the indentation? And can I put the !yaml just there?
  • I've used CURIEs (prefixes) (not being able to use them is a bad thing in our current implementation), but if that YAML is interpreted as a string, they won't be enacted :-(
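
On the question of whether the tag can sit there: in Ruby's Psych, at least, a local tag placed before an indented block mapping parses and stays visible on the node tree. A small sketch with a simplified, made-up document (the key names are placeholders, and other YAML libraries may behave differently):

```ruby
require 'yaml'

src = <<~'DOC'
  connector: !yaml
    types: [Wine]
    analyzed: false
DOC

# YAML.parse returns the node tree instead of plain Ruby objects,
# so tags are still attached to the nodes they annotate.
document = YAML.parse(src)
_key, value = document.root.children

tag = value.tag
puts tag
puts value.class
```

A processor could use that tag to decide that the tagged subtree is a literal rather than YAML-LD to be expanded; how the literal would then be stored is the open question.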

Note: if we change our connector implementation to use RDF instead of JSON and add a bit to the context, this becomes straight YAML-LD (notice !yaml is removed but the payload after @context is the same):

'@context': 
  luc: http://www.ontotext.com/connectors/lucene#
  luc-index: http://www.ontotext.com/connectors/lucene/instance#
  ex: http://www.ontotext.com/example/wine#
  rdfs: http://www.w3.org/2000/01/rdf-schema#
  xsd: http://www.w3.org/2001/XMLSchema#
  fieldName: {'@id': luc:fieldName}
  types: {'@id': luc:types, '@type': '@id', '@container': '@list'}
  fields: {'@id': luc:fields, '@type': '@id', '@container': '@list'}
  propertyChain: {'@id': luc:propertyChain, '@type': '@id', '@container': '@list'}
  analyzed: {'@id': luc:analyzed, '@type': xsd:boolean}
  multivalued: {'@id': luc:multivalued, '@type': xsd:boolean}
luc-index:my_index:
  luc:createConnector: 
    types: [ex:Wine]
    fields:
      - fieldName: grape
        propertyChain: [ex:madeFromGrape, rdfs:label]
      - fieldName: sugar
        propertyChain: [ex:hasSugar]
        analyzed: false
        multivalued: false
      - fieldName: year
        propertyChain: [ex:hasYear]
        analyzed: false

This YAML-LD will be converted to the following turtle:

luc-index:my_index
  luc:createConnector [
    luc:types (ex:Wine);
    luc:fields (
      [luc:fieldName "grape";
        luc:propertyChain (ex:madeFromGrape rdfs:label)]
      [luc:fieldName "sugar";
        luc:propertyChain (ex:hasSugar);
        luc:analyzed false;
        luc:multivalued false]
      [luc:fieldName "year";
        luc:propertyChain (ex:hasYear);
        luc:analyzed false])] .

VladimirAlexiev avatar Sep 19 '22 16:09 VladimirAlexiev

@VladimirAlexiev (or @gkellogg) -- Please edit https://github.com/json-ld/yaml-ld/issues/36#issuecomment-1251223864 and put codefences around the @type:@yaml in the opening quoted block. They don't need pinging about our conversation.

TallTed avatar Sep 21 '22 04:09 TallTed

Done.

gkellogg avatar Sep 21 '22 04:09 gkellogg

This was discussed on [2022-09-28](https://json-ld.org/minutes/2022-09-28/#31)
Vladimir Alexiev: I gave an example from Elastic Search. This connector can be used in indexing.
... Fields have types and other attributes.
... Currently, we implement this in JSON. There's a SPARQL INSERT involved.
... We've wanted to turn that into a better notation, as you can't use prefixes.
... We're thinking of converting it to proper RDF; the question is how to write it.
... If we allow JSON and YAML literals, it would help with the interpretation of that data.
... If JSON was done because it was popular, it makes sense that you be able to store YAML as a literal.
... A good example is GeoJSON. In JSON-LD 1.1, it can be interpreted.
... But, it comes out as a nested list of lists.
... There are textual formats for GeoJSON.
... I think we should have a YAML literal.
Gregg Kellogg: There's the JCS spec to canonize a JSON literal. We don't have such a thing for YAML
... the value of canonization is that then you can compare literals for equality, so that value equality will coincide with lexical equality
Vladimir Alexiev: Ok, I see but 1. RDF doesn't even canonize simple things like xsd:boolean, numbers (123 vs 0123), and even URLs
... 2. We could tackle YAML canonization, in fact I'd like to have that (and standardize pretty-printing parameters, and the ability to capture them in YAML-LD)
Gregg Kellogg: Sorry, out of time for today. We can continue on next call. Please send in discussion topics for next meeting agenda.
Created https://github.com/json-ld/json-ld.org/issues/797 -> action 797 create a repo for NDJSON-LD [1] (on ) due 5 Oct 2022

gkellogg avatar Sep 30 '22 17:09 gkellogg

From the RDF Semantics

A datatype is understood to define a partial mapping, called the lexical-to-value mapping, from a lexical space (a set of character strings) to values. The function L2V maps datatypes to their lexical-to-value mapping. A literal with datatype d denotes the value obtained by applying this mapping to the character string sss: L2V(d)(sss). If the literal string is not in the lexical space, so that the lexical-to-value mapping gives no value for the literal string, then the literal has no referent. The value space of a datatype is the range of the lexical-to-value mapping. Every literal with that type either refers to a value in the value space of the type, or fails to refer at all. An ill-typed literal is one whose datatype IRI is recognized, but whose character string is assigned no value by the lexical-to-value mapping for that datatype.

The JSON-LD 1.1 Spec defines this for the rdf:JSON literal with a lexical space composed of UNICODE strings conforming to the JSON Grammar and a value space with specific serialization requirements so that two JSON literals can be expressed, say, using different whitespace, but be considered value-equivalent through mapping to the value space via JCS.
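
The idea of comparing literals in the value space can be illustrated with a deliberately simplified stand-in for JCS. The Ruby sketch below only sorts object keys and emits compact JSON. It does not implement RFC 8785's number serialization or its UTF-16 key ordering, and the helper names are hypothetical:

```ruby
require 'json'

# Recursively rebuild hashes with their keys in sorted order.
def sort_keys(value)
  case value
  when Hash
    value.keys.sort.each_with_object({}) { |k, out| out[k] = sort_keys(value[k]) }
  when Array
    value.map { |v| sort_keys(v) }
  else
    value
  end
end

# Simplified canonical form: parse, sort keys, emit compact JSON.
def canonical_json(text)
  JSON.generate(sort_keys(JSON.parse(text)))
end

a = '{"d": true, "10": null, "1": []}'
b = "{\n  \"1\": [],\n  \"10\": null,\n  \"d\": true\n}"

puts canonical_json(a)
puts canonical_json(a) == canonical_json(b)
```

The two lexical forms differ in whitespace and key order but map to the same canonical string, which is what lets them be treated as the same value.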

For a hypothetical YAML datatype, the lexical space would clearly be the set of all UNICODE strings which conform to the YAML Grammar, but finding the value space is more difficult, as multiple YAML serializations may be considered to represent the same value. I think a necessary pre-condition for establishing a YAML datatype would be to identify a normative specification for obtaining the canonical form of a YAML document/stream.

gkellogg avatar Sep 30 '22 17:09 gkellogg