
SPDX canonicalization should use rdf-canon

Open VladimirAlexiev opened this issue 1 month ago • 3 comments

(Split from https://github.com/spdx/spdx-spec/issues/1279)

  • https://spdx.github.io/spdx-spec/v3.0.1/serializations/#canonical-serialization is too weak
  • https://spdx.github.io/canonical-serialisation/ expresses legitimate goals but is nearly empty, and I read under "meetings" that the group is currently inactive.

The problem of RDF and JSON-LD c14n is faced by other communities as well, most importantly Verifiable Credentials.

The following work should be reused:

https://w3c.github.io/rdf-canon/spec/

Its scope/abstract closely matches the SPDX c14n goals:

At times, it becomes necessary to compare the differences between sets of graphs, digitally sign them, or generate short identifiers for graphs via hashing algorithms. This document outlines an algorithm for normalizing RDF datasets such that these operations can be performed.

It addresses a difficult problem:

Most RDF datasets can be canonicalized fairly quickly, in terms of algorithmic time complexity. However, those that contain nodes that do not have globally unique identifiers pose a greater challenge. Normalizing these datasets presents the graph isomorphism problem
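To illustrate why nodes without globally unique identifiers are the hard part, here is a minimal sketch using only the Python standard library. It does not implement rdf-canon; the `naive_hash` helper and the `example.org` IRIs are hypothetical. It shows only that sorting statements handles ordering but not blank node labels, which is the gap a canonicalization algorithm has to close.

```python
import hashlib

def naive_hash(triples):
    """Hash sorted N-Triples-style lines WITHOUT relabeling blank nodes."""
    lines = sorted(f"{s} {p} {o} .\n" for s, p, o in triples)
    return hashlib.sha256("".join(lines).encode("utf-8")).hexdigest()

# The same graph twice; only the arbitrary blank node label differs.
graph_a = [
    ("_:b0", "<http://example.org/name>", '"pkg"'),
    ("<http://example.org/doc>", "<http://example.org/contains>", "_:b0"),
]
graph_b = [
    ("_:x", "<http://example.org/name>", '"pkg"'),
    ("<http://example.org/doc>", "<http://example.org/contains>", "_:x"),
]

# Sorting makes the hash independent of statement order...
order_independent = naive_hash(graph_a) == naive_hash(list(reversed(graph_a)))
# ...but the two isomorphic graphs still hash differently.
labels_break_hash = naive_hash(graph_a) != naive_hash(graph_b)
print(order_independent, labels_break_hash)
```

A canonicalization algorithm assigns blank node labels deterministically from the graph structure, so that both inputs would produce the same statements and therefore the same hash.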

JSON-LD mentions c14n a couple of times. JSON canonicalization is described under "Data Round Tripping" in the JSON-LD 1.1 API:

https://w3c.github.io/json-ld-api/#data-round-tripping

VladimirAlexiev, Nov 18 '25 04:11

spdx.github.io/spdx-spec/v3.0.1/serializations#canonical-serialization is too weak

Can you explain what you mean by "too weak"?

The other website was only active while the working group was active. The group produced the canonical serialization format that is included in the spec and then essentially disbanded, with nothing else left to do.

The RDF Canonicalization group's work happened long after the SPDX canonical serialization group had completed its task. And as you point out, it addresses a different problem: canonicalization of data. The SPDX working group's only goal was to have a canonical serialization (byte representation) of some data.

zvr, Nov 18 '25 09:11

@zvr The same RDF graph can be serialized in JSON-LD in many different ways. The choices include:

  • what objects to nest vs what to keep at top level (Framing)
  • how exactly to shorten terms (details of the Context)
  • order of nodes
  • representation of numbers. JSON does not define numbers strictly, so a round-tripped number has no guaranteed type (e.g. 1.000 can come out as an integer) or representation (e.g. 123.4567 vs 1.234567e2). Good advice is to emit even numbers as strings in JSON-LD, and to always give them a defined datatype (ideally in the context)
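The number problem in the last bullet can be sketched with Python's standard `json` module. This is only an illustration of one parser's behavior; other JSON implementations may differ, which is exactly the issue. The typed-literal dictionary uses the JSON-LD `@value`/`@type` keywords with the XSD decimal IRI as an example datatype.

```python
import json

# Two spellings of the same number parse to the same float, so the
# original lexical form cannot be recovered after a round trip.
a = json.loads('{"value": 123.4567}')
b = json.loads('{"value": 1.234567e2}')
same_value = a == b

# "1.000" comes back from Python as the float 1.0 and is re-emitted as
# "1.0"; a parser in another language may well produce the integer 1.
round_tripped = json.dumps(json.loads("1.000"))

# Emitting the number as a string with an explicit datatype keeps the
# lexical form intact across the round trip.
typed = {"@value": "1.000", "@type": "http://www.w3.org/2001/XMLSchema#decimal"}
lexical_preserved = json.loads(json.dumps(typed))["@value"] == "1.000"

print(same_value, round_tripped, lexical_preserved)
```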

If the goal is to have a single canonical serialization of each SPDX graph as a unique JSON-LD string, then I think the spec page doesn't say enough on the topic.

A lot of these are regulated by the JSON schema, but how do you ensure that an SPDX RDF graph when serialized will conform to the schema?

VladimirAlexiev, Nov 18 '25 18:11

Right, that's exactly what I was trying to convey. The whole point of the "SPDX canonical serialization" is to have a unique way of representing some data. This can be used to check whether two pieces of data are exactly the same.

We decided early on that determining whether two pieces of data are "equivalent" (or represent the same thing) is outside the scope of this work. You might say that we only care about "syntactic" identity and not any kind of "semantic" one. To be honest, the latter is too hard a problem, and it would eventually become unmanageable once every type had to be handled.
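The distinction can be sketched as follows. This is not the SPDX canonical serialization itself: `stand_in_canonical_bytes` is a hypothetical stand-in (sorted keys, no insignificant whitespace), and the element fields are made up for the example. It shows a byte-level canonical form making key order irrelevant, while two documents that a JSON-LD processor would treat as the same data still differ.

```python
import hashlib
import json

def stand_in_canonical_bytes(doc):
    """Hypothetical stand-in for a canonical form: sorted keys, no extra whitespace."""
    text = json.dumps(doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")

def digest(doc):
    return hashlib.sha256(stand_in_canonical_bytes(doc)).hexdigest()

# Same data, different key order: identical bytes after serialization.
doc1 = {"spdxId": "urn:example:pkg1", "type": "software_Package", "name": "demo"}
doc2 = {"name": "demo", "type": "software_Package", "spdxId": "urn:example:pkg1"}

# Same data from a JSON-LD point of view (a single value vs a one-element
# array), but a different byte sequence: "semantic" equivalence is not detected.
doc3 = {"spdxId": "urn:example:pkg1", "type": "software_Package", "name": ["demo"]}

syntactic_match = digest(doc1) == digest(doc2)
semantic_only_mismatch = digest(doc1) != digest(doc3)
print(syntactic_match, semantic_only_mismatch)
```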

zvr, Nov 18 '25 20:11