Implement RDF Dataset Canonicalization (RDFC-1.0) with Canonical N-Quads Output

Open kishorebanala opened this issue 3 months ago • 3 comments

Version

jena-5.5.0

Feature

Description:

Implement the W3C RDF Dataset Canonicalization (RDFC-1.0) algorithm in Apache Jena, with output in canonical N-Quads format. This enables deterministic serialization of RDF datasets by assigning canonical identifiers to blank nodes.

References:

  • RDFC-1.0 Algorithm: https://www.w3.org/TR/rdf-canon/
  • Canonical N-Quads Format: https://www.w3.org/TR/rdf12-n-quads/#canonical-quads
  • W3C Test Suite: https://github.com/w3c/rdf-tests/tree/main/rdf/rdf12/rdf-n-quads/c14n and https://w3c.github.io/rdf-canon/tests/

Tasks:

  • [ ] Create NQuadsCanonicalWriter class extending WriterDatasetRIOTBase
  • [ ] Add NQUADS_CANONICAL format constant to RDFFormat
  • [ ] Register canonical writer factory in RDFWriterRegistry
  • [ ] Implement RDFC10Canonicalizer with complete RDFC-1.0 algorithm
    • [ ] Create HashUtils for SHA-256 hash computations and lexicographic sorting
    • [ ] Implement CanonicalIssuer for _c14n_N blank node identifier assignment
    • [ ] Add DatasetProcessor for blank node extraction and dataset processing
  • [ ] Download and integrate W3C canonicalization test suite to jena-arq/testing/rdf12-wg/rdf-n-quads-c14n/
  • [ ] Update Scripts_RIOT_c14n.java test factory following existing RIOT patterns
  • [ ] Implement RDFCanonicalizationTest for algorithm validation leveraging https://w3c.github.io/rdf-canon/tests/
  • [ ] Add writeCanonical() and canonicalizeDataset() methods to RDFDataMgr
  • [ ] Add --canonical flag support to riot command line tool
  • [ ] Update documentation and create usage examples
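
A minimal sketch of what the HashUtils and CanonicalIssuer tasks above might involve, using only the JDK. The class name `CanonicalIssuerSketch` and its methods are hypothetical and are not Jena API; the identifier prefix is passed in as a parameter because the exact label form is still to be decided (the RDFC-1.0 specification's examples use labels such as `_:c14n0`).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch only: an identifier issuer plus a SHA-256 helper. Not Jena API. */
final class CanonicalIssuerSketch {
    private final String prefix;
    private int counter = 0;
    // LinkedHashMap keeps the order in which identifiers were issued.
    private final Map<String, String> issued = new LinkedHashMap<>();

    CanonicalIssuerSketch(String prefix) {
        this.prefix = prefix;
    }

    /** Returns the identifier already issued for this blank node label, or issues the next one. */
    String issue(String existingLabel) {
        return issued.computeIfAbsent(existingLabel, k -> prefix + counter++);
    }

    /** True if an identifier has already been issued for this blank node label. */
    boolean hasIssued(String existingLabel) {
        return issued.containsKey(existingLabel);
    }

    /** Lowercase hexadecimal SHA-256 digest of the UTF-8 bytes of the input. */
    static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

The issuer is idempotent per label: asking twice for the same blank node returns the same identifier, which is the property the relabelling step relies on.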

Are you interested in contributing a solution yourself?

Yes

kishorebanala avatar Sep 22 '25 05:09 kishorebanala

Looks good!

As part of wider maintenance on the tests, there is a current copy of rdf-tests including the C14N tests for N-Quads.

Scripts_c14n is ready ... just @disabled at the moment.

""" Add writeCanonical() and canonicalizeDataset() methods to RDFDataMgr """

Are the RDFDataMgr changes needed? (RDFDataMgr predates RDFWriterBuilder, so nowadays it is a slightly higher-level, legacy view of the world.)

writeCanonical can be done with RDFWriter.format(...) or one of the existing RDFDataMgr.write(..., RDFFormat) operations.

Have a new class CanonicalizeDataset with a .write (which might be the only public operation). Somewhere to put the javadoc!

afs avatar Sep 28 '25 16:09 afs

@afs thank you!

I'm still not sure about targeting https://github.com/w3c/rdf-tests/tree/main/rdf/rdf12/rdf-n-quads/c14n vs https://w3c.github.io/rdf-canon/tests/. It appears that the rdf12 tests don't cover blank nodes, while the rdf-canon tests do. Should we also consider including https://w3c.github.io/rdf-canon/tests/ ?

kishorebanala avatar Oct 04 '25 18:10 kishorebanala

There are two different uses of "canonicalization" here:

  1. Canonical N-Quads output: this is about the layout of the characters and is covered by rdf12/rdf-n-quads/c14n. The order of quads in the file is not defined, but Jena's N-Quads writer actually writes an Iterator<Quad> (or easily by adding RDFStream).
  2. Canonical dataset: this is the RDFC-1.0 algorithm, which is a consistent labelling of blank nodes. See "4. Canonicalization" and "5. Serialization" in the RDFC-1.0 specification.

In (2), in serialization, rdf-canon sorts the N-Quads and outputs them in order, in the canonical N-Quads form, preserving the blank node labelling from RDFC-1.0. The detailed form of RDF terms is produced by a NodeFormatterNT in WriterStreamRDFPlain.
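
The sorting step can be sketched with the JDK alone. This assumes the quads have already been serialized to one canonical N-Quads line each, and that the ordering is by Unicode code point. `String.compareTo` compares UTF-16 code units instead, which gives a different order for characters outside the Basic Multilingual Plane, so the sketch uses its own comparator. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch only: order serialized N-Quads lines by code point. Not Jena API. */
final class CanonicalSortSketch {

    /** Compares strings by Unicode code point rather than by UTF-16 code unit. */
    static final Comparator<String> CODE_POINT_ORDER = (a, b) -> {
        int i = 0;
        int n = Math.min(a.length(), b.length());
        while (i < n) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(i);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            // Equal code points occupy the same number of chars in both strings.
            i += Character.charCount(ca);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length(), b.length());
    };

    /** Sorts a copy of the lines and joins them, ending each line with a newline. */
    static String serialize(List<String> nquadLines) {
        List<String> sorted = new ArrayList<>(nquadLines);
        sorted.sort(CODE_POINT_ORDER);
        StringBuilder sb = new StringBuilder();
        for (String line : sorted) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }
}
```

Sorting whole lines keeps this step independent of how each term was formatted, so it could sit on top of either the existing writer or a new one.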

There are choices in how to write RDF terms, e.g. lowercase language tags, ECHAR vs UCHAR. This is (1): "3. A Canonical form of N-Quads" in the RDF 1.2 N-Quads specification. It should be exactly the same as "A. A Canonical form of N-Quads" in rdf-canon, because the text was copied over before making some presentation changes. There are other uses for canonical N-Quads, so having it available on its own is valuable.
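
To make the ECHAR vs UCHAR choice concrete, here is a rough sketch of escaping a literal's lexical form. It assumes the rule is: named escapes (ECHAR) for backspace, tab, newline, form feed, carriage return, double quote and backslash; numeric escapes (UCHAR) for the remaining control characters and DEL; everything else written as-is. The uppercase hex digits and the class name are assumptions, and other cases the specifications may require are not handled, so check against the canonical form sections before relying on it.

```java
/** Hypothetical sketch only: escape a literal's lexical form. Not Jena API. */
final class CanonicalLiteralEscapeSketch {

    static String escape(String lexical) {
        StringBuilder sb = new StringBuilder(lexical.length() + 8);
        int i = 0;
        while (i < lexical.length()) {
            int cp = lexical.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                // ECHAR: the characters that have a named escape.
                case 0x08: sb.append("\\b"); break;
                case 0x09: sb.append("\\t"); break;
                case 0x0A: sb.append("\\n"); break;
                case 0x0C: sb.append("\\f"); break;
                case 0x0D: sb.append("\\r"); break;
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                default:
                    if (cp <= 0x1F || cp == 0x7F) {
                        // UCHAR: backslash, 'u', then four hex digits (digit case is an assumption).
                        sb.append(String.format("\\u%04X", cp));
                    } else {
                        // All other characters are written directly.
                        sb.appendCodePoint(cp);
                    }
            }
        }
        return sb.toString();
    }
}
```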

It may be easier to write a new N-Quads writer rather than try to use inheritance and modularity of the existing one because the algorithm of writing N-quads is very small; the common part is only a few lines. We can always combine them later.

afs avatar Oct 05 '25 09:10 afs