Implement RDF Dataset Canonicalization (RDFC-1.0) with Canonical N-Quads Output
Version
jena-5.5.0
Feature
Description:
Implement the W3C RDF Dataset Canonicalization (RDFC-1.0) algorithm in Apache Jena, with output in canonical N-Quads format. This enables deterministic serialization of RDF datasets by assigning canonical identifiers to blank nodes.
References:
- RDFC-1.0 Algorithm: https://www.w3.org/TR/rdf-canon/
- Canonical N-Quads Format: https://www.w3.org/TR/rdf12-n-quads/#canonical-quads
- W3C Test Suite: https://github.com/w3c/rdf-tests/tree/main/rdf/rdf12/rdf-n-quads/c14n and https://w3c.github.io/rdf-canon/tests/
Tasks:
- [ ] Create NQuadsCanonicalWriter class extending WriterDatasetRIOTBase
- [ ] Add NQUADS_CANONICAL format constant to RDFFormat
- [ ] Register canonical writer factory in RDFWriterRegistry
- [ ] Implement RDFC10Canonicalizer with complete RDFC-1.0 algorithm
- [ ] Create HashUtils for SHA-256 hash computations and lexicographic sorting
- [ ] Implement CanonicalIssuer for _c14n_N blank node identifier assignment
- [ ] Add DatasetProcessor for blank node extraction and dataset processing
- [ ] Download and integrate W3C canonicalization test suite to jena-arq/testing/rdf12-wg/rdf-n-quads-c14n/
- [ ] Update Scripts_RIOT_c14n.java test factory following existing RIOT patterns
- [ ] Implement RDFCanonicalizationTest for algorithm validation leveraging https://w3c.github.io/rdf-canon/tests/
- [ ] Add writeCanonical() and canonicalizeDataset() methods to RDFDataMgr
- [ ] Add --canonical flag support to riot command line tool
- [ ] Update documentation and create usage examples
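For the HashUtils task above, a minimal sketch of the hashing helper might look like the following. The class and method names (`HashSketch`, `sha256Hex`) are hypothetical and are not part of Jena; the only assumption is that hashes are taken over the UTF-8 bytes of the input and compared as lowercase hex strings, so that plain string comparison gives the lexicographic order.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not the planned HashUtils API.
final class HashSketch {
    private HashSketch() {}

    /** Returns the lowercase hex SHA-256 digest of the UTF-8 bytes of the input. */
    static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Two hex digits per byte, high nibble first.
                hex.append(Character.forDigit((b >> 4) & 0xF, 16));
                hex.append(Character.forDigit(b & 0xF, 16));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha256Hex("abc"));
    }
}
```

Because the digits are lowercase ASCII, sorting these strings with `String.compareTo` is enough for ordering hashes.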
Are you interested in contributing a solution yourself?
Yes
Looks good!
As part of wider maintenance on the tests, there is a current copy of rdf-tests including the C14N tests for N-Quads.
Scripts_c14n is ready ... just @disabled at the moment.
""" Add writeCanonical() and canonicalizeDataset() methods to RDFDataMgr """
Are the RDFDataMgr changes needed? (RDFDataMgr predates RDFWriterBuilder, so nowadays it is a slightly higher-level, legacy view of the world.)
writeCanonical can be done with RDFWriter.format(...) or one of the existing RDFDataMgr.write(..., RDFFormat) operations.
Have a new class CanonicalizeDataset with a .write (which might be the only public operation).
Somewhere to put the javadoc!
@afs thank you!
I'm still not sure about targeting https://github.com/w3c/rdf-tests/tree/main/rdf/rdf12/rdf-n-quads/c14n vs https://w3c.github.io/rdf-canon/tests/. It appears that the tests for rdf12 don't cover blank nodes, while the rdf-canon tests do. Should we also consider including https://w3c.github.io/rdf-canon/tests/ ?
There are two different uses of "canonicalization" here:
1. Canonical N-Quads output - this is about the layout of the characters and is in rdf12/rdf-n-quads/c14n. The order of quads in the file is not defined but Jena's N-Quads writer actually writes an Iterator<Quad> (or easily by adding RDFStream).
2. Canonical dataset - which is the RDFC-1.0 algorithm - which is a consistent labelling of blank nodes. 4. Canonicalization and 5. Serialization.
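To make "consistent labelling of blank nodes" concrete, here is a rough sketch of an identifier issuer: it hands out labels made of a prefix plus a counter, remembers what it issued, and returns the same label when asked again for the same blank node. The class name `IssuerSketch` is hypothetical, and the `c14n` prefix with a zero-based counter is my reading of RDFC-1.0 rather than anything fixed in this issue.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of an identifier issuer; not an existing Jena class.
final class IssuerSketch {
    private final String prefix;
    private long counter = 0;
    // Insertion order is kept so the issue order can be replayed later.
    private final Map<String, String> issued = new LinkedHashMap<>();

    IssuerSketch(String prefix) {
        this.prefix = prefix;
    }

    /** Returns the label already issued for this blank node, or issues the next one. */
    String issue(String existingLabel) {
        String already = issued.get(existingLabel);
        if (already != null) {
            return already;
        }
        String next = prefix + counter++;
        issued.put(existingLabel, next);
        return next;
    }

    boolean hasIssued(String existingLabel) {
        return issued.containsKey(existingLabel);
    }

    /** Read-only view of original label to issued label, in issue order. */
    Map<String, String> issuedLabels() {
        return Collections.unmodifiableMap(issued);
    }

    public static void main(String[] args) {
        IssuerSketch issuer = new IssuerSketch("c14n");
        System.out.println(issuer.issue("b0"));
        System.out.println(issuer.issue("b1"));
        System.out.println(issuer.issue("b0")); // same label as the first call
    }
}
```

The hard part of the algorithm is deciding the order in which blank nodes are presented to the issuer; the issuer itself stays this small.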
In (2), in serialization, rdf-canon sorts the n-quads and outputs them in order, in the canonical N-Quads form, preserving the blank node labelling from RDFC-1.0. The detailed form of RDFTerms is done by a NodeFormatterNT in WriterStreamRDFPlain.
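The sort-then-output step can be sketched as below. It assumes each quad has already been rendered as one canonical N-Quads line without its trailing newline, and that the required order is Unicode code point order. The class name `CodePointOrderSketch` is hypothetical. A dedicated comparator is used because `String.compareTo` compares UTF-16 code units, which orders supplementary characters differently from code point order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; not an existing Jena class.
final class CodePointOrderSketch {
    private CodePointOrderSketch() {}

    /** Compares strings by Unicode code point rather than by UTF-16 code unit. */
    static final Comparator<String> CODE_POINT_ORDER = (a, b) -> {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    };

    /** Sorts the lines and joins them, one line-feed-terminated line per quad. */
    static String serialize(List<String> nquadLines) {
        List<String> sorted = new ArrayList<>(nquadLines);
        sorted.sort(CODE_POINT_ORDER);
        StringBuilder out = new StringBuilder();
        for (String line : sorted) {
            out.append(line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "_:c14n1 <http://example.org/p> _:c14n0 .",
            "_:c14n0 <http://example.org/p> _:c14n1 .");
        System.out.print(serialize(lines));
    }
}
```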
There are choices in how to write RDF terms, e.g. lowercase language tags, ECHAR vs UCHAR. This is (1), "3. A Canonical form of N-Quads", and it should be exactly the same as "A. A Canonical form of N-Quads" because the text was copied over before making some presentation changes. There are other uses for canonical N-Quads, so having it available on its own is valuable.
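As an illustration of the ECHAR vs UCHAR choice, the sketch below escapes the lexical form of a string literal. It assumes, from my reading of the canonical form, that backspace, tab, line feed, form feed, carriage return, the double quote, and the backslash use ECHAR, that the remaining control characters and DEL use `\u` with four uppercase hex digits, and that everything else is written as-is. Any further character classes that must be escaped are deliberately not handled here, and the class name `LiteralEscapeSketch` is hypothetical.

```java
// Hypothetical sketch; not an existing Jena class and not a complete canonical formatter.
final class LiteralEscapeSketch {
    private LiteralEscapeSketch() {}

    /** Escapes a literal's lexical form for use between double quotes. */
    static String escapeString(String lexical) {
        StringBuilder out = new StringBuilder(lexical.length() + 8);
        for (int i = 0; i < lexical.length(); ) {
            int cp = lexical.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                // ECHAR: short backslash escapes.
                case 0x08: out.append("\\b"); break;
                case 0x09: out.append("\\t"); break;
                case 0x0A: out.append("\\n"); break;
                case 0x0C: out.append("\\f"); break;
                case 0x0D: out.append("\\r"); break;
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                default:
                    if (cp <= 0x1F || cp == 0x7F) {
                        // UCHAR: remaining control characters and DEL.
                        out.append(String.format("\\u%04X", cp));
                    } else {
                        // Everything else is written in its native form.
                        out.appendCodePoint(cp);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println('"' + escapeString("line1\nline2 \"quoted\"") + '"');
    }
}
```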
It may be easier to write a new N-Quads writer rather than try to reuse the existing one through inheritance and modularity, because the algorithm for writing N-Quads is very small; the common part is only a few lines. We can always combine them later.