regen-ledger
Data module alpha2: Full schema
This PR removes support for IPFS CIDs and replaces it with custom hash-based content identifiers that better suit our use cases.
I find IPFS CIDs suboptimal and am removing them because:
- CIDs are supposed to cover both the content format and the hash, but if you look at the multicodec table there are only a few data formats supported (very few of which apply to our use case) and lots of hash algorithms, of which we will only support a few (for some reason `skein` hashes take up most of the table). Yes, I could open PRs for every format we want, but the project just isn't that active, and it also feels wrong to list so many things in the codec which we have zero intention of supporting. Instead, I would rather choose the formats and hashes we think we'll use and not have a bunch of random extraneous stuff.
- I actually don't see a good way to support canonicalization algorithms given this specification, in particular for RDF graphs. With canonicalization, a hash represents a canonical representation of some data which isn't tied to a specific serialization. So for RDF data we wouldn't really say "json-ld" or "n-quads" as the format; rather, we would say the canonicalization algorithm is URDNA2015 with a SHA-256 digest, and json-ld or n-quads is something that can be used at the transport or storage layer, but that's really up to specific implementations. Without going into too many details, this mental model feels like a bad match for the way CID v1 is set up. Maybe they'll change it later, but that doesn't feel like our problem given limited bandwidth.
What I've done instead is define a custom ID message type with:
- an `IDType` enum specifying either:
  - "raw" data which has a corresponding `MediaType`,
  - RDF graph data, or
  - canonicalized geographic data
- a hash `DigestAlgorithm`
- a configurable `GraphCanonicalizationAlgorithm` for RDF data
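To make the shape concrete, here is a minimal proto3 sketch of such an ID message, assuming it is the `ContentID` referenced later in this thread; all field names, tags, and enum values are illustrative, not the final schema:

```proto
syntax = "proto3";

package regen.data.v1alpha2;

// ContentID is a sketch of the custom content identifier described above.
// Field names and numbers are assumptions for illustration.
message ContentID {
  IDType type = 1;
  // media_type is only set when type is ID_TYPE_RAW.
  MediaType media_type = 2;
  DigestAlgorithm digest_algorithm = 3;
  // graph_canonicalization_algorithm is only set for RDF graph data.
  GraphCanonicalizationAlgorithm graph_canonicalization_algorithm = 4;
  // hash is the digest of the (possibly canonicalized) content.
  bytes hash = 5;
}

// IDType distinguishes the three kinds of content described above.
enum IDType {
  ID_TYPE_UNSPECIFIED = 0;
  ID_TYPE_RAW = 1;   // raw data with a corresponding MediaType
  ID_TYPE_GRAPH = 2; // RDF graph data
  ID_TYPE_GEO = 3;   // canonicalized geographic data
}

// MediaType lists a few plausible raw formats (illustrative only).
enum MediaType {
  MEDIA_TYPE_UNSPECIFIED = 0;
  MEDIA_TYPE_PDF = 1;
  MEDIA_TYPE_CSV = 2;
  MEDIA_TYPE_JSON = 3;
}

enum DigestAlgorithm {
  DIGEST_ALGORITHM_UNSPECIFIED = 0;
  DIGEST_ALGORITHM_SHA256 = 1;
}

enum GraphCanonicalizationAlgorithm {
  GRAPH_CANONICALIZATION_ALGORITHM_UNSPECIFIED = 0;
  GRAPH_CANONICALIZATION_ALGORITHM_URDNA2015 = 1;
}
```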
High-level intentions of this design are to:
- use the RDF graph data model as the basis for structured on-chain data to align with W3C-oriented work like DIDs, JSON-LD, Verifiable Credentials, and Linked Data Proofs
- have a data format which already talks about canonicalization and multiple data representations rather than having to do that work on our own like we did with protobuf
- ease coordination of on and off-chain data using RDF/"Linked Data" which is about distributed data to begin with
- have flexible, extensible schemas and an easy mapping to database analytics via SPARQL/quad stores
- have canonical hashed representations of geographic data to ease coordination of on and off-chain data
- allow for a useful set of raw data formats to be hashed including PDFs, media files that may be associated with projects, and basic stuff like CSV and JSON files (although hopefully RDF will be preferred for that data)
Follow-ups after this:
- update implementation and remove `v1alpha1`
- specify URI mappings for IDs
- support on-chain graph data storage
Codecov Report
Merging #221 (5a03f48) into master (0048984) will not change coverage. The diff coverage is n/a.
```
@@           Coverage Diff           @@
##           master     #221   +/-  ##
=======================================
  Coverage   62.73%   62.73%
=======================================
  Files          47       47
  Lines        2965     2965
=======================================
  Hits         1860     1860
  Misses        893      893
  Partials      212      212
```
Thanks for this @aaronc ! In general I'm very much in favor of us being more explicit about data types when thinking about anchoring/signing/storing functionality as described here. A few questions I have about the specific approach though:
- What is our expectation of off-chain data storage? With "CID"s from IPFS, the hash gives us a mechanism to look up the actual content of the data on an off-chain network. That is no longer the case here. @aaronc and I briefly discussed this on a call yesterday, and it seems like the strategy we might pursue for this is having a "RegisterDataResolver" type of msg that allows one to register a URL / IRI / etc. that can be used to retrieve the contents of a given `ContentID`.
So I think we could still use IPFS; we'd just maybe specify that all content types map to `raw` in IPFS except for maybe JSON.
We'd also want to specify an API (maybe in gRPC) for services which store content-addressable data, possibly with proofs for querying parts of graphs.
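For illustration, a hypothetical shape for that resolver-registration message, extending the `ContentID` sketch above, might look like this; the message and field names are assumptions based on this discussion, not part of the PR:

```proto
// MsgRegisterDataResolver is a hypothetical message for registering a
// resolver endpoint for existing content; all names are illustrative.
message MsgRegisterDataResolver {
  // manager is the address registering the resolver endpoint.
  string manager = 1;
  // resolver_url is a URL / IRI / etc. where the content can be retrieved.
  string resolver_url = 2;
  // content_id identifies the data this resolver can serve.
  ContentID content_id = 3;
}
```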
- For RDF data and polygon data, are there use cases in which we expect end users to only anchor some geometric data / RDF data, and not actually store it on chain? How would our approach differ if all RDF & geometric data were always expected to be stored (I imagine if we enforce this, it may be to our benefit)? Would that change any of our design choices here? Perhaps the reason for still having a `ContentID` associated with a geo polygon and RDF graph is that it can then allow RDF graphs to reference other RDF graphs via the `ContentID`?
Yeah, basically this design allows graph data to refer to other graph or geography data whether this lives on or off chain. Credit classes might require data to live on-chain in their metadata schemas.
- If we look to concrete use cases, I imagine that "Geography" data is actually not data that lives in isolation, but rather is more likely to exist as metadata associated with a piece of media data or other kind of raw data. Can you elaborate on a concrete use case for geography data being "signed" in isolation? What is that signing / timestamping of a geography object attesting to?
Geographic data generally wouldn't be signed in isolation. It's a separate type because it's large (compared to graphs) and could be reused between multiple graphs... A graph might be ~1kb vs a polygon that's 100-200kb depending on the size of land and sampling resolution.
So one thing I'm thinking is whether `Msg/Sign` should really only apply to graphs... because other formats can't necessarily be implied to have implicit meaning. Except maybe PDFs?
> So one thing I'm thinking is whether `Msg/Sign` should really only apply to graphs... because other formats can't necessarily be implied to have implicit meaning. Except maybe PDFs?
Makes sense. I think I agree with this logic. I'd rather not have exceptions for specific media types (e.g. PDF), but rather just choose one strategy & stick with it.
In general my inclination is to pick something that satisfies our most basic needs & use cases, and iterate later as our needs evolve.
@aaronc would you like to include `MsgStoreRDFData` and `MsgStoreGraphData` definitions in this PR as well? Or tackle those in a follow-up?
> @aaronc would you like to include `MsgStoreRDFData` and `MsgStoreGraphData` definitions in this PR as well? Or tackle those in a follow-up?
Is it helpful to see the full picture? If so, then yes I can.
I made a few updates to show the bigger picture. There are a few more things that I have in mind which I'll try to get to later today
@clevinson @blushi so the thing I was trying to explain there at the end is that right now I have three ways of representing each piece of data:
- IRIs - these exist in the actual graphs when users see them and when they're canonicalized
- `uint64` IDs - these would be auto-generated for all IRIs, including ones like `https://regen.network/schema#...`, and are used to save space when storing graphs on chain and sending `Msg`s
- `ContentHash` - this is really a descriptor for what gets serialized into the IRIs
We could maybe get rid of `ContentHash` from most APIs and just have users construct IRIs themselves, although I think it's useful to have `ContentHash` to explicitly construct IRIs with the right fields set.
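As a rough illustration of `ContentHash` as a descriptor, reusing the enums sketched earlier in the thread: only the `ContentHash.Graph` variant name comes from this discussion, and the other variants and field names are assumptions:

```proto
// ContentHash sketch: a descriptor for what gets serialized into an IRI.
// Only the Graph variant name comes from this thread; the rest is assumed.
message ContentHash {
  oneof sum {
    Raw raw = 1;
    Graph graph = 2;
    Geo geo = 3;
  }

  // Raw describes raw data identified by a media type and digest.
  message Raw {
    bytes hash = 1;
    DigestAlgorithm digest_algorithm = 2;
    MediaType media_type = 3;
  }

  // Graph describes RDF graph data hashed after canonicalization
  // (e.g. URDNA2015 with a SHA-256 digest).
  message Graph {
    bytes hash = 1;
    DigestAlgorithm digest_algorithm = 2;
    GraphCanonicalizationAlgorithm canonicalization_algorithm = 3;
  }

  // Geo describes canonicalized geographic data.
  message Geo {
    bytes hash = 1;
    DigestAlgorithm digest_algorithm = 2;
  }
}
```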
The `uint64` IDs feel useful to reduce storage space, and just having fixed 8-byte index keys will make managing the KV store a lot simpler. We could maybe mostly remove them from user-facing APIs and just use IRIs and `ContentHash`es, but I do like the idea of this built-in compression, which is why this API uses `uint64`s pretty often.
The workflow for constructing one of these compressed `Graph` data structures which uses `uint64` IDs shouldn't be so bad. Here's what someone would do:
- create a dataset in JSON-LD and compute the URDNA2015 hash on the client side
- call `Query/ConvertToCompactDataset` with the JSON-LD content and this will return a `Graph.Dataset` with `uint64` IDs filled in
- (if needed) if some IRIs are not registered, `Query/ConvertToCompactDataset` will return a list of those IRIs and instruct the user to call `MsgRegisterIRIs` with those IRIs. Then the user will be able to call `Query/ConvertToCompactDataset` without errors
- call `Msg/StoreGraphData` with the compact `Graph.Dataset` and the URDNA2015 hash generated client side encoded into `ContentHash.Graph`. This way there is efficiency at the storage and transport layer with no need for the user to fully trust a node because of the client-side URDNA2015 hash
- `Msg/StoreGraphData` will return a `uint64` ID which the user can use for other calls like `Msg/ValidateGraph` if needed. The IRI internally representing this graph can also be retrieved with `Query/IRIsByID`.
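Pulling the endpoints from this workflow together, the service surface might look roughly like the following. Only the method names come from the steps above; all request/response shapes are assumptions, and `Graph.Dataset` and `ContentHash.Graph` are assumed to be defined elsewhere in the schema:

```proto
// Sketch of the query/msg surface implied by the workflow above.
service Query {
  // JSON-LD in, compact dataset with uint64 IDs filled in (or the
  // list of unregistered IRIs) out.
  rpc ConvertToCompactDataset(QueryConvertToCompactDatasetRequest)
      returns (QueryConvertToCompactDatasetResponse);
  // Resolve internal uint64 IDs back to their IRIs.
  rpc IRIsByID(QueryIRIsByIDRequest) returns (QueryIRIsByIDResponse);
}

service Msg {
  // Register IRIs so they get uint64 IDs assigned.
  rpc RegisterIRIs(MsgRegisterIRIs) returns (MsgRegisterIRIsResponse);
  // Store a compact dataset plus the client-side URDNA2015 hash;
  // returns a uint64 ID for the stored graph.
  rpc StoreGraphData(MsgStoreGraphData) returns (MsgStoreGraphDataResponse);
  // Validate a stored data graph against a shapes graph.
  rpc ValidateGraph(MsgValidateGraph) returns (MsgValidateGraphResponse);
}

message QueryConvertToCompactDatasetRequest {
  bytes json_ld = 1;
}
message QueryConvertToCompactDatasetResponse {
  Graph.Dataset dataset = 1;
  repeated string unregistered_iris = 2; // pass these to MsgRegisterIRIs
}
message QueryIRIsByIDRequest { repeated uint64 ids = 1; }
message QueryIRIsByIDResponse { repeated string iris = 1; }

message MsgRegisterIRIs {
  string signer = 1;
  repeated string iris = 2;
}
message MsgRegisterIRIsResponse { repeated uint64 ids = 1; }

message MsgStoreGraphData {
  string signer = 1;
  Graph.Dataset dataset = 2;
  ContentHash.Graph graph_hash = 3; // computed client side (URDNA2015)
}
message MsgStoreGraphDataResponse { uint64 id = 1; }

message MsgValidateGraph {
  string signer = 1;
  uint64 data_graph_id = 2;
  uint64 shapes_graph_id = 3;
}
message MsgValidateGraphResponse { bool valid = 1; }
```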
Does that seem reasonable? Thoughts on this API design and usage of these three types of identifiers in different places?
Also in terms of URI construction, we could also just always use an extension and hide everything else beneath base58 encoding, so for a dataset we could actually have `regen:6EqBg1puBvfGDxMSbxah9R.rdf` instead of `regen:g/6EqBg1puBvfGDxMSbxah9R`. We could maybe use a `.geo` or `.wkt` extension for geography and the regular extension for other raw data types... Then maybe IRIs would be a bit more consistent. What do you think?
> The `uint64` IDs feel useful to reduce storage space, and just having fixed 8-byte index keys will make managing the KV store a lot simpler. We could maybe mostly remove them from user-facing APIs and just use IRIs and `ContentHash`es, but I do like the idea of this built-in compression, which is why this API uses `uint64`s pretty often.
Having users manage the cognitive overhead of two separate IDs as well as manually registering all IRIs feels like a lot of complexity to be introducing to app & client developers. If the concern is just about storage, why can't we just use something under the hood that doesn't get exposed to users at all (similar to how we handle protobuf `Any` compression in the SDK)?
I feel like we're already asking a lot of end users to be learning the world of RDF / SHACL (and for good reason), and would love to keep additional complexity / DevX burden to a minimum.
Can we get most of the performance / storage benefits you desire by taking a more under-the-hood approach to compression?
Well, we can't compress the transactions that are in the tendermint block store. And compared to protobuf, these IRIs make up the primary content of RDF. And even if we got rid of the uint64 IDs, I wouldn't feel comfortable directly parsing JSON-LD or turtle in the state machine. I mean maybe we could, but it wouldn't necessarily be less work to implement, and I want to limit on-chain state machine logic to the minimum. A SHACL subset already feels relatively complex. So I think there would be a conversion step from JSON-LD to another format either way.
One thing we could do is only use uint64 IDs in the compact graph format for `StoreGraphData`, and other methods would only use `ContentHash` or the IRIs. Maybe as a starting point I can make that change. Wdyt?
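For intuition, one hypothetical shape for such a compact format, where a dictionary of uint64 IDs does the compression; every name here is an assumption meant only to illustrate the idea:

```proto
// Hypothetical compact graph format: IRIs appear once (or are already
// registered on chain) and quads refer to terms by uint64 ID.
message CompactDataset {
  // new_iris are IRIs not yet registered on chain; IDs are assigned
  // to them deterministically when the dataset is stored.
  repeated string new_iris = 1;
  // quads reference subjects, predicates, objects, and graph names by
  // uint64 ID instead of repeating full IRI strings.
  repeated CompactQuad quads = 2;
}

message CompactQuad {
  uint64 subject = 1;
  uint64 predicate = 2;
  // Literal objects would need their own representation; elided here.
  uint64 object = 3;
  uint64 graph = 4;
}
```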
Actually I think I've thought of a way to hide most of the complexity of the uint64 IDs. I'll update this PR with the new design soon.
> - create a dataset in JSON-LD and compute the URDNA2015 hash on the client side
> - call `Query/ConvertToCompactDataset` with the JSON-LD content and this will return a `Graph.Dataset` with `uint64` IDs filled in
> - (if needed) if some IRIs are not registered, `Query/ConvertToCompactDataset` will return a list of those IRIs and instruct the user to call `MsgRegisterIRIs` with those IRIs. Then the user will be able to call `Query/ConvertToCompactDataset` without errors
> - call `Msg/StoreGraphData` with the compact `Graph.Dataset` and the URDNA2015 hash generated client side encoded into `ContentHash.Graph`. This way there is efficiency at the storage and transport layer with no need for the user to fully trust a node because of the client-side URDNA2015 hash
> - `Msg/StoreGraphData` will return a `uint64` ID which the user can use for other calls like `Msg/ValidateGraph` if needed. The IRI internally representing this graph can also be retrieved with `Query/IRIsByID`.
So now I've simplified this to the following:
- create a dataset in JSON-LD and compute the URDNA2015 hash on the client side
- call `Query/ConvertToCompactDataset` with the JSON-LD content and this will return a `CompactDataset`
- call `Msg/StoreGraphData` with the `CompactDataset` and the URDNA2015 hash generated client side encoded into `ContentHash.Graph`. This way there is efficiency at the storage and transport layer with no need for the user to fully trust a node because of the client-side URDNA2015 hash
- `Msg/ValidateGraph` now uses `ContentHash.Graph` to identify the data and shapes graphs, with optional named graph strings in case we're pointing to graphs within datasets
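Concretely, revising the earlier sketch, the simplified message shapes might look something like this; only the type names come from the steps above, and all field names and numbers are assumptions:

```proto
// Sketch of the simplified Msg API; field names/numbers are assumptions.
message MsgStoreGraphData {
  string signer = 1;
  // dataset is the compact form returned by Query/ConvertToCompactDataset.
  CompactDataset dataset = 2;
  // graph_hash is the URDNA2015 hash computed on the client side.
  ContentHash.Graph graph_hash = 3;
}

message MsgValidateGraph {
  string signer = 1;
  // data_graph and shapes_graph identify the graphs by content hash.
  ContentHash.Graph data_graph = 2;
  ContentHash.Graph shapes_graph = 3;
  // Optional named graph strings, for pointing to graphs within datasets.
  string data_named_graph = 4;
  string shapes_named_graph = 5;
}
```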
So now:
- all of the `Msg` APIs now use `ContentHash.Graph` to refer to graphs
- the `uint64` IDs are hidden behind `CompactDataset` for most use cases, with a few utility methods which most users wouldn't need to use
- IRIs only get exposed on the query side
Let me know what you think @clevinson.
One thought I have is whether the IRIs should play a role at all in the `Msg` APIs - maybe in the return types, but maybe it's fine as it is and they mostly just get exposed on the query side of things.
This is on hold until we revisit on-chain storage.