Data module alpha2: Full schema

Open aaronc opened this issue 4 years ago • 14 comments

This PR removes support for IPFS CIDs and replaces them with custom hash-based content identifiers that better suit our use cases.

I find IPFS CIDs suboptimal and am removing them because:

  • CIDs are supposed to cover both the content format and hash, but if you look at the multicodec table there are only a few data formats supported (very few of which apply to our use case) and lots of hash algorithms of which we will only support a few (for some reason Skein hashes take up most of the table). Yes, I could open PRs for every format we want, but the project just isn't that active, and it also feels wrong to list so many things in the codec which we have zero intention of supporting. Instead, I would rather choose the formats and hashes we think we'll use and not have a bunch of random extraneous stuff.
  • I actually don't see a good way to support canonicalization algorithms given this specification, in particular for RDF graphs. With canonicalization, a hash represents a canonical representation of some data which isn't tied to a specific serialization. So for RDF data we wouldn't really say "json-ld" or "n-quads" as the format; rather, we would say the canonicalization algorithm is URDNA2015 with a SHA-256 digest, and json-ld or n-quads is something that can be used at the transport or storage layer, but that's really up to specific implementations. Without going into too many details, it feels like this mental model is just a bad match for the way CID v1 is set up. Maybe they'll change it later, but that doesn't feel like our problem given limited bandwidth.
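To make that second point concrete, here is a minimal Go sketch of what "URDNA2015 with a SHA-256 digest" means in practice. The third-party github.com/piprate/json-gold library is just one implementation choice for illustration, not something this PR depends on; the digest identifies the canonical n-quads regardless of whether the graph arrived as JSON-LD, Turtle, etc.:

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"log"

	"github.com/piprate/json-gold/ld"
)

// urdna2015Hash canonicalizes a JSON-LD document with URDNA2015 and
// returns the SHA-256 digest of the resulting canonical n-quads. The
// digest is the same no matter which serialization the graph arrived in.
func urdna2015Hash(jsonLD []byte) ([32]byte, error) {
	var doc interface{}
	if err := json.Unmarshal(jsonLD, &doc); err != nil {
		return [32]byte{}, err
	}

	proc := ld.NewJsonLdProcessor()
	opts := ld.NewJsonLdOptions("")
	opts.Algorithm = ld.AlgorithmURDNA2015 // canonicalization algorithm
	opts.Format = "application/n-quads"    // canonical wire form

	normalized, err := proc.Normalize(doc, opts)
	if err != nil {
		return [32]byte{}, err
	}
	return sha256.Sum256([]byte(normalized.(string))), nil
}

func main() {
	digest, err := urdna2015Hash([]byte(`{
		"@context": {"name": "http://schema.org/name"},
		"name": "Project A"
	}`))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%x\n", digest[:])
}
```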

What I've done instead is define a custom ID message type (roughly sketched below) with:

  • an IDType enum specifying one of:
    • "raw" data, which has a corresponding MediaType,
    • RDF graph data, or
    • canonicalized geographic data
  • a DigestAlgorithm specifying the hash function
  • a configurable GraphCanonicalizationAlgorithm for RDF data
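The authoritative definitions are the proto files in this PR's diff; as a rough Go rendering of the shape described above (field and enum names here are illustrative, not the actual generated code):

```go
// Illustrative rendering of the ID message described above.

type IDType int32

const (
	IDTypeUnspecified IDType = iota
	IDTypeRaw               // arbitrary bytes, interpreted via MediaType
	IDTypeGraph             // canonicalized RDF graph data
	IDTypeGeo               // canonicalized geographic data
)

type DigestAlgorithm int32

const (
	DigestUnspecified DigestAlgorithm = iota
	DigestSHA256 // a small curated set, rather than the full multihash table
)

type GraphCanonicalizationAlgorithm int32

const (
	GraphCanonUnspecified GraphCanonicalizationAlgorithm = iota
	GraphCanonURDNA2015
)

// ID identifies a piece of content by the hash of its canonical form.
type ID struct {
	Type                IDType
	Hash                []byte
	DigestAlgorithm     DigestAlgorithm
	MediaType           string                         // only for IDTypeRaw
	CanonicalizationAlg GraphCanonicalizationAlgorithm // only for IDTypeGraph
}
```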

High-level intentions of this design are to:

  • use the RDF graph data model as the basis for structured on-chain data to
    • align with W3C oriented work like DIDs, JSON-LD, Verifiable Credentials, and Linked Data Proofs
    • have a data format which already talks about canonicalization and multiple data representations rather than having to do that work on our own like we did with protobuf
    • ease coordination of on and off-chain data using RDF/"Linked Data" which is about distributed data to begin with
    • have flexible, extensible schemas and an easy mapping to database analytics via SPARQL/quad stores
  • have canonical hashed representations of geographic data to ease coordination of on and off-chain data
  • allow for a useful set of raw data formats to be hashed including PDFs, media files that may be associated with projects, and basic stuff like CSV and JSON files (although hopefully RDF will be preferred for that data)

Follow-ups after this:

  • update implementation and remove v1alpha1
  • specify URI mappings for IDs
  • support on-chain graph data storage

aaronc avatar Jan 18 '21 16:01 aaronc

Codecov Report

Merging #221 (5a03f48) into master (0048984) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #221   +/-   ##
=======================================
  Coverage   62.73%   62.73%           
=======================================
  Files          47       47           
  Lines        2965     2965           
=======================================
  Hits         1860     1860           
  Misses        893      893           
  Partials      212      212           

codecov[bot] avatar Jan 18 '21 16:01 codecov[bot]

Thanks for this @aaronc! In general I'm very much in favor of us being more explicit about data types when thinking about anchoring/signing/storing functionality as described here. A few questions I have about the specific approach though:

  1. What is our expectation of off-chain data storage? With CIDs from IPFS we get, along with the hash, a mechanism to look up the actual content of the data on an off-chain network. That is no longer the case here. @aaronc and I briefly discussed this on a call yesterday, and it seems like the strategy we might pursue for this is having a "RegisterDataResolver" type of msg that allows one to register a URL / IRI / etc. that can be used to retrieve the contents of a given ContentID.

So I think we could still use IPFS; we'd just maybe specify that all content types map to raw in IPFS except maybe for JSON.

We'd also want to specify an API (maybe in gRPC) for services which store content-addressable data, possibly with proofs for querying parts of graphs.
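A rough sketch of what those two ideas could look like (all names here are hypothetical; nothing in this PR defines them):

```go
// Hypothetical shapes for the resolver-registration msg and the kind of
// gRPC-style API an off-chain, content-addressable store might expose.

// MsgRegisterDataResolver would let anyone advertise an endpoint that
// can serve the content behind given ContentIDs.
type MsgRegisterDataResolver struct {
	Manager     string   // account registering the resolver
	ResolverURL string   // e.g. an HTTPS or IPFS gateway base URL
	ContentIDs  [][]byte // content hashes this resolver claims to serve
}

// ContentService is an illustrative interface for a service that stores
// content-addressable data.
type ContentService interface {
	// Get returns the raw bytes whose hash is contentID.
	Get(contentID []byte) ([]byte, error)
	// GetGraphSlice could return part of a graph together with a proof
	// that it belongs to the dataset identified by contentID.
	GetGraphSlice(contentID []byte, query string) ([]byte, error)
}
```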

  2. For RDF data and polygon data, are there use cases in which we expect end users to only anchor some geometric data / RDF data, and not actually store it on chain? How would our approach differ if all RDF & geometric data were always expected to be stored (I imagine if we enforce this, it may be to our benefit)? Would that change any of our design choices here? Perhaps the reason for still having a ContentID associated with a geopolygon and RDF graph is that it can then allow RDF graphs to reference other RDF graphs via the ContentID?

Yeah, basically this design allows graph data to refer to other graph or geography data whether this lives on or off chain. Credit classes might require data to live on-chain in their metadata schemas.

  3. If we look to concrete use cases, I imagine that "Geography" data is actually not data that lives in isolation; rather, it is more likely to exist as metadata associated with a piece of media data or other kind of raw data. Can you elaborate on a concrete use case for geography data being "signed" in isolation? What is the signing / timestamping of a geography object attesting to?

Geographic data generally wouldn't be signed in isolation. It's a separate type because it's large (compared to graphs) and could be reused between multiple graphs... A graph might be ~1kb vs a polygon that's 100-200kb depending on the size of the land and sampling resolution.

aaronc avatar Jan 21 '21 19:01 aaronc

So one thing I'm thinking is whether Msg/Sign should really only apply to graphs... because other formats can't necessarily be assumed to have implicit meaning. Except maybe PDFs?

aaronc avatar Jan 22 '21 21:01 aaronc

So one thing I'm thinking is whether Msg/Sign should really only apply to graphs... because other formats can't necessarily be assumed to have implicit meaning. Except maybe PDFs?

Makes sense. I think I agree with this logic. I'd rather not have exceptions for specific media types (e.g. PDF), but rather just choose one strategy & stick with it.

In general my inclination is to pick something that satisfies our most basic needs & use cases, and iterate later as our needs evolve.

@aaronc would you like to include MsgStoreRDFData and MsgStoreGraphData definitions in this PR as well? Or tackle those in a follow up ?

clevinson avatar Jan 23 '21 00:01 clevinson

@aaronc would you like to include MsgStoreRDFData and MsgStoreGraphData definitions in this PR as well? Or tackle those in a follow up ?

Is it helpful to see the full picture? If so, then yes I can.

aaronc avatar Jan 23 '21 00:01 aaronc

I made a few updates to show the bigger picture. There are a few more things that I have in mind which I'll try to get to later today

aaronc avatar Jan 25 '21 15:01 aaronc

@clevinson @blushi so the thing I was trying to explain there at the end is that right now I have three ways of representing each piece of data:

  • IRIs - these exist in the actual graphs when users see them and when they're canonicalized
  • uint64 IDs - these would be auto-generated for all IRIs, including ones like https://regen.network/schema#..., and are used to save space when storing graphs on chain and sending Msgs (sketched below)
  • ContentHash - this is really a descriptor for what gets serialized into the IRIs
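As an illustration of how the second representation could work, here is a toy in-memory version of the IRI registry idea; the real thing would live in the module's KV store, and all names here are hypothetical:

```go
// IRITable is a stand-in for an on-chain IRI <-> uint64 mapping.
type IRITable struct {
	nextID uint64
	byIRI  map[string]uint64
	byID   map[uint64]string
}

func NewIRITable() *IRITable {
	return &IRITable{nextID: 1, byIRI: map[string]uint64{}, byID: map[uint64]string{}}
}

// Register assigns a fresh 8-byte ID to an IRI the first time it is
// seen; graphs stored on chain then reference the ID, not the IRI.
func (t *IRITable) Register(iri string) uint64 {
	if id, ok := t.byIRI[iri]; ok {
		return id
	}
	id := t.nextID
	t.nextID++
	t.byIRI[iri] = id
	t.byID[id] = iri
	return id
}

// IRIByID resolves a uint64 ID back to its IRI.
func (t *IRITable) IRIByID(id uint64) (string, bool) {
	s, ok := t.byID[id]
	return s, ok
}
```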

We could maybe get rid of ContentHash from most APIs and just have users construct IRIs themselves, although I think it's useful to have ContentHash to explicitly construct IRIs with the right fields set.

The uint64 IDs feel useful to reduce storage space, and just having fixed 8-byte index keys will make managing the KV store a lot simpler. We could maybe mostly remove them from user-facing APIs and just use IRIs and ContentHashes, but I do like the idea of this built-in compression, which is why this API uses uint64s pretty often.

The workflow for constructing one of these compressed Graph data structures which uses uint64 IDs shouldn't be so bad. Here's what someone would do (a rough client-side sketch follows the list):

  1. create a dataset in JSON-LD and compute the URDNA2015 hash on the client side
  2. call Query/ConvertToCompactDataset with the JSON-LD content and this will return a Graph.Dataset with uint64 IDs filled in
  3. (if needed) If some IRIs are not registered, Query/ConvertToCompactDataset will return a list of those IRIs and instruct the user to call MsgRegisterIRIs with those IRIs. Then the user will be able to call Query/ConvertToCompactDataset without errors
  4. call Msg/StoreGraphData with the compact Graph.Dataset and the URDNA2015 hash generated client side encoded into ContentHash.Graph. This way there is efficiency at the storage and transport layer with no need for the user to fully trust a node because of the client-side URDNA2015 hash
  5. Msg/StoreGraphData will return a uint64 ID which the user can use for other calls like Msg/ValidateGraph if needed. The IRI internally representing this graph can also be retrieved with Query/IRIsByID.
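Putting the steps together, a hypothetical client-side flow in Go might look like this. The QueryClient/MsgClient interfaces and the GraphDataset type are stand-ins for whatever the generated gRPC clients end up looking like, and urdna2015Hash is the helper sketched earlier in this thread:

```go
// GraphDataset stands in for the Graph.Dataset message described above:
// quads encoded with uint64 IDs in place of IRIs.
type GraphDataset struct{}

type QueryClient interface {
	// ConvertToCompactDataset returns the compact form plus any IRIs
	// that are not yet registered on chain.
	ConvertToCompactDataset(jsonLD []byte) (*GraphDataset, []string, error)
}

type MsgClient interface {
	RegisterIRIs(iris []string) error
	// StoreGraphData stores the compact dataset, checks it against the
	// client-supplied URDNA2015 hash, and returns the graph's uint64 ID.
	StoreGraphData(ds *GraphDataset, urdnaHash [32]byte) (uint64, error)
}

func storeDataset(q QueryClient, m MsgClient, jsonLD []byte) (uint64, error) {
	// 1. hash client side so the node never has to be trusted about
	//    the content's identity
	hash, err := urdna2015Hash(jsonLD)
	if err != nil {
		return 0, err
	}

	// 2. convert to the compact uint64-encoded form
	compact, missing, err := q.ConvertToCompactDataset(jsonLD)
	if err != nil {
		return 0, err
	}

	// 3. register any unknown IRIs, then convert again
	if len(missing) > 0 {
		if err := m.RegisterIRIs(missing); err != nil {
			return 0, err
		}
		if compact, _, err = q.ConvertToCompactDataset(jsonLD); err != nil {
			return 0, err
		}
	}

	// 4.-5. store the compact dataset together with the client-side hash
	return m.StoreGraphData(compact, hash)
}
```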

Does that seem reasonable? Thoughts on this API design and usage of these three types of identifiers in different places?

aaronc avatar Jan 26 '21 17:01 aaronc

Also in terms of URI construction, we could also just always use an extension and hide everything else beneath base58 encoding, so for a dataset we could actually have regen:6EqBg1puBvfGDxMSbxah9R.rdf instead of regen:g/6EqBg1puBvfGDxMSbxah9R. We could maybe use a .geo or .wkt extension for geography and the regular extension for other raw data types... Then maybe IRIs would be a bit more consistent. What do you think?
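For illustration, a minimal Go rendering of that extension-based scheme, using the btcsuite base58 package as one possible encoder (the package choice and function names are assumptions, not part of this PR):

```go
package main

import (
	"fmt"

	"github.com/btcsuite/btcd/btcutil/base58"
)

// graphIRI and geoIRI render a content hash in the scheme floated above:
// base58-encode the digest and pick the extension by content type.
func graphIRI(hash []byte) string {
	return "regen:" + base58.Encode(hash) + ".rdf"
}

func geoIRI(hash []byte) string {
	return "regen:" + base58.Encode(hash) + ".wkt"
}

func main() {
	fmt.Println(graphIRI([]byte{0x12, 0x34})) // regen:<base58>.rdf
}
```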

aaronc avatar Jan 26 '21 17:01 aaronc

The uint64 IDs feel useful to reduce storage space, and just having fixed 8-byte index keys will make managing the KV store a lot simpler. We could maybe mostly remove them from user-facing APIs and just use IRIs and ContentHashes, but I do like the idea of this built-in compression, which is why this API uses uint64s pretty often.

Having users manage the cognitive overhead of two separate IDs, as well as manually registering all IRIs, feels like a lot of complexity to be introducing to app & client developers. If the concern is just about storage, why can't we just use something under the hood that doesn't get exposed to users at all (similar to how we handle protobuf Any compression in the SDK)?

I feel like we're already asking a lot of end users to be learning the world of RDF / SHACL (and for good reason), and would love to keep additional complexity / DevX burden to a minimum.

Can we get most of the performance / storage benefits you desire by taking a more under-the-hood approach to compression?

clevinson avatar Jan 27 '21 22:01 clevinson

Well, we can't compress the transactions that are in the Tendermint block store. And compared to protobuf, these IRIs make up the primary content of RDF. And even if we got rid of the uint64 IDs, I wouldn't feel comfortable directly parsing JSON-LD or Turtle in the state machine. I mean, maybe we could, but it wouldn't necessarily be less work to implement, and I want to limit on-chain state machine logic to the minimum. A SHACL subset already feels relatively complex. So I think there would be a conversion step from JSON-LD to another format either way.

aaronc avatar Jan 27 '21 23:01 aaronc

One thing we could do is only use uint64 IDs in the compact graph format for StoreGraphData and other methods would only use ContentHash or the IRIs. Maybe as a starting point I can make that change. Wdyt?

aaronc avatar Jan 27 '21 23:01 aaronc

Actually I think I've thought of a way to hide most of the complexity of the uint64 IDs. I'll update this PR with the new design soon.

aaronc avatar Jan 27 '21 23:01 aaronc

  1. create a dataset in JSON-LD and compute the URDNA2015 hash on the client side
  2. call Query/ConvertToCompactDataset with the JSON-LD content and this will return a Graph.Dataset with uint64 IDs filled in
  3. (if needed) If some IRIs are not registered, Query/ConvertToCompactDataset will return a list of those IRIs and instruct the user to call MsgRegisterIRIs with those IRIs. Then the user will be able to call Query/ConvertToCompactDataset without errors
  4. call Msg/StoreGraphData with the compact Graph.Dataset and the URDNA2015 hash generated client side encoded into ContentHash.Graph. This way there is efficiency at the storage and transport layer with no need for the user to fully trust a node because of the client-side URDNA2015 hash
  5. Msg/StoreGraphData will return a uint64 ID which the user can use for other calls like Msg/ValidateGraph if needed. The IRI internally representing this graph can also be retrieved with Query/IRIsByID.

So now I've simplified this to the following:

  1. create a dataset in JSON-LD and compute the URDNA2015 hash on the client side
  2. call Query/ConvertToCompactDataset with the JSON-LD content and this will return a CompactDataset
  3. call Msg/StoreGraphData with the CompactDataset and the URDNA2015 hash generated client side encoded into ContentHash.Graph. This way there is efficiency at the storage and transport layer with no need for the user to fully trust a node, because of the client-side URDNA2015 hash
  4. Msg/ValidateGraph now uses ContentHash.Graph to identify the data and shapes graphs, with optional named graph strings in case we're pointing to a graph within a dataset

So now:

  • all of the Msg APIs now use ContentHash.Graph to refer to graphs
  • the uint64 IDs are hidden behind CompactDataset for most use cases, with a few utility methods which most users wouldn't need to use
  • IRIs only get exposed on the query side
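Concretely, the validation call from step 4 might have roughly this shape (illustrative only; the actual proto definitions are in the PR diff, and DigestAlgorithm / GraphCanonicalizationAlgorithm are the enum sketches from earlier in the thread):

```go
// Hypothetical shape of the simplified Msg/ValidateGraph.
type ContentHashGraph struct {
	Hash                []byte
	DigestAlgorithm     DigestAlgorithm
	CanonicalizationAlg GraphCanonicalizationAlgorithm
}

type MsgValidateGraph struct {
	DataGraph        ContentHashGraph
	DataNamedGraph   string // optional, when pointing into a dataset
	ShapesGraph      ContentHashGraph
	ShapesNamedGraph string // optional
}
```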

Let me know what you think @clevinson.

One thought I have is whether the IRIs should play a role at all in the Msg APIs - maybe in the return types, but maybe it's fine as it is and they mostly just get exposed on the query side of things.

aaronc avatar Jan 28 '21 01:01 aaronc

This is on hold until we revisit on-chain storage

clevinson avatar Nov 10 '21 17:11 clevinson