go-ipld-prime API design review: linking: can we make it simpler?

trafficstars

Building links is currently... significantly twisty. How much of this is essential complexity and how much of it is accidental (or at least, avoidable, in at least some common cases)?

There are several reasons links are the way they are right now. Maybe some of them can be ditched as axioms. But let's identify them:

Links are an interface in the core package of go-ipld-prime... because they're an important core concept. Nuff said.
Links do not have a concrete implementation in the core package of go-ipld-prime... for several reasons:
- go-cid is not a pleasant transitive dependency. It causes a staggering increase in transitive dependency count as well as in final binary size, and I don't like it. If you import the packages related to CIDs, you get this dependency (and all its transitives); if you don't, you shouldn't.
- There are some unfinished discussions about extending the concept of linking in IPLD to support hash+path links. It's possible that the future resolution of this discussions will result in the necessity of an interface here anyway... and the one we've got now would suit for this.
- Conceptually, IPLD needs some concept of linking. And CIDs are such a concept. And the same people do maintain both of these things. But, honestly, the interface tells the truth: links just need a way to be loaded. There's no other fixation.
Links are a concept that has parameters. This necessitates a concept of a LinkBuilder, which can gather them into one place, so parts of the code that don't care about the parameters can handle them opaquely.
- CIDs have numerous parameters: the CID version, the multicodec, the multihash, and the multihash length.
- Traversals that want to build updates to a graph and simply keep using whatever kind of links were used before are an example of something that needs to handle link parameters opaquely.
Link creation and writing out the serial data are the same process.
- Creating a link requires hashing, and hashing requires a linearized sequence of bytes. There's no sense in doing the serialization for hashing and for storing separately, so, the interface makes both happen at the same time.
Writing out the serial data involves a series of callbacks (namely, ipld.Storer, which in turn must return an ipld.StoreCommitter).
- This is done so that writes can be streaming. We don't want the API to shoehorn us into a position where the whole data has to be stored in memory twice (e.g., in the reified Node from, and a second time in the serial about-to-be-but-not-yet-written form). This requires a two-phase approach, because (assuming storage is organized by hash) the hashing can't finish until the end; StoreCommitter is interface that results.
- StoreCommitter is necessarily quite different than just turning this all into a io.ReadCloser because it needs to take a Link parameter (in practice, this typically is used to do an mv operation to a filename path based on the link hash, or similar).
There's an ipld.LinkContext object that carries some additional data into the linking process.
- I want this Because Of Reasons -- I have an application in mind that would like to store different pieces of data in different directories depending on what type they are, even though inside each of those directories it still uses content-addressable path sharding, and LinkContext.Node makes it possible for me to do that.
- It's also quite nice for debugging and logging. Being able to see LinkContext.Path when in the middle of a large traversal can be very useful.
- There are other uses to this I probably haven't even imagined. Off the top of my head: Say one wanted to do a depth-limited traversal, for example: this could be implemented by having an ipld.Loader that returns traversal.SkipMe when LinkContext.Path gets beyond a certain length. (IPLD Selectors can be used to do this too, but maybe some sufficiently complex algorithm actually wants the callback here in order to do unusually interesting logic.)

Altogether, these concerns have not let us to a simple place.

Issue https://github.com/ipld/go-ipld-prime/issues/46 also discusses some other potentially interrelated concerns with how links works (but has a slightly different focus; it's mostly about how another application that's currently using go-cid concretely interacts with all this -- important, but narrower).

Potential ways forward:

go-cid could be reformed to have fewer transitive dependencies. This would make me happy. (It would also be a pretty significant undertaking, because a LOT of things rely on that repo; changes cannot be made willy-nilly.)
The interface satisfactions necessary for go-cid to serve as an ipld.Link (and ipld.LinkBuilder, etc) for go-ipld-prime could be moved to the go-cid repo/package, rather than living in go-ipld-prime/linking/cidlink. This would remove the cidlink compatibility package entirely, thus reducing the number of packages in play, and generally be significantly simpler. (This should be tractable; it just requires cross-repo coordination, so we want to be choosy about when we tackle it.)
Should we have a helper function for creating links but just /dev/null'ing the serial data? And/or another function for doing the same but returning a bytes.Buffer or just plain []byte? This would let a user get started with linking without being forced into immediately understanding the whole Storer+StoreCommitter situation. (This seems like a hard 'yes', and it's easy enough to do as a plain function that just supplies dummy thunks for most of the arguments of link building.)
Is... entangling the hashing for link creation and the output writer for the serialized content a mistake? (I really don't know how else this could be done, but maybe it's worth spending some more time trying to be imaginative.)
Is it reasonable for the loader and storer interfaces to assume you might be doing network IO, and thus have the context.Context parameter in play around all this? Maybe that's a proportionally rare need, and when it does come up, we could make users get their context in via closure over it in their callbacks? (I'm a bit afraid this might cause the use of more closures that cause more allocations to bind the context object, but maybe this isn't significant.)
Are there better ways to arrange this code that have the same semantics but is less confusing? (That's an open one. Specific ideas welcome, if someone's looking and spending braincycles here.)
Other? Your ideas here.

May 03 '20 05:05 warpfork

I'm thinking it might be nice to move some of the concepts to the go-ipld-prime/linking package -- Storer, Loader, StoreCommitter, and LinkBuilder all seem to make sense nudged over that way -- and then in tandem make a lot of helper methods in that package that string them all together in the most common ways.

The Link.Load method in particular could use a lot of attention, too. I think that has a significant amount of accidental complexity, and maybe together with more helper functions gathered in a linking package, we could iron it out. As it stands, implementers of that Link.Load method do far too much work: it's connecting all of the IO parts from the Loader to a codec (and all reference to the codec is internal) and binding the codec to the NodeAssembler, in addition to binding the IO stuff to a tee to a hasher. The last part of that is somewhat necessarily link-implementation-specific. The rest seems like it should be potentially pretty common and extractable.

The Link.Load method versus the ipld.Loader interface is also just a painfully confusing naming.

... on the other hand, I see why Link.Load got the way it did. For the cidlink.Link.Load implementation, that Load function is diving into the multicodec table. (That's also why it can't be trivially moved to the go-cid package -- it's not acceptable for go-cid to start having concrete references to codec implementations, either.) That might be tricky to detangle.

Looks like reviewing and rigorizing more interfaces around the codecs might actually be the first prerequisite step to further simplifications of what's going on in Link.Load.

May 03 '20 06:05 warpfork

For future quick reference -- the interfaces and types discussed above currently look like this (comments stripped for brevity):

type Link interface {
	Load(context.Context, LinkContext, NodeAssembler, Loader) error
	LinkBuilder() LinkBuilder
}
type LinkBuilder interface {
	Build(context.Context, LinkContext, Node, Storer) (Link, error)
}
type Loader func(lnk Link, lnkCtx LinkContext) (io.Reader, error)
type Storer func(lnkCtx LinkContext) (io.Writer, StoreCommitter, error)
type StoreCommitter func(Link) error
type LinkContext struct {
	LinkPath   Path
	LinkNode   Node
	ParentNode Node
}

May 03 '20 06:05 warpfork

I've drafted a rethink of this over in a gist: https://gist.github.com/warpfork/c0200cc4d99ee36ba5ce5a612f1d1a22

The tl;dr of it is that introducing a LinkSystem type and using it to hold a bunch of functions which choose codecs, hashers, and storage, seems to lead in a good direction: it's both clear where things compose (it's all centered on LinkSystem!), and, it means we have a place where we can put lots of helper methods that do the things end users want to do over the composed codecs+hashers+storage.

A lot of the rest of the earlier discussion and considerations still apply. LinkContext is still in there, for example. I simplified that a bit, though, by moving context.Context inside of LinkContext, so the total number of parameters to wrangle decreased. Also, it was possible to add helper methods that do a default LinkContext (and set it up with context.Background, etc) to the LinkSystem type, even further reducing the amount of user exposure to this detail.

Not sure if this will be a final draft to move forward with, but it seems like it's going in a good direction.

Sep 18 '20 12:09 warpfork

Could you show a quick example of how the simplest use of the new API would look like? A diff on the "intro to Go" I wrote, for example, could be a good way to visualize this.

Sep 22 '20 10:09 mvdan

go-ipld-prime go-ipld-prime copied to clipboard

API design review: linking: can we make it simpler?

go-ipld-prime
go-ipld-prime copied to clipboard