rust-ipfs icon indicating copy to clipboard operation
rust-ipfs copied to clipboard

Remove `ipld` module in favour of `libipld` dependancy

Open chunningham opened this issue 2 years ago • 8 comments

This PR replaces the IPLD module of rust-ipfs with that of libipld. This was proposed in issue #54, but the issue was closed (I believe upon the emergence of ipfs-embed). This change would:

  • Improve the developer experience of implementing IPLD-based data models/applications which are independent of IPFS implementation
  • Reduce maintenance burden
  • Improving the developer focus on a single IPLD implementation
  • Allow developers building on rust-ipfs to make use of the improved functionality found in libipld

Current status: all changes required to build the project have been made. Some tests (specifically in unixfs) fail with Protobuf(UnexpectedEndOfBuffer) while writing to a buffer, which is being investigated. There are probably more areas which can be improved by deduplicating some code (e.g. dag and unixfs can benefit from using some libipld functionality instead of re-implementing it, particularly relating to manual protobuf manipulation).

Checklist

  • [x] Bitswap consumes libipld and does not re-implement Block
  • [x] Cid, Multihash, Multibase and other IPLD related implementations now come from libipld and are up to date
  • [x] ipld module entirely removed
  • [ ] ensure dag and unixfs don't contain functionality which is duplicated in libipld
  • [x] no performance regressions on benchmarks
  • [ ] ensure compatibility with Go and JS implementations
  • [ ] Additions to CHANGELOG.md files

chunningham avatar Jan 31 '22 11:01 chunningham

At one point I remember we adopted the current cut down cbor implementation for if I can remember correctly the ipld imposing the base traits upon the whole system, like how to store blocks. Has that changed?

I remembered of a similar issue; Cid/Multihash moving to using large allocationless storage instead of small Bytes allocations. Has there been any development on that front? I had a mental TODO on to investigate how this would change things for example for the unixfs, investigate other Cid implementations.

While using prost in unixfs would make sense in my mind instead of quick-protobuf, I don't think it is a good idea to build such functionality on top of ipld object model, which I assume this PR is planning to.

koivunej avatar Jan 31 '22 13:01 koivunej

Cid/Multihash moving to using large allocationless storage instead of small Bytes allocations.

This would indeed be a problem. The current versions of rust-multihash and rust-cid use fixed sized arrays. Though IPFS needs support arbitrarily sized hashes in order to support large identity Multihashes. One way would be to create an enum that wraps a rust-cid. It would have two variant, one for the arbitrarily sized (or something around the maximum block size) Multihash and all other ones, which then would be a rust-cid.

vmx avatar Jan 31 '22 13:01 vmx

One way would be to create an enum that wraps a rust-cid

This solution would still use even more space than the current rust-cid, as it would now need a discriminator on top of the rust-cid and this some smaller variant. I remember thinking something of the sort when the const-generic solution came about it's great if you are only producing content, not perhaps so much as when you are consuming content or following links, keeping a stack of Cids in memory.

koivunej avatar Jan 31 '22 14:01 koivunej

Yes. I also though if it would perhaps make sense to have CID implementation that is not optimized for no_std environments and would just use heap allocations and hence be way simpler than the current rust-cid and rust-multihash.

vmx avatar Jan 31 '22 14:01 vmx

@koivunej, libipld does indeed have a Store trait, which seems intended to act as the API for any IPFS implementation (assuming that's what you're referring to), however it's IMO not necessary to conform to it in this PR without good reason.

Similarly, while IMO it might be more maintainable to refactor unixfs to use libipld extensively, it's not strictly necessary, although I am curious about why it actually might be a bad idea (not familiar with its development history).

W.r.t. CIDs and Multihashes, I'm not well versed in exactly why things are the way they are implementation-wise, but the benchmark has improvements of ~20% for 1000 elements and ~50% for 10000 elements. Is this indicative of improvements in more recent versions of these crates?

chunningham avatar Jan 31 '22 14:01 chunningham

Similarly, while IMO it might be more maintainable to refactor unixfs to use libipld extensively, it's not strictly necessary, although I am curious about why it actually might be a bad idea (not familiar with its development history).

With the unixfs I was trying to find a way for a library to be minimal in compile-time dependencies as well as runtime dependencies. In the end, unixfs crate did not need a tight dependency to any one store definition, it can just process bytes which are fed to it. This was an alternative to the libipld way of at least back then embedding the store trait and making everything generic over the store trait. That simply does need to be taken into consideration at a unixfs library level.

Similarly, while IMO it might be more maintainable to refactor unixfs to use libipld extensively, it's not strictly necessary, although I am curious about why it actually might be a bad idea (not familiar with its development history).

Unixfs is just protobuf reading and writing, you don't need a central ipld object model.

W.r.t. CIDs and Multihashes, I'm not well versed in exactly why things are the way they are implementation-wise, but the benchmark has improvements of ~20% for 1000 elements and ~50% for 10000 elements. Is this indicative of improvements in more recent versions of these crates?

I'm not following what benchmark you are talking about. Could you link to it?

koivunej avatar Jan 31 '22 15:01 koivunej

That approach makes a lot of sense IMO, I was just thinking about using libipld to handle the protobuf parts to reduce dependancies and to make some things clearer (e.g. Block instead of (Cid, Vec<u8>)). I haven't looked closely yet at any deeper changes to unixfs.

I was referring to benches/hashed_map_cid.rs, although the unixfs/benches scenarios show a consistent ~-2.5% performance loss on this branch in all of them.

chunningham avatar Jan 31 '22 16:01 chunningham

the unixfs/benches scenarios show a consistent ~-2.5% performance loss on this branch in all of them.

I don't think dhat-rs existed back then when the ingest-tar was created, but I'd be surprised if you don't see any difference in memory usage.

(Cut a very long answer on benchmarking difficulties but maybe I should start up a blog once more :))

For the Ipld specific operations much more efficient ones could be implemented without the Ipld object model, as in, guiding the deserialization on the interesting path or links for example for references, similar to ipfs-unixfs. (Agreed, this is much easier done for ipfs-unixfs than dag-cbor or -json, but still the point stands.)

Users should be able to integrate with libipld without it's inclusion by implementing the Store (if that is even necessary). If there's something preventing that, we should fix it.

Another aspect which makes me uneasy about including a such a heavy dependency is that we are already heavily dependent on the rust-libp2p, and upgrading it already takes a long while. If we'd add more such dependencies, which would sometimes conflict, of equally great importance, it would make everything a balancing act between PR'ing different repos and coordinating for releases. We ended up in adopting the dag-cbor and -json to ease this sitatuation.

koivunej avatar Jan 31 '22 17:01 koivunej