
feat(iroh-bytes): Batch blob api

Open rklaehn opened this issue 1 year ago • 5 comments

Description

This adds a new API for creating blobs.

You create a batch. Within a batch you get all the usual operations for adding data, such as add_bytes, add_stream, and add_from_path. Notable differences from the existing API:

  • All operations work on individual blobs, so there is no way to add entire subdirectories with add_from_path
  • All operations return a TempTag instead of taking an option to set a tag

The way to use the API is to perform a complex operation within a batch and then, at the end, assign a (non-temporary) tag to the root(s) of the created data before dropping the batch.

It is possible to scan a directory and create a collection purely on the client side, so the code to traverse a directory can be removed from the node.

To allow the workflow described above, the tags client has been extended to allow manually setting a tag.
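The workflow described above can be illustrated with a toy in-memory model. This is a minimal sketch and not the iroh implementation: `ToyStore`, `ToyBatch`, `ToyTempTag`, `set_tag`, and `tag_root_example` are hypothetical names, the "hash" is std's `DefaultHasher` rather than a real content hash, and everything is synchronous. It only shows the lifetime rule: a blob stays alive while a temp tag or a permanent tag points at it.

```rust
use std::cell::RefCell;
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};
use std::rc::Rc;

/// Stand-in for a content hash (assumption: std hasher, not a real blob hash).
type ToyHash = u64;

/// Hypothetical in-memory store: blobs, temp tag refcounts, permanent tags.
#[derive(Default)]
struct ToyStore {
    blobs: HashMap<ToyHash, Vec<u8>>,
    temp_tags: HashMap<ToyHash, usize>,
    tags: HashMap<String, ToyHash>,
}

impl ToyStore {
    /// Remove every blob that no permanent tag and no temp tag protects.
    fn gc(&mut self) {
        let live: HashSet<ToyHash> = self
            .tags
            .values()
            .copied()
            .chain(self.temp_tags.keys().copied())
            .collect();
        self.blobs.retain(|hash, _| live.contains(hash));
    }
}

/// Protects one blob from gc for as long as the value is alive.
struct ToyTempTag {
    hash: ToyHash,
    store: Rc<RefCell<ToyStore>>,
}

impl Drop for ToyTempTag {
    fn drop(&mut self) {
        let mut store = self.store.borrow_mut();
        let count = store.temp_tags.get(&self.hash).copied().unwrap_or(0);
        if count <= 1 {
            store.temp_tags.remove(&self.hash);
        } else {
            store.temp_tags.insert(self.hash, count - 1);
        }
    }
}

/// Hypothetical batch handle: every add returns a temp tag, never a named tag.
struct ToyBatch {
    store: Rc<RefCell<ToyStore>>,
}

impl ToyBatch {
    fn add_bytes(&self, bytes: &[u8]) -> ToyTempTag {
        let mut hasher = DefaultHasher::new();
        bytes.hash(&mut hasher);
        let hash = hasher.finish();
        let mut store = self.store.borrow_mut();
        store.blobs.insert(hash, bytes.to_vec());
        *store.temp_tags.entry(hash).or_insert(0) += 1;
        ToyTempTag { hash, store: Rc::clone(&self.store) }
    }
}

/// Stand-in for the extended tags client: set a tag manually from a hash.
fn set_tag(store: &Rc<RefCell<ToyStore>>, name: &str, hash: ToyHash) {
    store.borrow_mut().tags.insert(name.to_string(), hash);
}

/// Add two blobs, tag only the root, drop the batch, then gc.
/// Returns (remaining blobs, permanent tags).
fn tag_root_example() -> (usize, usize) {
    let store = Rc::new(RefCell::new(ToyStore::default()));
    let batch = ToyBatch { store: Rc::clone(&store) };
    let root = batch.add_bytes(b"root data");
    let scratch = batch.add_bytes(b"intermediate data");
    set_tag(&store, "my-data", root.hash);
    drop(root);
    drop(scratch);
    drop(batch);
    store.borrow_mut().gc();
    let store = store.borrow();
    (store.blobs.len(), store.tags.len())
}
```

In this model, anything created in the batch that is not reachable from a permanent tag by the time the temp tags are dropped becomes eligible for garbage collection, which is why the root is tagged before the batch goes away.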

Ideally this API would entirely replace the current blobs API, so all read ops would always happen within the context of a batch.

Breaking Changes

At this point the PR mostly adds things, but it does change the RPC API for setting tags. I might decide to remove the entire non-batch mutation API in this PR.

It works by leaving a streaming RPC call open for each batch and then performing operations in the context of a unique identifier for that RPC call.
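The per-batch bookkeeping this implies on the node side might look roughly like the sketch below. It is an assumption-laden illustration, not the actual RPC code: `BatchId`, `ToyNode`, and its methods are hypothetical, hashes are plain `u64` values, and the open streaming call is modeled as explicit `open_batch` and `close_batch` calls rather than a real stream.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical identifier handed out when a batch's streaming call is opened.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct BatchId(u64);

/// Hypothetical node-side state: which hashes each open batch protects.
#[derive(Default)]
struct ToyNode {
    next_id: u64,
    batches: HashMap<BatchId, HashSet<u64>>,
}

impl ToyNode {
    /// Models the client opening the long-lived streaming call.
    fn open_batch(&mut self) -> BatchId {
        let id = BatchId(self.next_id);
        self.next_id += 1;
        self.batches.insert(id, HashSet::new());
        id
    }

    /// Models an add operation executed in the context of `batch`.
    fn protect(&mut self, batch: BatchId, hash: u64) -> Result<(), String> {
        match self.batches.get_mut(&batch) {
            Some(hashes) => {
                hashes.insert(hash);
                Ok(())
            }
            None => Err(format!("unknown batch {:?}", batch)),
        }
    }

    /// Models the streaming call ending: everything the batch protected is
    /// released. Returns how many hashes were released.
    fn close_batch(&mut self, batch: BatchId) -> usize {
        self.batches.remove(&batch).map_or(0, |hashes| hashes.len())
    }

    /// True if any open batch still protects `hash`.
    fn is_protected(&self, hash: u64) -> bool {
        self.batches.values().any(|hashes| hashes.contains(&hash))
    }
}
```

Keying the state by an identifier tied to the open call means the node can clean up a batch as soon as the call ends, including when the client disappears without closing it explicitly.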

Notes & open questions

Note: if things work out, every add operation refers to a single blob, and the aggregation of many blobs can be driven from the client. That means a lot of the complexity of the progress events, such as ids, can be removed from the RPC. That complexity still needs to exist, but it can be confined to the client.
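One way to picture progress bookkeeping that lives only on the client is sketched below. This is purely illustrative: `BlobProgress` and `ClientAggregator` are hypothetical types, and the assumption is that each single-blob add reports only a byte count, with the client assigning its own local ids and summing across blobs.

```rust
use std::collections::BTreeMap;

/// Progress of one single-blob add operation (assumed shape, not an iroh type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct BlobProgress {
    done: u64,
    total: u64,
}

/// Hypothetical client-side aggregator: ids are local and never sent over RPC.
#[derive(Default)]
struct ClientAggregator {
    next_id: u64,
    ops: BTreeMap<u64, (String, BlobProgress)>,
}

impl ClientAggregator {
    /// Register a new single-blob operation and return its client-local id.
    fn start(&mut self, name: &str, total: u64) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.ops
            .insert(id, (name.to_string(), BlobProgress { done: 0, total }));
        id
    }

    /// Record progress for one operation, clamped to its total size.
    fn update(&mut self, id: u64, done: u64) {
        if let Some((_, progress)) = self.ops.get_mut(&id) {
            progress.done = done.min(progress.total);
        }
    }

    /// Combined progress over all operations started so far.
    fn overall(&self) -> BlobProgress {
        let done: u64 = self.ops.values().map(|(_, p)| p.done).sum();
        let total: u64 = self.ops.values().map(|(_, p)| p.total).sum();
        BlobProgress { done, total }
    }
}
```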

Todo

  • [ ] ~~Add back fine grained progress~~ fine grained progress is not needed for all ops, but definitely for add_file and add_dir. Possibly for add_bytes and add_reader. Not sure if it is OK to have it in all cases despite it typically not being used.
  • [ ] ~~Purge all tag setting stuff from the blobs API and the downloader.~~

I would propose that we merge this initially as an addition, and do the stripdown of the other APIs in a subsequent PR. Also, if this is an addition we can do the fine grained progress for add_dir in a subsequent PR as well.

Change checklist

  • [x] Self-review.
  • [x] Documentation updates if relevant.
  • [x] Tests if relevant.
  • [x] All breaking changes documented.

rklaehn, Jun 03 '24 14:06

Here is the preliminary API for batches:

pub async fn add_bytes(&self, bytes: impl Into<Bytes>, format: BlobFormat) -> Result<TempTag> {
pub async fn add_file(&self, path: PathBuf, import_mode: ImportMode, format: BlobFormat) -> Result<(TempTag, u64)> {
pub async fn add_dir(&self, root: PathBuf, import_mode: ImportMode, wrap: WrapOption) -> Result<TempTag> {
pub async fn add_collection(&self, collection: Collection) -> Result<TempTag> {
pub async fn add_stream(&self, mut input: impl Stream<Item = io::Result<Bytes>> + Send + Unpin + 'static, format: BlobFormat) -> Result<TempTag> {
pub async fn add_blob_seq(&self, iter: impl Iterator<Item = Bytes>) -> Result<TempTag> {
pub async fn temp_tag(&self, content: HashAndFormat) -> Result<TempTag> {

Basically very similar to the normal blobs API, but there are no options to create tags. Instead, every fn returns a temp tag for the thing that was created, which the user can later assign to a permanent tag (or not).

The tags API has been extended to allow creating a tag given a hash and format.

Many of these functions are convenience functions. Most notably, add_dir now traverses the file system on the client side and makes multiple add_file calls.
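A client-side traversal of this kind could be sketched as follows. This is not the code from the PR: `collect_files` and `add_dir_client_side` are hypothetical helpers, the per-file add is passed in as a closure so the sketch stays independent of the real async `add_file`, and the result is a plain list of (relative name, value) pairs standing in for a collection.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Walk `root` and return every non-directory path in sorted order.
/// Note: `is_dir` follows symlinks, so a symlink cycle would not terminate.
fn collect_files(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    let mut pending = vec![root.to_path_buf()];
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                pending.push(path);
            } else {
                files.push(path);
            }
        }
    }
    files.sort();
    Ok(files)
}

/// Call `add_file` once per file under `root` and pair each result with the
/// file's path relative to `root`, using `/` as the separator.
fn add_dir_client_side<T>(
    root: &Path,
    mut add_file: impl FnMut(&Path) -> io::Result<T>,
) -> io::Result<Vec<(String, T)>> {
    let mut entries = Vec::new();
    for path in collect_files(root)? {
        let name = path
            .strip_prefix(root)
            .map_err(|e| io::Error::new(io::ErrorKind::Other, e))?
            .to_string_lossy()
            .replace('\\', "/");
        let added = add_file(&path)?;
        entries.push((name, added));
    }
    Ok(entries)
}
```

In the batch API the closure would return the temp tag of each added file, and the resulting name and hash pairs would then be stored as a collection whose own temp tag is the root to assign a permanent tag to.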

rklaehn, Jun 04 '24 09:06

Should there be equivalent delete versions as well?

dignifiedquire, Jun 04 '24 09:06

> Should there be equivalent delete versions as well?

WDYM? You delete stuff by ensuring that it is no longer tagged in some way, then GC will take care of it. There is blob delete, but that is really a low level function that you should rarely use directly.

delete_blob will just do its thing no matter what temp tags there are, so it can live in the blobs API, not in the batch API.

rklaehn, Jun 04 '24 09:06

@rklaehn what's the state of this?

dignifiedquire, Jun 28 '24 09:06

> @rklaehn what's the state of this?

Just merged with main. I want to do another self-review, but I am currently trying to keep everything up to date with main so it does not bitrot...

rklaehn, Jun 28 '24 09:06

closing in favor of #2545

dignifiedquire, Jul 29 '24 10:07