go-ds-s3
Add key transform functions for massive rate limit improvements
Without sharding, using S3 as an IPFS datastore is unreasonable from a performance and cost perspective.
This PR adds customizable sharding methods, matching the implementation of go-ds-flatfs and compatible with js-datastore-core (also js-datastore-s3).
This incorporates the suggestion made by @whyrusleeping https://github.com/ipfs/go-ds-s3/pull/195#issuecomment-900466233 to allow developers to decide the layout of their data on S3.
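For context, here is a minimal sketch of what a next-to-last/2 style key transform can look like in Go; the `nextToLast` helper, the padding character, and the example key are illustrative only, not the exact API added by this PR:

```go
package main

import (
	"fmt"
	"strings"
)

// nextToLast builds a shard function in the style of go-ds-flatfs's
// "next-to-last" scheme: the prefix is the n characters immediately
// before the last character of the key. Short keys are left-padded
// with '_' so every key still yields an n-character prefix.
func nextToLast(n int) func(key string) string {
	return func(key string) string {
		k := strings.TrimLeft(key, "/")
		if len(k) < n+1 {
			k = strings.Repeat("_", n+1-len(k)) + k
		}
		end := len(k) - 1
		return k[end-n : end]
	}
}

func main() {
	shard := nextToLast(2)
	key := "/CIQGFTQ7FSI2COUXWWLOQ45VUM2GUZCGAXLWCTOKKPGTUWPXHBNIVOY"
	// Objects would then live under "<prefix>/<key>" in the bucket,
	// e.g. s3://my-bucket/VO/CIQGFTQ7FSI2...
	fmt.Printf("%s/%s\n", shard(key), strings.TrimLeft(key, "/"))
}
```

Spreading keys across many prefixes like this lets S3 scale its per-prefix request rate limits, which is where the rate limit improvement comes from.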
Thank you for submitting this PR! A maintainer will be here shortly to review it. We are super grateful, but we are also overloaded! Help us by making sure that:
- The context for this PR is clear, with relevant discussion, decisions and stakeholders linked/mentioned.
- Your contribution itself is clear (code comments, self-review for the rest) and in its best form. Follow the code contribution guidelines if they apply.

Getting other community members to do a review would be a great help too on complex PRs (you can ask in the chats/forums). If you are unsure about something, just leave us a comment.

Next steps:
- A maintainer will triage and assign priority to this PR, commenting on any missing things and potentially assigning a reviewer for high priority items.
- The PR gets reviewed, discussed and approved as needed.
- The PR is merged by maintainers when it has been approved and comments addressed.

We currently aim to provide initial feedback/triaging within two business days. Please keep an eye on any labelling actions, as these will indicate priorities and status of your contribution. We are very grateful for your contribution!
I wasn't able to test this unless the module name matched my repo name, so that is one thing I will change (I may need some guidance to figure out how to do this properly, as I'm new to Go!)
@MichaelMure : are you able to review this?
From conversations with @whyrusleeping, it's my understanding that the next-to-last/2 method (and other filesystem sharding methods) was originally implemented to better scale Linux-based filesystems.
@achingbrain would you be able to share the reasons for sharding things out in the way you did in js-datastore-s3? Did you run into similar scalability issues or were you just trying to follow the existing filesystem sharding patterns of go-ipfs?
Also, @aschmahmann / @BigLep as we move towards go-ipfs 0.12.0, are there any considerations we need to take into account given the plan to "Switch to raw multihashes in datastores (including datastore version upgrade)" in the 0.12.0 upgrade? If possible, I'd love to make sure any potential compatibility issues are addressed pre-emptively here.
@obo20 see https://github.com/ipfs/go-ds-s3/issues/105 and https://github.com/ipfs/go-ds-s3/issues/199 (specifically the second one).
Oh, wait, sorry. You already know that.
For sharding blocks, "next to last" is probably fine. Although I'd personally just use a cheap hash function.
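As a rough sketch of that alternative (the function name, the FNV hash, and the prefix count here are my own illustration, not anything prescribed in this thread):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashShard sketches the "cheap hash" idea: hash the key and use the
// result modulo a fixed number of prefixes, so writes spread evenly
// across S3 key prefixes regardless of how the keys are encoded.
func hashShard(numPrefixes uint32) func(key string) string {
	return func(key string) string {
		h := fnv.New32a()
		h.Write([]byte(key))
		return fmt.Sprintf("%02x", h.Sum32()%numPrefixes)
	}
}

func main() {
	shard := hashShard(128) // e.g. 128 evenly filled prefixes
	fmt.Println(shard("CIQGFTQ7FSI2COUXWWLOQ45VUM2GUZCGAXLWCTOKKPGTUWPXHBNIVOY"))
}
```

The trade-off versus next-to-last is that the prefix can no longer be read off the key by eye, but the distribution no longer depends on the key encoding.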
> would you be able to share the reasons for sharding things out in the way you did in js-datastore-s3? Did you run into similar scalability issues or were you just trying to follow the existing filesystem sharding patterns of go-ipfs?
@jacobheun did the initial implementation so perhaps he can speak to the reason, but I get the feeling it was just following the principles set out by other datastore implementations.
> @jacobheun did the initial implementation so perhaps he can speak to the reason, but I get the feeling it was just following the principles set out by other datastore implementations.
This is correct: it was just following the patterns of the other JS datastore implementations at the time (2018), since it's just a lightweight interface to S3. @achingbrain I believe the sharding behavior changed in https://github.com/ipfs/js-datastore-s3/pull/30, but it's not managed by the S3 datastore itself.
@MichaelMure I believe this is ready for review. PTAL when you get the chance!
@BigLep given no word so far from @MichaelMure is there someone else who might be available to review this?
Similar response as in https://github.com/ipfs/go-ds-s3/pull/195#issuecomment-1013311478
Core IPFS maintainers aren't planning to take this on as we aren't using this ourselves. We need to figure out a way to set up and denote community membership for this and other datastore repos. We're game to set up others as the maintainers here.
Maybe pair up with @obo20 and review each other's code (@obo20 has #195 open) or ask for other volunteers in the ipfs-operators Slack channel?
@v-stickykeys : are you going to incorporate the feedback? This looks like a good change that is worth landing.
@v-stickykeys : are you going to be able to incorporate the feedback? We'd like to support getting this landed.
Hey @BigLep, the short answer is yes, but honestly it hasn't been at the top of my priority list and is now more of a weekend project. My PR as it stands meets our needs, but I do have another branch with improvements addressing the review, which I can try to get wrapped up and pushed in the next week.
@v-stickykeys : just checking in. Maintainers can take a look if you post a new revision with the review comments incorporated.
This would be great to have!
Hi guys.
I wrote the IPFS on AWS Fargate blog for AWS and I'm currently looking at making this S3 plugin work with the serverless architecture mentioned in the blog.
It would be a shame not to merge this.
Sharding in S3 is key to making this a viable option compared to traditional drives. With IPFS we're looking for high throughput; low-latency data access is less important.
Referencing this post from @mikeal, if we can achieve 40 GB/s of throughput per bucket we could run large clusters at a fraction of the cost. We can leverage S3 Infrequent Access or Intelligent-Tiering to save even more on cost.
I will try this branch and I'm willing to help move this forward.
@guseggert you've submitted multiple rounds of feedback. What is your assessment of the situation? It looks like there is only minor work left to finalize this PR.
@v-stickykeys are you still around?
@BigLep can you add me as a contributor to this PR?
@koxon sorry that no one answered this... are you still interested in picking this up?
Right now I don't see any upside in letting users configure the number of prefixes between next-to-last/2, /3, or whatever... we might as well set it to a number like 128, or whatever number maximizes the usage of S3, and leave it there, unless there is a good reason to use a worse number.