warp icon indicating copy to clipboard operation
warp copied to clipboard

Add --prefixes flag for GCS and pre-sharded bucket

Open dbishop opened this issue 1 month ago • 8 comments

Community Contribution License

All community contributions in this pull request are licensed to the project maintainers under the terms of the Apache 2 license. By creating this pull request I represent that I have the right to license the contributions to the project maintainers under the Apache 2 license.

Description

We add a new flag, "--prefixes value", which is mutually exclusive with --prefix and whose value is a comma-separated list of one or more static prefix values. The case of --prefix and --prefixes with one value specified are identical (and in fact, --prefix is now implemented as a single-element --prefixes internally).

When multiple prefixes are given with --prefixes, each object within each thread has a static prefix chosen from the list in a round-robin fashion. Also, the listings for --list-existing include all the prefixes, "chaining" the prefix listings together.

This change preserves the existing behavior of the --prefix and --noprefix flags.

The interactions of the flags are complicated, so a new unit test was added to cover the various combinations.

Motivation and Context

There are 2 use-cases for a static set of object prefixes ("folders"):

  1. GCS makes deleting "folders" difficult, at least with the heirarchical optimization feature enabled, and S3 API delete calls can't delete them. So if Warp generates a fresh one for every thread that ever runs, it makes cleaning up the bucket super tedious, even though the bucket has zero objects.
  2. Testing a major CSP's object storage involved having them pre-shard the bucket on known prefixes to ensure a valid test. We needed Warp to support this, so patched in the support. (That patch was hackier, and this is a better version of it, though it worked fine for our usage.)

Our use-case 1 is now satisfied by specifying --prefixes with the desired count of top-level "folders" and --noprefix.

Our use-case 2 is now satisfied with just --prefixes (though --noprefix doesn't hurt it).

A quick note about the copyright lines: our legal requirement is to include it in any new file and add to any files with "substantial (e.g. not a trivial bug-fix, but a contribution worthy of recognition of ownership)" modifications. We use our best technical judgement here. In this case, it's one file with a small new function, one file with the round-robin implementation, and the file with the new unit test. They don't change the licensing of the submitted code (Apache-2) or what Minio can do with it or the public later (via the AGPL license), and they don't affect any existing code's copyright.

How to test this PR?

The new unit test gives the primary confidence, though I did functionally test the new behavior with a three-phase set of warp invocations (put, get, delete) with --prefixes=00,01,...,99 with and without --noprefix and verified that the resulting bucket folders & objects in GCS behaved as expected. Without --noprefix, the extra per-client-thread sub folders were left lying around as expected and I had to clean those up using the GCS console.

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Optimization (provides speedup with no functional changes)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • [ ] Fixes a regression (If yes, please add commit-id or PR # here)
  • [x] Unit tests added/updated
  • [x] Internal documentation updated
  • [ ] Create a documentation update request here

dbishop avatar Nov 20 '25 15:11 dbishop

@klauspost , I left the flag non-hidden as even though this one mentions GCS, but the other use-case was at the same tier-1 CSP where we hit the uint16 thread ID limit and needed to hit static pre-sharded top-level "folders" in our test bucket.

Let me know if you still think it should be hidden.

Thanks!

dbishop avatar Nov 20 '25 15:11 dbishop

Honestly, this seems a crazy amount of code for an extremely specific use-case.

klauspost avatar Nov 20 '25 16:11 klauspost

That's fair. I think the use-cases are valid, at least for using Warp for large-scale testing with CSPs. For the code volume & complexity, that naturally arose from maintaining seamless backward compatibility and not changing any existing flags or their semantics.

I think the additional unit test mitigates the complexity's risk. How do you see these portions of code changing over time (i.e. how would this extra code in this patch potentially affect development velocity in the future)?

I'd really like to not have to carry this patch around on our side, and I think it provides some value to all warp users (again, in some specific circumstances).

Is there a simpler way for Warp to support these 2 use-cases?

Thanks!

dbishop avatar Nov 20 '25 17:11 dbishop

@klauspost, Hi, I did find an additional improvement to be had in refactoring the logic for list-existing between bench.{delete,get,stat} and did so with a helper function and options struct. I think that's a strict improvement in complexity and duplicated logic and offsets the fact that the single location of list-existing logic now handles multiple prefixes.

I also rebased the PR branch to keep it clean.

Thanks!

dbishop avatar Nov 24 '25 15:11 dbishop

@klauspost Added one more tweak to clean up the separation of concerns between generator.{generator,random}. Now all the prefix-calculation is in generator.random, not spread across the two files. This will be simpler and easier to maintain.

dbishop avatar Nov 24 '25 19:11 dbishop

@klauspost Hi, any thoughts on this patch after the 2 adjustments last week?

Thanks, Darrell

dbishop avatar Dec 01 '25 21:12 dbishop

@dbishop Honestly I am not convinced, and while I do appreciate the effort without knowing more it seems to add a huge amount of complexity A) For a very edge case and B) That adds a lot of code for us to maintain.

I will leave the decision to @harshavardhana - but my overall thought is that this is something you should keep in a fork.

klauspost avatar Dec 02 '25 08:12 klauspost

@klauspost @harshavardhana Hedging my bets on this PR, I created #434 to just cover the --list-existing refactoring. I think that's an objective improvement in maintainability, even if not considering this patch.

dbishop avatar Dec 02 '25 16:12 dbishop

Closing as per ^^^

klauspost avatar Dec 18 '25 16:12 klauspost