s5cmd
command/storage: add versioning support
This commit adds versioning support to s5cmd.
- Added `--all-versions` flag to `ls`, `rm`, `du` and `select` subcommands to apply the operation to all versions of the objects.
- Added `--version-id` flag to `cat`, `cp`/`mv`, `rm`, `du` and `select` subcommands to apply the operation to a specific version of the object.
- Added `bucket-version` command to configure bucket versioning. Given a bucket name alone, it returns the versioning status of the bucket. Bucket versioning can be configured with the `set` flag which only accepts.
- Added `--raw` flag to `cat` and `select` subcommands. It disables the wildcard operations.
Note: Google Cloud Storage uses a different approach for versioning. So with the current implementation, s5cmd cannot use or retrieve generation numbers. However, the `bucket-version` command and the `du` command with the `all-versions` flag work as expected since they do not use version ids.
Fixes: #218 Fixes: #386
Status as of July 26 (Outdated):
- add `all-versions` flag to the following subcommands:
  - [x] ls
  - [ ] rm (only with wildcards, does not delete delete markers)
  - [x] du
- add `version-id` flag to the following subcommands:
  - [ ] cp/mv
  - [x] cat
  - [x] rm
  - [x] du
- format outputs
  - [ ] ls ...
Background
You may refer to https://github.com/peak/s5cmd/issues/386#issuecomment-1176069705 for the background of these changes.
Current problem (I'm trying to solve):
`rm` uses the expandSource method, which handles keys differently depending on whether wildcards are used. So it doesn't work when `all-versions` of a particular key are to be deleted: it just puts a delete marker, though `rm` successfully deletes objects when wildcards are used.
To fix this, we need to pass the value of the "all-versions" flag to expandSource. Hence I propose to
- put the value of the all-versions flag as a field on the URL (instead of passing it to the s3 object and using it in the s3.List method).
- change the type of the src & dst fields of commands (Copy, Delete etc.) to URL (from string) accordingly.
Alternatively, we can add new parameters to the expandSource method to pass the all-versions flag.
Note You can refer to Kucukaslan@df94602 to see what kind of changes I intend to make in the code, with rm.go and Delete as an example.
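The proposal of carrying the flag on the URL itself can be sketched roughly as below. This is only an illustration under assumptions: the `URL` struct, the `WithVersionID`/`WithAllVersions` options and `expandStrategy` are hypothetical stand-ins, not the actual s5cmd types, and the returned strings merely name the kind of request a source expansion would issue.

```go
package main

import (
	"fmt"
	"strings"
)

// URL is a hypothetical, trimmed-down stand-in for s5cmd's url.URL.
type URL struct {
	Bucket      string
	Key         string
	VersionID   string // value of --version-id, empty if unset
	AllVersions bool   // value of --all-versions
}

// Option customizes a URL at construction time (hypothetical helper).
type Option func(*URL)

// WithVersionID records the value of the --version-id flag on the URL.
func WithVersionID(id string) Option { return func(u *URL) { u.VersionID = id } }

// WithAllVersions records the value of the --all-versions flag on the URL.
func WithAllVersions(all bool) Option { return func(u *URL) { u.AllVersions = all } }

// NewURL builds a URL and applies the given options in order.
func NewURL(bucket, key string, opts ...Option) *URL {
	u := &URL{Bucket: bucket, Key: key}
	for _, opt := range opts {
		opt(u)
	}
	return u
}

// hasWildcard reports whether the key contains a glob character.
func (u *URL) hasWildcard() bool { return strings.ContainsAny(u.Key, "*?") }

// expandStrategy names the request a source expansion would issue.
// Because the flag travels with the URL, no extra parameter is needed.
func expandStrategy(u *URL) string {
	switch {
	case u.AllVersions:
		return "list-versions" // even for a single key without wildcards
	case u.hasWildcard():
		return "list"
	default:
		return "stat"
	}
}

func main() {
	src := NewURL("bucket", "key", WithAllVersions(true))
	fmt.Println(expandStrategy(src))
}
```

With this shape, a command such as Delete would only need to hold the URL; the alternative of adding parameters to expandSource would instead thread the flag through every caller.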
Example usage syntax
s5cmd ls --all-versions s3://bucket/
s5cmd rm --all-versions "s3://bucket/*"
s5cmd rm --all-versions s3://bucket/key
s5cmd du --all-versions "s3://bucket/*"
s5cmd cat --version-id smUtf8Thng s3://bucket/key
s5cmd du --version-id smUtf8Thng s3://bucket/key
s5cmd rm --version-id smUtf8Thng s3://bucket/key
Up to date status
I've made the changes to the Command objects and url.URL I mentioned earlier.
Implementation
- add `all-versions` flag to the following subcommands:
  - [x] ls (including delete markers)
  - [x] rm (including delete markers)
  - [x] du
  - [x] select
- add `version-id` flag to the following subcommands:
  - [x] cp/mv
  - [x] cat
  - [x] rm
  - [x] du
  - [x] select
- Added `bucket-version` command to configure bucket versioning.
  - [x] get status
  - [x] set bucket versioning
Output formats
- [x] cp/mv: Didn't change. It is ambiguous whether the version-id that should be printed belongs to the source or the destination.
- [x] cat: Didn't change. It should print the content of the file.
- [x] du: Didn't change. It is generally used for multiple objects and returns their total disk usage. `ls` can be used with `all-versions` to see the size of each version.
- [x] select: Didn't change. It should print the result of the query.
- `all-versions` flag:
  - [x] ls (including delete markers)

    Example

        2022/08/10 09:53:03 3171 log/log.go 3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1J0109VB3PCPEL0PO=
        2022/08/10 09:53:28 23 log/log.go 3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1K01TF8AJ9OV6NF7O=
        {"key":"s3://mcks5cmd/log/log.go","etag":"b96979fea4ce57766596e47d1b6cc5e1","last_modified":"2022-08-10T09:53:03.124Z","type":"file","size":3171,"storage_class":"STANDARD","version_id":"3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1J0109VB3PCPEL0PO="}
        {"key":"s3://mcks5cmd/log/log.go","etag":"05f2faf2442033698d1aa6778ca70c1b","last_modified":"2022-08-10T09:53:28.325Z","type":"file","size":23,"storage_class":"STANDARD","version_id":"3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1K01TF8AJ9OV6NF7O="}

  - [x] rm (including delete markers)

    Example

        rm s3://mcks5cmd/log/log.go 3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1J0109VB3PCPEL0PO=
        rm s3://mcks5cmd/log/log.go 3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1N03KIF6K5MOK5168=
        {"operation":"rm","success":true,"source":"s3://mcks5cmd/log/log.go","version_id":"3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1J0109VB3PCPEL0PO="}
        {"operation":"rm","success":true,"source":"s3://mcks5cmd/log/log.go","version_id":"3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1N03KIF6K5MOK5168="}

- `version-id`:
  - [x] rm

    Example

        rm s3://mcks5cmd/log/log.go 3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1K01TF8AJ9OV6NF7O=
        {"operation":"rm","success":true,"source":"s3://mcks5cmd/log/log.go","version_id":"3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1J0109VB3PCPEL0PO"}
Tests
- [x] prepare versioning setup for gofakes3
- testing `all-versions` flag
  - [x] ls & rm
  - [x] du
- testing `version-id` flag
  - [x] cp/mv
  - [x] cat
  - [x] rm
  - [x] du
- command validations
  - [x] cannot use both of the flags together
  - [x] these flags are only meaningful for remote files
  - [x] make validation checks reusable
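The command validations listed above could take a shape like the following reusable check. It is a minimal sketch, not the code in this PR: `validateVersionFlags`, `isRemote`, both error values, and the assumption that remote sources always start with `s3://` are introduced here purely for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Hypothetical sentinel errors for the two validation rules.
var (
	errBothFlags  = errors.New("--all-versions and --version-id cannot be used together")
	errLocalInput = errors.New("versioning flags are only supported for remote objects")
)

// isRemote is a simplification: it treats any "s3://" source as remote.
func isRemote(src string) bool { return strings.HasPrefix(src, "s3://") }

// validateVersionFlags can be shared by every subcommand that accepts
// the versioning flags (ls, rm, du, cat, ...).
func validateVersionFlags(allVersions bool, versionID string, srcs ...string) error {
	if allVersions && versionID != "" {
		return errBothFlags
	}
	if !allVersions && versionID == "" {
		return nil // no versioning flag given, nothing to check
	}
	for _, src := range srcs {
		if !isRemote(src) {
			return fmt.Errorf("%w: %q", errLocalInput, src)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateVersionFlags(true, "smUtf8Thng", "s3://bucket/key"))
	fmt.Println(validateVersionFlags(true, "", "./local/file"))
	fmt.Println(validateVersionFlags(false, "smUtf8Thng", "s3://bucket/key"))
}
```

Keeping the rules in one function means each subcommand's own validation only has to forward its flag values and sources.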
Google Cloud
Warning Google Cloud Storage uses a different approach for versioning. So with the current implementation, s5cmd cannot use or retrieve generation numbers. However, with the `all-versions` flag, `du` works as expected since it does not use version ids, and `ls` lists object metadata except the generation numbers etc.
Commentary & Known Issues & Discussion Topics
- The `gofakes3` package that we use in our tests supports versioning only with the in-memory backend, so I've added another method to set up the fake server.
- There was a bug when I tried to delete from the gofakes3 server using `s5cmd rm`. Despite using the `version-id`/`all-versions` flags, the server did not permanently delete the corresponding objects and just added delete markers to them. Interestingly:
  - this bug does not happen when I use `aws s3api delete-object` to connect to the gofakes3 server.
  - this bug does not happen when I use `s5cmd rm` to connect to a real AWS S3 server.
  - other subcommands of s5cmd work as expected. I'm currently trying to identify the root cause of this bug and to fix it.

  Note It turned out that gofakes3 does not support multidelete for versioned objects. At the moment we've fixed it in igungor's fork with https://github.com/igungor/gofakes3/pull/6. We also have a PR to fix it upstream: https://github.com/johannesboyne/gofakes3/pull/69.
- I do not distinguish between objects and delete markers when the `all-versions` flag is used.
  - Should we show the distinction in outputs?
  - Should we require yet another flag to take delete markers into account (and ignore them otherwise)?

  Note We continue not to distinguish between objects and delete markers in this case. No special flag.
- Both S3 keys and object versions have a maximum length of 1024 bytes (UTF-8 string). It might potentially require a lot of whitespace to align the VersionID and Key columns in the output, especially because we don't know in advance what their respective maximum lengths will be. Should we apply adaptive alignment? I mean: each column is aligned according to the longest element so far.

  Note We will only left-align the key with a fixed width of "50" (?) characters and append the versionID (prefixed with a space) to it. aws s3api prints out JSON; mc has an example output here.
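The fixed-width decision in the note above might look like the sketch below. The width of 50 comes from the note (where it is still marked with a question mark); `formatKeyVersion` and `keyWidth` are hypothetical names, and the real `ls` output also carries date and size columns that are left out here.

```go
package main

import "fmt"

// keyWidth is the tentative fixed column width mentioned in the note.
const keyWidth = 50

// formatKeyVersion left-aligns the key in a keyWidth-wide column and
// appends the version id after a single space. Keys longer than
// keyWidth are not truncated; they simply push the version id right.
func formatKeyVersion(key, versionID string) string {
	if versionID == "" {
		return key
	}
	return fmt.Sprintf("%-*s %s", keyWidth, key, versionID)
}

func main() {
	fmt.Println(formatKeyVersion("log/log.go", "smUtf8Thng"))
	fmt.Println(formatKeyVersion("a/b/c.txt", "anotherVersion"))
}
```

Compared with adaptive alignment, this needs no knowledge of the other rows, so lines can be printed as soon as they arrive from the listing.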
Request for Comments
Configuring bucket versioning
Warning We decided to add "`bucket-version` command to configure bucket versioning. Removed bucket versioning related logic from the version command."
add `set` and `get` flags to `version` subcommand
Alternatively we can remove the get flag and use this syntax:
$ s5cmd version
v0.0.0-dev
$ s5cmd version s3://bucket
Bucket versioning for "bucket" is "Enabled"
$ s5cmd version --set Enabled s3://mcks5cmd
Bucket versioning for "bucket" is set to "Enabled"
ps. At the moment to get bucket versioning we need to write:
$ s5cmd version --get s3://bucket
JSON Unmarshall'ing storage.Object to display versionID
Warning We decided to add a VersionId field to storage.Object, just for this use case.
JSON marshalling should give version ids. But we marshal the storage.Object type (https://github.com/peak/s5cmd/blob/3a49799e064477c49c252d4e807cc66de685c913/command/ls.go#L294), which does not have a versionID field (https://github.com/peak/s5cmd/blob/3a49799e064477c49c252d4e807cc66de685c913/storage/storage.go#L105-L114).
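The effect of adding such a field can be illustrated with a small sketch. The `Object` struct below is a hypothetical, heavily trimmed stand-in for storage.Object; only the `version_id` JSON key is taken from the example outputs in this description, and the use of `omitempty` is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Object is a hypothetical, trimmed-down stand-in for storage.Object.
type Object struct {
	Key  string `json:"key"`
	Size int64  `json:"size"`
	// VersionID is omitted from the JSON output when it is empty, so
	// output for unversioned listings stays unchanged (an assumption).
	VersionID string `json:"version_id,omitempty"`
}

// toJSON marshals the object, returning an empty string on failure.
func toJSON(o Object) string {
	b, err := json.Marshal(o)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(toJSON(Object{Key: "s3://bucket/key", Size: 23, VersionID: "smUtf8Thng"}))
	fmt.Println(toJSON(Object{Key: "s3://bucket/key", Size: 23}))
}
```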
On the Google Cloud Storage
It has generation numbers analogous to S3 version ids.
In its REST API it uses a `generation` tag while AWS S3 uses a `VersionId` tag. So the (un)marshalers in the AWS SDK do not handle the generation tag. Hence the SDK can neither get the generation number from the response nor send it with the request.
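The mismatch can be demonstrated with the standard library alone. Note the assumptions: the AWS SDK uses its own unmarshaler rather than `encoding/xml`, and both payloads below are simplified, made-up fragments rather than real responses. The sketch only shows why a type that knows `VersionId` silently ignores a `Generation` element.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// version models an S3-style entry: it only knows the VersionId tag.
type version struct {
	Key       string `xml:"Key"`
	VersionID string `xml:"VersionId"`
}

// Simplified, illustrative payloads (not real service responses).
const (
	s3Payload  = `<Version><Key>log/log.go</Key><VersionId>smUtf8Thng</VersionId></Version>`
	gcsPayload = `<Version><Key>log/log.go</Key><Generation>1660125183</Generation></Version>`
)

// parseVersion decodes one entry; unknown elements are ignored.
func parseVersion(payload string) (version, error) {
	var v version
	err := xml.Unmarshal([]byte(payload), &v)
	return v, err
}

func main() {
	for _, p := range []string{s3Payload, gcsPayload} {
		v, err := parseVersion(p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		// For the second payload the version id comes back empty.
		fmt.Printf("key=%q version=%q\n", v.Key, v.VersionID)
	}
}
```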
I don't think that intervening in the marshaler's logic via request handlers would be an acceptable or practical solution, even if it were possible.
As a last resort I've tried to modify the AWS SDK to add `Generation` fields to the relevant types[^fork]. It helped to read generation numbers in List requests without breaking anything else, that is, `ls --all-versions` worked with GCS.
However, it still failed to use those generation numbers in requests, that is, the `--version-id` flag and `rm/cp... --all-versions` did not work. I've tried a few other modifications to the SDK, but none of them worked with GCS without breaking AWS S3.
Even if these attempts had been successful, upstream would not have accepted these changes and we would have needed to use a custom fork.
ps. I've used the first version of AWS-SDK-GO, but I'm not optimistic that using v2 (or its middlewares) would make any difference.
RFC: How should s5cmd act when versioning flags are used with Google Cloud endpoints?
Note Only the `bucket-version` command and the `du` command with the `--all-versions` flag work accurately with GCS.
- Should it print an error and cancel the operation?
- Should it print a warning and continue the operation even though the result would not be the one the user expected?
[^fork]: The attempt may be seen here.