
command/storage: add versioning support

Open kucukaslan opened this issue 2 years ago • 4 comments

This commit adds versioning support to s5cmd.

  • Added --all-versions flag to ls, rm, du and select subcommands to apply the operation to all versions of the objects.
  • Added --version-id flag to cat, cp/mv, rm, du and select subcommands to apply the operation to a specific version of the object.
  • Added bucket-version command to configure bucket versioning. Given a bucket name alone, it returns the versioning status of the bucket. Bucket versioning can be configured with set flag which only accepts.
  • Added --raw flag to cat and select subcommands. It disables the wildcard operations.

Note: Google Cloud Storage uses a different approach for versioning, so with the current implementation s5cmd cannot use or retrieve generation numbers. However, the bucket-version command and the du command with the all-versions flag work as expected since they do not use version ids.

Fixes: #218 Fixes: #386

kucukaslan avatar Jul 26 '22 14:07 kucukaslan

Status as of July 26 (Outdated):

  • add all-versions flag to following subcommands:
    • [x] ls
    • [ ] rm (only with wildcards; does not delete delete markers)
    • [x] du
  • add version-id flag to following subcommands:
    • [ ] cp/mv
    • [x] cat
    • [x] rm
    • [x] du
  • format outputs
    • [ ] ls ...

Background

You may refer to https://github.com/peak/s5cmd/issues/386#issuecomment-1176069705 for the background of these changes.

Current problem (I'm trying to solve):

rm uses the expandSource method, which handles keys differently depending on whether wildcards are used. So it doesn't work when all versions of a particular key are to be deleted: it just puts a delete marker, though rm successfully deletes objects when wildcards are used.
To fix this, we need to pass the value of the "all-versions" flag to expandSource. Hence I propose to

  • put the value of the all-versions flag as a field in the URL (instead of passing it to the s3 object and using it in the s3.List method).
  • change the type of the src & dst fields of commands (Copy, Delete etc.) from string to URL accordingly.

Alternatively, we can add new parameters to expandSource method to pass all-versions flag.
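A minimal sketch of the first proposal, to make the idea concrete. The URL type, its field names and the needsListing helper below are hypothetical stand-ins and not the actual s5cmd code; the sketch also assumes that the non-wildcard path is the one that skips listing, which is how the problem description above reads.

```go
package main

import (
	"fmt"
	"strings"
)

// URL is a simplified, hypothetical stand-in for s5cmd's url.URL.
// The two versioning fields are the ones the proposal would add.
type URL struct {
	Raw         string // e.g. "s3://bucket/key"
	AllVersions bool   // would be set from --all-versions
	VersionID   string // would be set from --version-id
}

// hasWildcard reports whether the URL contains a glob character.
func (u URL) hasWildcard() bool {
	return strings.ContainsAny(u.Raw, "*?")
}

// needsListing sketches the decision an expandSource-like function could make
// once the flag travels with the URL: a plain key normally needs no listing,
// but with AllVersions set the versions must be listed so that each of them
// can be deleted individually.
func needsListing(u URL) bool {
	return u.hasWildcard() || u.AllVersions
}

func main() {
	urls := []URL{
		{Raw: "s3://bucket/key"},
		{Raw: "s3://bucket/key", AllVersions: true},
		{Raw: "s3://bucket/*"},
	}
	for _, u := range urls {
		fmt.Printf("%s all-versions=%t list=%t\n", u.Raw, u.AllVersions, needsListing(u))
	}
}
```

Carrying the flag on the URL keeps the signature of the expanding function unchanged, which is the main difference from the alternative of adding a new parameter.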

Note You can refer to Kucukaslan@df94602 to see the kind of changes I intend to make in the code, with rm.go and Delete as an example.

Example usage syntax

s5cmd ls --all-versions s3://bucket/

s5cmd rm --all-versions "s3://bucket/*"
s5cmd rm --all-versions s3://bucket/key

s5cmd du --all-versions "s3://bucket/*"


s5cmd cat --version-id smUtf8Thng s3://bucket/key

s5cmd du --version-id smUtf8Thng s3://bucket/key

s5cmd rm --version-id smUtf8Thng s3://bucket/key

kucukaslan avatar Jul 26 '22 14:07 kucukaslan

Up to date status

I've made the changes to the Command objects and url.URL I mentioned earlier.

Implementation

  • add all-versions flag to following subcommands:
    • [x] ls ( including delete markers)
    • [x] rm ( including delete markers)
    • [x] du
    • [x] select
  • add version-id flag to following subcommands:
    • [x] cp/mv
    • [x] cat
    • [x] rm
    • [x] du
    • [x] select
  • Added bucket-version command to configure bucket versioning.
    • [x] get status
    • [x] set bucket versioning

Output formats

  • [x] cp/mv: Didn't change. It is ambiguous whether the version-id to be printed belongs to the source or the destination.
  • [x] cat: Didn't change. It should print the content of the file.
  • [x] du: Didn't change. It is generally used for multiple objects and returns their total disk usage. ls can be used with all-versions to see the size of each version.
  • [x] select: Didn't change. It should print the result of the query.
  • all-versions flag :
    • [x] ls ( including delete markers)
      Example
        2022/08/10 09:53:03 3171 log/log.go 3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1J0109VB3PCPEL0PO=
        2022/08/10 09:53:28 23 log/log.go 3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1K01TF8AJ9OV6NF7O=
        {"key":"s3://mcks5cmd/log/log.go","etag":"b96979fea4ce57766596e47d1b6cc5e1","last_modified":"2022-08-10T09:53:03.124Z","type":"file","size":3171,"storage_class":"STANDARD","version_id":"3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1J0109VB3PCPEL0PO="}
        {"key":"s3://mcks5cmd/log/log.go","etag":"05f2faf2442033698d1aa6778ca70c1b","last_modified":"2022-08-10T09:53:28.325Z","type":"file","size":23,"storage_class":"STANDARD","version_id":"3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1K01TF8AJ9OV6NF7O="}
    • [x] rm ( including delete markers)
      Example
        rm s3://mcks5cmd/log/log.go 3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1J0109VB3PCPEL0PO=
        rm s3://mcks5cmd/log/log.go 3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1N03KIF6K5MOK5168=
        {"operation":"rm","success":true,"source":"s3://mcks5cmd/log/log.go","version_id":"3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1J0109VB3PCPEL0PO="}
        {"operation":"rm","success":true,"source":"s3://mcks5cmd/log/log.go","version_id":"3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1N03KIF6K5MOK5168="}
  • version-id:
    • [x] rm
      Example
        rm s3://mcks5cmd/log/log.go 3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1K01TF8AJ9OV6NF7O=
        {"operation":"rm","success":true,"source":"s3://mcks5cmd/log/log.go","version_id":"3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1J0109VB3PCPEL0PO"}

Tests

  • [x] prepare versioning setup for gofakes3
  • testing all-versions flag
    • [x] ls & rm
    • [x] du
  • testing version-id flag
    • [x] cp/mv
    • [x] cat
    • [x] rm
    • [x] du
  • command validations
    • [x] cannot use both of the flags together
    • [x] these flags are only meaningful for remote files
    • [x] make validation checks reusable
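The validation rules listed above could be captured in one reusable helper along the following lines. This is only a sketch: the versioningFlags type, the validateVersioning function, the error messages and the "s3://" prefix check are illustrative assumptions, not the helpers actually added in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// versioningFlags is a hypothetical container for the two flag values.
type versioningFlags struct {
	AllVersions bool   // --all-versions
	VersionID   string // --version-id
}

// validateVersioning applies the two rules from the checklist:
//  1. the flags cannot be used together;
//  2. the flags only make sense for remote (s3://) sources.
func validateVersioning(f versioningFlags, srcs ...string) error {
	if f.AllVersions && f.VersionID != "" {
		return errors.New("--all-versions and --version-id cannot be used together")
	}
	if !f.AllVersions && f.VersionID == "" {
		return nil // no versioning flag given, nothing to check
	}
	for _, s := range srcs {
		// Assumption for this sketch: anything without an s3:// prefix is local.
		if !strings.HasPrefix(s, "s3://") {
			return fmt.Errorf("versioning flags are only supported for remote objects: %q", s)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateVersioning(versioningFlags{AllVersions: true, VersionID: "smUtf8Thng"}, "s3://bucket/key"))
	fmt.Println(validateVersioning(versioningFlags{VersionID: "smUtf8Thng"}, "./local.txt"))
	fmt.Println(validateVersioning(versioningFlags{VersionID: "smUtf8Thng"}, "s3://bucket/key"))
}
```

Keeping the checks in one function lets every subcommand that accepts the flags call it from its own validation step instead of repeating the conditions.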

Google Cloud

Warning Google Cloud Storage uses a different approach for versioning, so with the current implementation s5cmd cannot use or retrieve generation numbers. However, du with the all-versions flag works as expected since it does not use version ids, and ls lists object metadata except for the generation numbers etc.

Commentary & Known Issues & Discussion Topics

  • The gofakes3 package that we use in our tests supports versioning only with the in-memory backend, so I've added another method to set up the fake server.
  • There was a bug when I tried to delete from the gofakes3 server using s5cmd rm. Despite using the version-id/all-versions flags, the server did not permanently delete the corresponding objects and just added delete markers to them. Interestingly:
    • this bug does not happen when I use aws s3api delete-object against the gofakes3 server.
    • this bug does not happen when I use s5cmd rm against a real AWS S3 server.
    • other subcommands of s5cmd work as expected. I'm currently trying to identify the root cause of this bug and fix it.

Note It turned out that gofakes3 does not support multi-delete for versioned objects. For now we've fixed it in igungor's fork with https://github.com/igungor/gofakes3/pull/6. We also have a PR to fix it upstream: https://github.com/johannesboyne/gofakes3/pull/69.

  • I do not distinguish between objects and delete markers when the all-versions flag is used.
    • Should we show the distinction in outputs?
    • Should we require yet another flag to take delete markers into account (and ignore them otherwise)?

Note We continue not to distinguish between objects and delete markers in this case. No special flag.

  • Both S3 keys and object version ids have a maximum length of 1024 bytes (UTF-8 string). Aligning the VersionID and Key columns in the output could therefore require a lot of whitespace, especially because we don't know in advance what their respective maximum lengths will be. Should we apply adaptive alignment, that is, align each column according to the longest element seen so far?

Note We will only left-align the key with a fixed width of "50" (?) characters and append the versionID (prefixed with a space) to it. For comparison: aws s3api prints out JSON; mc has an example output here.
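The fixed-width decision can be illustrated with Go's fmt padding verbs. This is a sketch only: the function name and the keyWidth constant are made up for illustration, and the width of 50 is the tentative value from the note above.

```go
package main

import "fmt"

// keyWidth is the tentative fixed column width mentioned in the note.
const keyWidth = 50

// formatVersionedKey left-aligns the key in a keyWidth-wide column and appends
// the version id after a single space. Keys longer than keyWidth are not
// truncated; they simply push the version id further to the right.
func formatVersionedKey(key, versionID string) string {
	return fmt.Sprintf("%-*s %s", keyWidth, key, versionID)
}

func main() {
	fmt.Println(formatVersionedKey("log/log.go", "smUtf8Thng"))
	fmt.Println(formatVersionedKey("a/much/longer/key/that/exceeds/the/fixed/column/width.txt", "smUtf8Thng"))
}
```

Compared with adaptive alignment, a fixed width needs no knowledge of later entries, so each line can be printed as soon as it is listed.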

kucukaslan avatar Aug 03 '22 15:08 kucukaslan

Request for Comments

Configuring bucket versioning

Warning We decided to add a "bucket-version" command to configure bucket versioning. Removed bucket versioning related logic from the version command.

add set and get flags to version subcommand

Alternatively we can remove the get flag and use this syntax:

$ s5cmd version
v0.0.0-dev
$ s5cmd version s3://bucket
Bucket versioning for "bucket" is "Enabled"
$ s5cmd version --set Enabled s3://mcks5cmd
Bucket versioning for "bucket" is set to "Enabled"

ps. At the moment to get bucket versioning we need to write: $ s5cmd version --get s3://bucket

JSON marshalling storage.Object to display versionID

Warning We decided to add a VersionId field to storage.Object, just for this use case.

JSON marshalling should output version ids. But we marshal the storage.Object type https://github.com/peak/s5cmd/blob/3a49799e064477c49c252d4e807cc66de685c913/command/ls.go#L294 which does not have a versionID field https://github.com/peak/s5cmd/blob/3a49799e064477c49c252d4e807cc66de685c913/storage/storage.go#L105-L114
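A sketch of what adding such a field could look like. The Object type here is a trimmed-down stand-in and not the real storage.Object; the "version_id" key matches the JSON examples earlier in this thread, while the omitempty option is an assumption made so that output for unversioned listings stays unchanged.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Object is a trimmed-down, hypothetical stand-in for storage.Object.
type Object struct {
	Key  string `json:"key"`
	Size int64  `json:"size"`
	// VersionID is the new field; omitempty (an assumption of this sketch)
	// drops it from the output when no version id is known.
	VersionID string `json:"version_id,omitempty"`
}

// toJSON marshals an Object and returns the result as a string.
func toJSON(o Object) (string, error) {
	b, err := json.Marshal(o)
	return string(b), err
}

func main() {
	objects := []Object{
		{Key: "s3://bucket/key", Size: 23},
		{Key: "s3://bucket/key", Size: 23, VersionID: "smUtf8Thng"},
	}
	for _, o := range objects {
		s, err := toJSON(o)
		if err != nil {
			panic(err)
		}
		fmt.Println(s)
	}
}
```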

kucukaslan avatar Aug 08 '22 08:08 kucukaslan

On the Google Cloud Storage

It has generation numbers analogous to S3 version ids. Its REST API uses a generation tag, while AWS S3 uses a VersionId tag. So the (un)marshalers in the AWS SDK do not handle the generation tag. Hence the SDK can neither read the generation number from the response nor send it with the request.
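The effect of the tag mismatch can be reproduced in isolation. The sketch below uses Go's standard encoding/xml rather than the AWS SDK's own XML code, and the two XML snippets, including the generation value, are made-up minimal examples rather than real service responses. It only shows the general mechanism: a type that declares a VersionId element silently ignores a Generation element.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// version mirrors only the part of an S3-style list entry that matters here.
// It knows about VersionId and nothing about Generation.
type version struct {
	Key       string `xml:"Key"`
	VersionID string `xml:"VersionId"`
}

// decode unmarshals one list entry; unknown elements are silently skipped.
func decode(doc string) (version, error) {
	var v version
	err := xml.Unmarshal([]byte(doc), &v)
	return v, err
}

func main() {
	// Illustrative snippets only, not captured from either service.
	s3Doc := `<Version><Key>log/log.go</Key><VersionId>smUtf8Thng</VersionId></Version>`
	gcsDoc := `<Version><Key>log/log.go</Key><Generation>1660125183124000</Generation></Version>`

	for _, doc := range []string{s3Doc, gcsDoc} {
		v, err := decode(doc)
		if err != nil {
			panic(err)
		}
		fmt.Printf("key=%q version_id=%q\n", v.Key, v.VersionID)
	}
}
```

In the second case the version id comes back empty with no error, so the caller has no direct signal that a generation number was present and dropped.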

I don't think that intervening in the marshalers' logic via request handlers would be an acceptable/practical solution, even if it were possible.

As a last resort I've tried to modify the AWS SDK to add Generation fields to the relevant types[^fork]. It helped to read generation numbers in List requests without breaking anything else; that is, ls --all-versions worked with GCS.

However, it still failed to use those generation numbers in requests; that is, the --version-id flag and rm/cp... --all-versions did not work. I've tried a few other modifications to the SDK, but none of them worked with GCS without breaking AWS S3.
Even if these attempts had been successful, upstream would not have accepted these changes and we would have needed to use a custom fork.

ps. I've used the first version of AWS-SDK-GO, but I'm not optimistic that using v2 (or its middlewares) would make any difference.

RFC: How should s5cmd act when versioning flags are used with Google Cloud endpoints?

Note Only the bucket-version command and the du command with the --all-versions flag work accurately with GCS.

  • Should it print an error and cancel the operation?
  • Should it print a warning and continue with the operation even though the result would not be what the user expected?

[^fork]: The attempt may be seen here.

kucukaslan avatar Sep 02 '22 13:09 kucukaslan

🥇

igungor avatar Jun 16 '23 09:06 igungor