
Add optional parameter to specify a cloud bucket as an output location?

gotdan opened this issue 4 years ago · 1 comment

Open Questions:

Auth: Do we want to target only servers that have pre-configured write permissions to a bucket, or do we need a way to pass auth credentials to the server? If the latter, what would this look like?

Path: Do we need additional information beyond the bucket name, like a file prefix (eg. to support a "folder" within the bucket that incorporates a timestamp) or a service provider (to support use cases where the server writes to a bucket hosted by a different cloud vendor)?

Completion: Should we require that the output manifest file be written to the bucket last, so it can serve as an event that triggers follow-up actions (eg. a de-identification step or a database load)? Or would we expect clients to use job polling to determine that all files have been written?
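To make the proposal concrete, here's one hedged sketch of what a kick-off request with a bucket output location might look like. The `_outputBucket` parameter name and its URL-encoded `s3://` value are purely illustrative, not part of the Bulk Data spec:

```http
GET [fhir base]/Group/[id]/$export?_type=Patient&_outputBucket=s3%3A%2F%2Fexport-bucket%2Fexports%2F2020-09-03
Accept: application/fhir+json
Prefer: respond-async
```

The prefix segment (`exports/2020-09-03`) shows how a timestamped "folder" could be folded into the same parameter, which touches on the Path question above.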

gotdan avatar Sep 03 '20 14:09 gotdan

Re: auth and paths, I've been super impressed with the open-source rclone project, which has thought carefully and comprehensively about authorization for bucket access.

They have a JSON file describing their schema for cloud storage services, including provider types (e.g., s3) and the providers that offer endpoints for them (e.g., AWS, DigitalOcean, or Wasabi, all of which offer s3-compatible APIs). So I might have a remote configured like:

```json
{
    "access_key_id": "redacted",
    "acl": "private",
    "endpoint": "s3.wasabisys.com",
    "env_auth": "false",
    "provider": "Wasabi",
    "secret_access_key": "redacted",
    "type": "s3"
}
```

Anyway, if we wanted to standardize on how to convey access, the rclone config format is a great place to look.
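As a sketch of how a server might consume such a config, here's a minimal Python helper that resolves the storage endpoint from an rclone-style blob. The function name and the fallback endpoint are my own invention for illustration; rclone's actual provider-to-endpoint logic is richer:

```python
import json

def bucket_endpoint(config_json: str) -> str:
    """Hypothetical helper: derive a storage endpoint from an
    rclone-style config blob (mapping is illustrative only)."""
    cfg = json.loads(config_json)
    if cfg.get("type") != "s3":
        raise ValueError("only s3-compatible remotes handled in this sketch")
    # Providers like Wasabi or DigitalOcean supply their own endpoint;
    # plain AWS could fall back to a regional default.
    return cfg.get("endpoint") or "s3.amazonaws.com"

cfg = '{"type": "s3", "provider": "Wasabi", "endpoint": "s3.wasabisys.com"}'
print(bucket_endpoint(cfg))  # prints "s3.wasabisys.com"
```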

We might also try to profile some "common denominator" of shared access signatures / signed URLs at the bucket level.
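For a sense of what that common denominator involves, here's a stdlib-only sketch of AWS's Signature V4 query-string signing (the mechanism behind S3 presigned URLs). The bucket, key, region, and credentials are invented, and a real profile would also need to cover Azure SAS tokens and GCS signed URLs:

```python
import hashlib
import hmac
import datetime
from urllib.parse import quote

def presign_get(bucket, key, region, access_key, secret_key, expires=3600):
    """Build a SigV4 presigned GET URL for an S3 object (sketch only;
    production code should use the vendor SDK)."""
    host = f"{bucket}.s3.{region}.amazonaws.com"
    t = datetime.datetime.now(datetime.timezone.utc)
    amz_date = t.strftime("%Y%m%dT%H%M%SZ")
    datestamp = t.strftime("%Y%m%d")
    scope = f"{datestamp}/{region}/s3/aws4_request"
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    query = "&".join(f"{k}={quote(v, safe='')}" for k, v in sorted(params.items()))
    # Canonical request: method, path, query, headers, signed headers, payload hash
    canonical = "\n".join(
        ["GET", f"/{key}", query, f"host:{host}\n", "host", "UNSIGNED-PAYLOAD"]
    )
    to_sign = "\n".join(
        ["AWS4-HMAC-SHA256", amz_date, scope,
         hashlib.sha256(canonical.encode()).hexdigest()]
    )
    # Derive the signing key by chaining HMACs over date, region, service
    k = f"AWS4{secret_key}".encode()
    for part in (datestamp, region, "s3", "aws4_request"):
        k = hmac.new(k, part.encode(), hashlib.sha256).digest()
    sig = hmac.new(k, to_sign.encode(), hashlib.sha256).hexdigest()
    return f"https://{host}/{key}?{query}&X-Amz-Signature={sig}"

url = presign_get("export-bucket", "manifest.json", "us-east-1",
                  "AKIDEXAMPLE", "not-a-real-secret")
```

The appeal of this style of grant for our use case is that the client (or an intermediary) could hand the server a time-limited, bucket-scoped URL instead of raw credentials, which would sidestep much of the Auth question above.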


... but even if we leave authorization out of band, I think having a way to point to a bucket would be lovely.

jmandel avatar Sep 03 '20 16:09 jmandel