bulk-data
Add optional parameter to specify a cloud bucket as an output location?
Open Questions:
Auth: Do we want to target only servers that have pre-configured write permissions to a bucket, or do we need a way to pass auth credentials to the server? If so, what would that look like?
Path: Do we need information beyond the bucket name, like a file prefix (e.g., to support a "folder" within the bucket that incorporates a timestamp) or a service provider (to support use cases where the server writes to a bucket provided by a different cloud vendor)?
Completion: Should we require that the output manifest file be written to the bucket last, so that it could be used as an event to trigger follow-up actions (e.g., a de-id or DB load), or would we expect clients to use job polling to determine that all files have been written?
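
As a straw man for the first two questions, the kick-off request could carry the bucket as a plain parameter. Here's a minimal client-side sketch; the outputBucket parameter name, the server URL, and the bucket path are all made up (nothing like this is in the spec today), and it assumes the server already holds write credentials for the bucket:

import requests

# Hypothetical bulk export kick-off naming a cloud bucket as the output
# location. "outputBucket" is an invented parameter, not part of the spec.
resp = requests.get(
    "https://fhir.example.org/Patient/$export",
    params={
        "_type": "Patient,Observation",
        "outputBucket": "s3://example-export-bucket/exports/2024-01-15/",
    },
    headers={
        "Accept": "application/fhir+json",
        "Prefer": "respond-async",
        "Authorization": "Bearer <access-token>",
    },
)

# Per the async pattern, the server replies 202 Accepted and the client
# polls the status endpoint in Content-Location until the export is done,
# which bears on the Completion question above.
print(resp.status_code, resp.headers.get("Content-Location"))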
Re: auth and paths, I've been super impressed with the open-source rclone project, which has thought carefully and comprehensively about authorization for bucket access. They have a JSON file describing their schema for cloud storage services, including provider types (e.g., s3) and providers which offer endpoints (e.g., AWS, DigitalOcean, or Wasabi, all of which offer S3-compatible APIs). So I might have a remote configured like:
{
  "access_key_id": "redacted",
  "acl": "private",
  "endpoint": "s3.wasabisys.com",
  "env_auth": "false",
  "provider": "Wasabi",
  "secret_access_key": "redacted",
  "type": "s3"
}
Anyway, if we wanted to standardize on how to convey access, the rclone config format is a great place to look.
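
To illustrate how little glue that would take, here's a minimal sketch of a server turning an rclone-style remote like the one above into an S3-compatible client. The file name, bucket, and object key are placeholders, and it assumes the boto3 SDK and an s3-type remote:

import json

import boto3  # AWS SDK for Python; works against any S3-compatible endpoint

# Assumption: the rclone-style remote above was saved as remote.json.
with open("remote.json") as f:
    remote = json.load(f)

if remote["type"] != "s3":
    raise ValueError("this sketch only handles s3-compatible remotes")

client = boto3.client(
    "s3",
    endpoint_url=f"https://{remote['endpoint']}",
    aws_access_key_id=remote["access_key_id"],
    aws_secret_access_key=remote["secret_access_key"],
)

# Write one exported NDJSON file, honoring the remote's ACL setting.
with open("Patient.ndjson", "rb") as body:
    client.put_object(
        Bucket="example-export-bucket",
        Key="exports/2024-01-15/Patient.ndjson",
        Body=body,
        ACL=remote["acl"],
    )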
We might also try to profile some "common denominator" of shared access signatures / signed URLs at the bucket level.
... but even if we leave authorization out of band, I think having a way to point to a bucket would be lovely.
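
For the signed-URL idea above, a minimal sketch of the S3 flavor (a pre-signed PUT, which in S3 is scoped per object rather than per bucket; bucket, key, and credentials below are placeholders):

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.wasabisys.com",
    aws_access_key_id="redacted",
    aws_secret_access_key="redacted",
)

# Time-limited URL granting PUT access to a single object; a client could
# hand a set of these to the server instead of sharing long-lived credentials.
url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "example-export-bucket", "Key": "exports/Patient.ndjson"},
    ExpiresIn=3600,  # one hour
)

Note that Azure's shared access signatures can be scoped to a whole container, so a true common denominator would have to reconcile per-object vs. per-bucket granting.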