Consider having bucket be a configuration parameter for `S3FileSystem`
Background
In S3, a bucket and a key are separate concepts. In the current design of s3fs, both are used together to define a full path. This poses a number of complications:
- When parsing `s3://bucket/path/to/file`, traditional URI schemes would have `bucket` be the host and the remainder be the path (see the sketch after this list). In `dask` we've ended up special casing `s3fs` and `gcsfs` when handling this parsing, which is unfortunate. If the bucket were part of the filesystem, this special casing would no longer be necessary.
- Buckets must be handled differently from paths. See e.g. how the code for `touch` (https://github.com/dask/s3fs/blob/master/s3fs/core.py#L793-L808) is split depending on whether the path is just a bucket or a full key. Buckets have different semantics, when in a filesystem all objects should have the same semantics. See `glob` for another example (https://github.com/dask/s3fs/blob/master/s3fs/core.py#L548-L549): you can't glob across buckets, because that would be all of S3.
- Certain S3 operations only work on objects all under the same bucket. See e.g. https://github.com/dask/s3fs/blob/master/s3fs/core.py#L708-L737
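To make the first point concrete, here's roughly what a standard URI parser does with an S3 URL (a standard-library sketch, not s3fs code):

```python
from urllib.parse import urlparse

parsed = urlparse("s3://bucket/path/to/file")
print(parsed.scheme)  # 's3'
print(parsed.netloc)  # 'bucket'        -- the host under normal URI rules
print(parsed.path)    # '/path/to/file' -- the remainder
# s3fs instead expects the combined 'bucket/path/to/file' form, which is
# why callers like dask currently special-case it.
```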
- In use, it seems many projects rely on a single bucket, so tying a filesystem object to a bucket doesn't seem that restrictive. This is based only on a small sample of users (really just people in my office), so it may not be 100% true.
A proposal for change
Since s3fs is already in use, we can't perform a breaking change like this without some deprecation cycle. I'm not actually sure if we need to break the old behavior though, but supporting both might make the code more complicated. Proposal:
- Add a `bucket=None` kwarg to the `__init__` of `S3FileSystem`. The default is the current behavior (usage sketched after this list).
- If `bucket` is specified, it becomes the bucket for all specified paths.
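To illustrate, usage under this proposal might look like the following (the `bucket=` kwarg and the bucket name are hypothetical; the kwarg doesn't exist in s3fs today):

```python
import s3fs

# Current behavior: the bucket is the first component of every path.
fs = s3fs.S3FileSystem(anon=True)
fs.ls("mybucket/data")

# Proposed behavior (hypothetical): bucket fixed at construction,
# so paths are plain keys within that bucket.
fs = s3fs.S3FileSystem(bucket="mybucket", anon=True)
fs.ls("data")
with fs.open("data/file.csv") as f:
    contents = f.read()
```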
Since s3fs currently works with paths prefixed with `s3://`, I'm not sure how that should be handled. IMO s3fs shouldn't accept paths specified with `s3://` (and should instead accept plain paths, as if they were local paths), but that may be hard to deprecate cleanly. Perhaps if the filesystem has a bucket specified, error on paths that start with `s3://` (a sketch of that check follows).
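A minimal sketch of that check, assuming a `bucket` attribute on the filesystem (the `_resolve_path` helper is made up for illustration and is not part of s3fs):

```python
def _resolve_path(self, path):
    # Hypothetical helper: map a user-supplied path to (bucket, key).
    if self.bucket is not None:
        if path.startswith("s3://"):
            raise ValueError(
                "paths may not carry an 's3://' prefix when the "
                "filesystem is bound to bucket %r" % self.bucket
            )
        # Treat the path as a plain key under the configured bucket.
        return self.bucket, path.lstrip("/")
    # Current behavior: the first path component is the bucket.
    if path.startswith("s3://"):
        path = path[len("s3://"):]
    bucket, _, key = path.partition("/")
    return bucket, key
```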
If deprecation of the existing behavior is desired, a warning could be raised on `__init__` whenever `bucket` is None, and eventually the existing behavior phased out.
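For example, the warning might look something like this (hypothetical sketch, not current s3fs code):

```python
import warnings

class S3FileSystem:
    # Sketch only -- not the real class.
    def __init__(self, bucket=None, **kwargs):
        if bucket is None:
            warnings.warn(
                "S3FileSystem without a bucket is deprecated; pass "
                "bucket='...' to bind the filesystem to a single bucket",
                FutureWarning,
            )
        self.bucket = bucket
```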
Anyway, this is just one person's opinion. Figured it'd be worth bringing up to get some feedback.
cc @martindurant, @mrocklin. Also cc @danielfrg based on conversation from earlier today.
As one data point, there's another python s3 filesystem here (with a bit of a different interface/design) that sets bucket as a configuration parameter.
Some thoughts on this.
- having bucket config as an optional mode seems like it would be more confusing than not, since the code would have two modes of operation
- at some point, listing the buckets that a user owns seems like a thing that is desired, and getting information at the bucket level in general
- agree that almost all operations are intra-bucket, and maybe you are right that this matches most people's expectations (don't know)
- There does seem to be a tendency for people to want to supply URLs as 's3://bucket/path' even when dealing with the explicitly s3 interface here; this probably comes from the various CLI tools and Spark/Dask usage. Is it not reasonable to assume that the path that works in Dask works identically here?
- I could imagine this working by having a concept of `cwd`, i.e., we have a current location, and paths that look relative (no leading `'/'`) are within it (a toy sketch follows this list). Does this mean a `cwd` that can descend down the directory tree, and how do you form URLs with "s3:"? I don't know.
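For what that `cwd` idea could look like, here's a toy sketch using `posixpath` (none of these names exist in s3fs):

```python
import posixpath

class CwdSketch:
    # Toy illustration of the cwd idea; not real s3fs code.
    def __init__(self, cwd="/"):
        self.cwd = cwd

    def absolute(self, path):
        # Paths without a leading '/' are resolved against the current
        # location; absolute paths pass through unchanged.
        if not path.startswith("/"):
            path = posixpath.join(self.cwd, path)
        return posixpath.normpath(path)

fs = CwdSketch(cwd="/bucket/project")
fs.absolute("data/file.csv")  # '/bucket/project/data/file.csv'
fs.absolute("/other/key")     # '/other/key'
```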
Just my opinion here: most of the time I expect to supply a bucket when I try to access data on S3. It's kinda also similar to what boto does: you create a bucket object and then pass a key to get/set the file contents (sketched below).
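For reference, the boto (v2) pattern being described is roughly the following (bucket and key names are placeholders; credentials are assumed to come from the environment):

```python
import boto

conn = boto.connect_s3()                    # credentials from env/config
bucket = conn.get_bucket("mybucket")        # bucket object first...
key = bucket.get_key("path/to/file.csv")    # ...then keys within it
contents = key.get_contents_as_string()     # read the file contents
```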
There are indeed some tools, like hadoop distcp, that allow URIs like s3://bucket/path/, but I would bet that they just parse the URI into bucket + path for the Java lib they use.
FWIW as a user I find the s3://bucket/path approach intuitive. Also when I share data I often want to share a single address, not a bucket and an address.
> FWIW as a user I find the s3://bucket/path approach intuitive. Also when I share data I often want to share a single address, not a bucket and an address.
This would be doable by making the bucket part of the `S3FileSystem` config - tools like dask or pandas don't access the `S3FileSystem` object directly. Currently, things like `host` become part of the configuration when parsing a URI, which would then be forwarded using dask's normal filesystem init, removing our need for the s3fs/gcsfs special case (see the sketch below).
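A sketch of the parsing that would enable (the helper and the `bucket` option name are invented for illustration, not dask internals):

```python
from urllib.parse import urlparse

def split_storage_options(url):
    # Split an S3 URL into (scheme, filesystem config, path). The host
    # (bucket) becomes filesystem configuration, like `host` would for
    # other remote filesystems; only the key path is left over.
    parsed = urlparse(url)
    storage_options = {"bucket": parsed.netloc}
    return parsed.scheme, storage_options, parsed.path.lstrip("/")

scheme, options, path = split_storage_options("s3://mybucket/data/x.csv")
# scheme == 's3', options == {'bucket': 'mybucket'}, path == 'data/x.csv'
# S3FileSystem(**options) could then be built without any s3fs-specific
# handling in the caller.
```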