Consider having bucket be a configuration parameter for `S3FileSystem`
Background
In S3, a bucket and a key are separate concepts. In the current design of s3fs, both are used together to define a full path. This poses a number of complications:
- When parsing `s3://bucket/path/to/file`, traditional URI schemes would have `bucket` be the host and the remainder be the path (see the sketch after this list). In `dask` we've ended up special casing `s3fs` and `gcsfs` when handling this parsing, which is unfortunate. If the bucket were part of the filesystem, this special casing would no longer be necessary.
- Buckets must be handled differently from paths. See e.g. how the code for `touch` (https://github.com/dask/s3fs/blob/master/s3fs/core.py#L793-L808) is split depending on whether the path is just a bucket or a full key. Buckets have different semantics, when in a filesystem all objects should have the same semantics. See `glob` for another example (https://github.com/dask/s3fs/blob/master/s3fs/core.py#L548-L549): you can't glob across buckets, because that would be all of S3.
- Certain S3 operations only work on objects all under the same bucket. See e.g. https://github.com/dask/s3fs/blob/master/s3fs/core.py#L708-L737
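To make the first point concrete, here's roughly what a standard URI parser does with an S3 URL (a standard-library sketch, not s3fs code):

```python
from urllib.parse import urlparse

parsed = urlparse("s3://bucket/path/to/file")
print(parsed.scheme)  # 's3'
print(parsed.netloc)  # 'bucket'        -- the host under normal URI rules
print(parsed.path)    # '/path/to/file' -- the remainder
# s3fs instead expects the combined 'bucket/path/to/file' form, which is
# why callers like dask currently special-case it.
```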
- In use, it seems many projects rely on a single bucket, so tying a filesystem object to a bucket doesn't seem that restrictive. This is based only on a small sample of users (really just people in my office), so it may not be 100% true.
A proposal for change
Since s3fs is already in use, we can't perform a breaking change like this without some deprecation cycle. I'm not actually sure if we need to break the old behavior though, but supporting both might make the code more complicated. Proposal:
- Add a `bucket=None` kwarg to the `__init__` of `S3FileSystem`. The default is the current behavior (usage sketched after this list).
- If `bucket` is specified, it becomes the bucket for all specified paths.
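To illustrate, usage under this proposal might look like the following (the `bucket=` kwarg and the bucket name are hypothetical; the kwarg doesn't exist in s3fs today):

```python
import s3fs

# Current behavior: the bucket is the first component of every path.
fs = s3fs.S3FileSystem(anon=True)
fs.ls("mybucket/data")

# Proposed behavior (hypothetical): bucket fixed at construction,
# so paths are plain keys within that bucket.
fs = s3fs.S3FileSystem(bucket="mybucket", anon=True)
fs.ls("data")
with fs.open("data/file.csv") as f:
    contents = f.read()
```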
Since s3fs currently works with paths prefixed with `s3://`, I'm not sure how that should be handled. IMO s3fs shouldn't accept paths specified with `s3://` (and should instead accept plain paths, as if they were local paths), but that may be hard to deprecate cleanly. Perhaps if the filesystem has a bucket specified, error on paths that start with `s3://` (a sketch of that check follows).
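A minimal sketch of that check, assuming a `bucket` attribute on the filesystem (the `_resolve_path` helper is made up for illustration and is not part of s3fs):

```python
def _resolve_path(self, path):
    # Hypothetical helper: map a user-supplied path to (bucket, key).
    if self.bucket is not None:
        if path.startswith("s3://"):
            raise ValueError(
                "paths may not carry an 's3://' prefix when the "
                "filesystem is bound to bucket %r" % self.bucket
            )
        # Treat the path as a plain key under the configured bucket.
        return self.bucket, path.lstrip("/")
    # Current behavior: the first path component is the bucket.
    if path.startswith("s3://"):
        path = path[len("s3://"):]
    bucket, _, key = path.partition("/")
    return bucket, key
```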
If deprecation of the existing behavior is desired, a warning could be raised on `__init__` whenever `bucket` is None, and eventually the existing behavior phased out.
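For example, the warning might look something like this (hypothetical sketch, not current s3fs code):

```python
import warnings

class S3FileSystem:
    # Sketch only -- not the real class.
    def __init__(self, bucket=None, **kwargs):
        if bucket is None:
            warnings.warn(
                "S3FileSystem without a bucket is deprecated; pass "
                "bucket='...' to bind the filesystem to a single bucket",
                FutureWarning,
            )
        self.bucket = bucket
```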
Anyway, this is just one person's opinion. Figured it'd be worth bringing up to get some feedback.
cc @martindurant, @mrocklin. Also cc @danielfrg based on conversation from earlier today.
As one data point, there's another python s3 filesystem here (with a bit of a different interface/design) that sets bucket as a configuration parameter.
Some thoughts on this.
- having bucket config as an optional mode seems like it would be more confusing than not, since the code would have two modes of operation
- at some point, listing the buckets that a user owns seems like a thing that is desired, and getting information at the bucket level in general
- agree that almost all operations are intra-bucket, and maybe you are right that this matches most people's expectations (don't know)
- There does seem to be a tendency for people to want to supply URLs as 's3://bucket/path' even when dealing with the explicitly s3 interface here; this probably comes from the various CLI tools and Spark/Dask usage. Is it not reasonable to assume that the path that works in Dask works identically here?
- I could imagine this working by having a concept of `cwd`, i.e., we have a current location, and paths that look relative (no leading `'/'`) are within it (a toy sketch follows this list). Does this mean a `cwd` that can descend down the directory tree, and how do you form URLs with "s3:"? I don't know.
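For what that `cwd` idea could look like, here's a toy sketch using `posixpath` (none of these names exist in s3fs):

```python
import posixpath

class CwdSketch:
    # Toy illustration of the cwd idea; not real s3fs code.
    def __init__(self, cwd="/"):
        self.cwd = cwd

    def absolute(self, path):
        # Paths without a leading '/' are resolved against the current
        # location; absolute paths pass through unchanged.
        if not path.startswith("/"):
            path = posixpath.join(self.cwd, path)
        return posixpath.normpath(path)

fs = CwdSketch(cwd="/bucket/project")
fs.absolute("data/file.csv")  # '/bucket/project/data/file.csv'
fs.absolute("/other/key")     # '/other/key'
```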
Just my opinion here: most of the time I expect to supply a bucket when I try to access data on S3. It's kinda also similar to what boto does: you create a bucket object and then pass a key to get/set the file contents (sketched below).
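For reference, the boto (v2) pattern being described is roughly the following (bucket and key names are placeholders; credentials are assumed to come from the environment):

```python
import boto

conn = boto.connect_s3()                    # credentials from env/config
bucket = conn.get_bucket("mybucket")        # bucket object first...
key = bucket.get_key("path/to/file.csv")    # ...then keys within it
contents = key.get_contents_as_string()     # read the file contents
```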
There are indeed some tools, like hadoop distcp, that allow URIs like s3://bucket/path/, but I would bet that they just parse the URI into bucket + path for the Java lib they use.
FWIW as a user I find the s3://bucket/path approach intuitive. Also when I share data I often want to share a single address, not a bucket and an address.
> FWIW as a user I find the s3://bucket/path approach intuitive. Also when I share data I often want to share a single address, not a bucket and an address.
This would be doable by making the bucket part of the `S3FileSystem` config - tools like dask or pandas don't access the `S3FileSystem` object directly. Currently, things like `host` become part of the configuration when parsing a URI, which would then be forwarded using dask's normal filesystem init, removing our need for the s3fs/gcsfs special case (see the sketch below).
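A sketch of the parsing that would enable (the helper and the `bucket` option name are invented for illustration, not dask internals):

```python
from urllib.parse import urlparse

def split_storage_options(url):
    # Split an S3 URL into (scheme, filesystem config, path). The host
    # (bucket) becomes filesystem configuration, like `host` would for
    # other remote filesystems; only the key path is left over.
    parsed = urlparse(url)
    storage_options = {"bucket": parsed.netloc}
    return parsed.scheme, storage_options, parsed.path.lstrip("/")

scheme, options, path = split_storage_options("s3://mybucket/data/x.csv")
# scheme == 's3', options == {'bucket': 'mybucket'}, path == 'data/x.csv'
# S3FileSystem(**options) could then be built without any s3fs-specific
# handling in the caller.
```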