Make S3Path compatile with pandas
A suggestion to make S3Path more compatible with pandas.
if you try something like :
path = S3Path('/bucket/sample.csv')
pandas.read_csv(path.open('r'), ...)
This provides a file equivalent as first argument and works as expected!.
However if you try :
pandas.read_csv(path, ...)
This provides a path equivalent as first argument but it does not work!
This is because pandas will call path.fspath() to get an actual path, which in this case is '/bucket/sample.csv' and points to a local path.
If instead fspath() returned 's3://bucket/sample.csv' than pandas could load the data directly and respect the path semantics.
Please note that pandas internally uses s3fs and fsspec to open the s3 url, unlike S3Path which uses smart_open I believe. In any case this would make the usage of S3Path as a path in the pandas context much more intuitive!
This could easily be achieved by adding a proper fspath() member to the S3Path. Hope this makes sense.
Thank you.
Thanks @furechan, I like the idea!
A few quick technical thoughts...
To implement your suggestion, we could probably simply implement fspath() to return the same thing as as_uri(), which we are discussing in #77. (The question there is what encoding, if any, needs to be done for the s3://... URI... hopefully someone can quickly track down a definitive answer. Then it would be just a few lines of code to implement.)
One disadvantage which I see right away is that I need to configure some ACL settings for S3Path, and if Pandas is using s3fs, then I'd have to configure s3fs separately. If you're using default settings, everything should work very nicely. On the other hand, if you're using a custom config like me, this will make it very difficult to understand what's going on (since Pandas is sneakily switching the interface from S3Path to s3fs).
So actually I don't like that Pandas calls .fspath(). Why not call .open() instead??? Maybe Pandas should be more compatible with us! :smile:
Perhaps a good middle ground would be to pass the URI and raise a warning when .fspath() is invoked? ... But then I could also see an argument that .fspath() should only return a genuine filesystem path, which the URI is not. Thus Pandas could attempt .as_uri() in case .fspath() fails.
Related: a while back I submitted a patch to Plotly to make it S3Path and pathlib compatible. It was just right now released in the new v5.0, so I'm very pleased with that.
It ended up being quite a lot more work than I initially expected. But hopefully it's worth it.
hi @maresb,
I can confirm that fsspec does not require encoding of the s3 url. Just tried it by hand. So __fspath__() would just need to return the raw path with the s3:// prefix without any url encoding.
Yes, it'd be nice if there was a 'path open' protocol in python or pandas. In a sense maybe that's what the fsspec is trying to achieve ...
In my opinion, the current __fspath__() value is problematic because it makes the path look like a local path which it is not. You could be writing to the local file system thinking you are writing to S3. So it seems that __fspath__() should either raise an exception or return an s3 url.
For the configuration you may want to be aligned with the big projects and the likes of pandas, dask, intake that already use fsspec and s3fs etc. s3fs seems to mostly use the default boto session/client config ... One option to consider is possibly use s3fs along with or as an alternative to smart_open !?
Thanks
Just to clarify: it is __fspath__() not fspath() (I had trouble with the formatting ...)
@furechan,
I can confirm that fsspec does not require encoding of the s3 url. Just tried it by hand. So fspath() would just need to return the raw path with the s3:// prefix without any url encoding.
Excellent, thanks!!! That's great to know.
In my opinion, the current fspath() value is problematic because it makes the path look like a local path which it is not. You could be writing to the local file system thinking you are writing to S3. So it seems that fspath() should either raise an exception or return an s3 url.
I fully agree.
For the configuration you may want to be aligned with the big projects and the likes of pandas, dask, intake that already use fsspec and s3fs etc. s3fs seems to mostly use the default boto session/client config ... One option to consider is possibly use s3fs along with or as an alternative to smart_open !?
It turns out that @liormizr just switched to smart_open a few months ago. I don't know anything about the decision, but I'm sure @liormizr has good reasons. (To clarify, I'm no maintainer here, I'm just an interested power-user with a few recent pull requests.)
I'm quite ignorant of the various alternatives. I've never used fsspec or s3fs myself. I agree that it'd be interesting to see about making things interoperable.
Speaking of alternatives, the other day I told my colleague about S3Path, and he came back with this instead. It may be very interesting to compare it with this project.
To make any decisions we need to hear from @liormizr about the direction he thinks we should take. I can say from my experience that he's very responsive to feedback and PRs!
Hi guys, Sorry for the delay...
Excellent discussion :slightly_smiling_face:
So first subject is the fspath: We already had an issue asking for that #59, I wanted to see if there is more requests before changing the standard pathlib behaviour. I see your points, lets change it so fspath method will return s3 url (without encoding).
About cloudpathlib looks like a cool project. When I started this project I had to decide if I want a project that handle multiple cloud providers or just AWS S3. My personal decision was keep it simple. Do a small clean and simple library that extend for pathlib with S3 flavour and maybe in the future create more like that. A good example for that is pathy a small extend for pathlib with Google Storage flavour.
About smart_open - Good project, highly recommended, And gave me the file object with the optimization that we needed.
Thats why we selected it.
Last thing, @maresb you are more then welcome to join me in this project if you want to :wink:
Sounds good! Thanks
I'm confused about why this issue is closed. It seems to me like there are outstanding todos here, namely to fix .__fspath__().
I actually like @liormizr's original suggestion the best. Namely, I think that calling .__fspath__() should raise an exception, and this exception should suggest to the user that they use either .as_uri() or with path.open() as file_handle:. That way the user must be explicit about the intended behavior. Otherwise we are making an important choice implicitly and silently, and this is likely to lead to confusion.
Or would you still prefer to return .as_uri()?
@liormizr, I'm curious, what was the particular optimization you needed?
I agree, this issue is still open.
@maresb I see what you are saying But why do you think that fspath() should raise an exception?
remember in order that S3Path will stay an extension of pathlib it should keep the same rules. From a quick check I see:
- All PathLike objects should have fspath method need to return a bytes / str object (with the preference being for str) I don't want to break it with an exception... link
- In pathlib the fspath method simply return the str representation of the path object (without uri encoding) For example from pathlib doc's:
>>> p = PurePosixPath('/etc')
>>> str(p)
'/etc'
>>> p = PureWindowsPath('c:/Program Files')
>>> str(p)
'c:\\Program Files'
So I understand why we thought in the beginning that PureS3Path inherited from PurePath So depends on your local system the str representation can be changed - Not the most explicit ...
Maybe (I'm not sure that it's the best choice) we can say that PureS3Path inherited explicitly from PurePosixPath, And that the str representation should look like that:
>>> p = PurePosixPath('/etc')
>>> str(p)
'/etc'
>>> p = PureWindowsPath('c:/Program Files')
>>> str(p)
'c:\\Program Files'
>>> p = PureS3Path('bucket/sample.csv')
>>> str(p)
's3://bucket/sample.csv'
Need to think about that... :thinking: What do you guys think?
Sorry for closing the issue... my mistake!
I still think returning the s3 url is better than a raw path that is in no way usable by the file system ....
As an interesting note I checked what cloudpathlib is doing... It returns a sort of joint path relative to temp, something along the line of /temp/tmp.../bucket/... which points to a valid local path. I think they may use this file as cache for i/o ...
My reason for why __fspath__() should raise an exception is that an S3Path object has no corresponding path on the local filesystem. (Similarly to how I expect path.open().fileno() to be an UnsupportedOperation.) I'd prefer that an exception be raised, rather than passing along something likely to be unexpected. The URI may or may not be handled correctly. I'm thinking of the case where a package tries something like open(os.fspath(path)). Then the offending package would raise some sort of "FileNotFoundException", and we'd forgo our opportunity to give a helpful error message.
On the other hand, I can see your point how s3://bucket/sample.csv is a valid representation of the path in some sense, even if it doesn't correspond to a local object. It really depends on your definition of what a "filesystem object" is allowed to be, and I haven't yet been able to find a precise definition.
That's very interesting what cloudpathlib is doing. However, I can see that leading to big problems when a package looks for another file relative to the cached file...
Hi @maresb,
As you say, what is the definition of a file system ? In a strict term you can say its os.open, but even there is fails to open a windows path on linux for example. We may say raising an exception is more pythonic, explicit better than implicit etc ... Which I completely understand.
If instead you say a file system is something like fsspec than the s3 url is correct. This is what happens in pandas and also amazon's own aws-data-wrangler where s3 urls are commonly used as path. This fits the least surprise argument ...
An other way to say it could be, a WindowsPath opens in the 'Windows' file system, a UnixPath opens in the 'Unix' file system and an S3Path opens in the 'S3' file system ...
OK
@furechan , @maresb
I think that I'll try to change PureS3Path string representation to show the full uri (s3://
I'll update in a week or so. Let me know if you have more thoughts on the subject
I think @furechan summarized the situation very eloquently in his last post.
I have been thinking that it might be nice to support both behaviors by having both a strict and a non-strict mode. In the strict mode, __fspath__() would raise an informative exception, while in non-strict mode it would return s3://bucket-name/unencoded-key.
This raises a few further questions...
- Would this feature even be worthwhile?
- What should be the default behavior? Strict or non-strict?
- How could we allow setting the behavior while minimizing boilerplate?
For 1. I think I have to defer to @liormizr. I think I'd find it useful for debugging, but I would understand if he wants to keep it simple and support only the non-strict behavior.
For 2. I don't really care so much, as long as it's configurable. It sounds like you'd favor non-strict as default.
For 3, we may be able to use a property like S3Path.strict_fspath_mode = True. Then one can continue using the header from s3path import S3Path, and the mode could be easily changed with a single line.
@maresb I understand your point. The configuration part, I prefer keeping as much as possible the Path objects API like pathlib. What about adding another Path objects with those options?
For example:
import os
from s3path import S3Path, S3StrictPath, S3UriPath # Need to think about the names
>>> p = S3Path('bucket/sample.csv')
>>> os.fspath(p)
'/bucket/sample.csv'
>>> p = S3UriPath('bucket/sample.csv')
>>> os.fspath(p)
's3://bucket/sample.csv'
>>> p = S3StrictPath('bucket/sample.csv')
>>> os.fspath(p)
UnsupportedOperation: S3StrictPath unsupported file system representation
What do you think?
I actually find the Pathlib hierarchy already quite confusing. Introducing further classes into the hierarchy would make things even more complicated...
I also don't really see any use case for mixing strict and non-strict behaviors simultaneously. I envision strict_fspath_mode to be a global variable in the s3path module namespace, with a shortcut via the S3Path.strict_fspath_mode property. I'm just trying to avoid a mess like
from s3path import S3Path, set_strict_fspath_mode
set_strict_fspath_mode(True)
which could instead be the much more readable
from s3path import S3Path
S3Path.strict_fspath_mode = True
Is there any disadvantage to adding such a property to the S3Path class, even though it's not part of the pathlib API? If it is a global config variable, then you'd set it on initialization, so there's minimal risk of breaking interoperability with Path in the core code. And if you don't know about it, it won't cause any problems, right?
Do we really need so many options ?
Configuration options, as I see it, have a problem as they have unexpected side effects that some other code might not know about or else have to keep track of. If possible having a well defined behavior is better...
Different classes are a clean way to do this. This is how a user would presumably try to remedy the situation by specializing the class himself. This comes at the cost of a somehow confusing hierarchy. I could live with it as a user but is it worth it rather than pick a side and define what an S3Path is once and for all ...
If offering multiple classes I would still prefer S3Path to have the best default behavior like:
- S3Path -> s3://...
- S3RawPath -> /...
- S3StrictPath -> Exception
As a side note, I don't really see any practical use case for S3StrictPath, except maybe for the exception purist ;)
@furechan, I absolutely need the strict behavior. Perhaps I've done a poor job at explaining why...
I rely on s3path to correctly set ACLs when I create a file. If the ACLs are not correctly set, then others will not be able to read the files which I write. I really need to avoid this possibility, because I will not notice the problem right away, and it will cause a huge mess in the future.
My view on the s3path abstraction is that it provides an interface for .open(), .write_bytes(), etc. which respect my ACLs. If Pandas or Tensorflow or whatever stupid library wants to ignore the pathlib interface, call os.fspath(), and sidestep my ACLs, then this is a huge problem for me.
I also want to be able to put a single documented line in my code like
# Don't change this or bad things will happen.
S3Path.strict_fspath_mode = True
That becomes much harder if there are a bunch of strange classes floating around.
Does that clarify things?
You guys have very interesting opinions :slightly_smiling_face:
I agree with @furechan that classes are more explicit @furechan the only thing is that if we will change S3Path behaviour we will need to do a bigger version (change API...) Give me a good idea for a S3UriPath name, we can deploy and see if users us it and change the default S3Path behaviour if needed.
@maresb I understand your point as well. What about mixing them toggle? So like pathlib have Path that detect if you are in Windows or Linux and return the relevant Path object we can do the same:
from s3path import register_configuration_parameter, S3PathFactory
from s3path import S3Path, S3StrictPath, S3UriPath # Need to think about the names
register_configuration_parameter(S3Path('/bucket1/'), factory_cls=S3Path)
register_configuration_parameter(S3Path('/bucket2/'), factory_cls=S3StrictPath)
register_configuration_parameter(S3Path('/bucket3/'), factory_cls=S3UriPath)
assert S3PathFactory('/bucket1/key') is S3Path('/bucket1/key')
assert S3PathFactory('/bucket2/key') is S3StrictPath('/bucket2/key')
assert S3PathFactory('/bucket3/key') is S3UriPath('/bucket3/key')
@maresb in your case it means that as the start of your code you will need to do:
register_configuration_parameter(S3Path('/'), factory_cls=S3Path)
# Use S3PathFactory in all the code
instead of
S3Path.strict_fspath_mode = True
# Use S3Path...
So S3PathFactory and configuration is you want a dynamic behavior S3Path, S3StrictPath, S3UriPath is you want explicit classes
The only thing is again if we won't change S3Path behaviour it will be easier to deploy it.
About development, I don't think that it's a lot of work... :wink:
What do you think?
hi @maresb,
got it!
hi @liormizr,
S3UriPath seems like a good name for what we are talking about!
Sorry, I feel like I'm missing something really basic here.
What exactly would be problematic with #84?
Hmm, so the one thing I can think of is as follows:
Suppose I install a package which uses S3Path in a way which is incompatible with strict_fspath_mode, but then I want to turn on strict_fspath_mode for myself. Then I'd break the subpackage because it's a global variable.
On the other hand, I feel like well-written subpackages should be compatible with strict_fspath_mode. Maybe I'm too idealistic. :stuck_out_tongue:
Nice!
So I'll start commenting here, But if we'll see that we want to progress with the PR, let's move the conversation to there.
PureS3Path.as_uri: The only reason that we overwrite this method is to change the docstrings to S3 behaviour.
The method that control the uri creation is in _S3Flavour make_uri method. The flavour implements a particular (platform-specific) set of path semantics. In our case s3 semantics. This make_uri return a uri encoded string using urllib parsers.
There are cases that s3path users use s3path and other path like objects in the same code flows. Depends on the needs, environment, object storage. There are cases that with s3path users us S3, LocalStack, MinIO, and even Azur blob. changing the uri behavior may be a dramatic change.
PureS3Path.fspath: Here it's kind of the same. PureS3Path is a PathLike object. "The method should only return a str or bytes object, with the preference being for str." In Path objects (pathlib base PathLike) the fspath returns the string representation of the path object.
So I'm having a hard time raising there an exception. Or changing the string representation..
Think of projects like pandas that can use PathLike objects and expect them with a certain behaviour.
I want to keep S3Path with no surprises. But I understand that at some cases the default behaviour is less nice for S3 relater usages. So I'm suggesting to add custom flavers, custom path objects that change the default behaviours. And see what is the reaction, Maybe in the feature we will decide that it's the right behavior and change S3Path.
There are features that are very easy to add, but to remove them it's impossible.
Now about strict_fspath_mode. when we started thinking about configurations I didn't like it. I thought that S3 configurations can be not inside the code. Let it be in "DevOps" level as boto3 environment variables or boto3 ~/.aws config directory or in aws iam level if you are in AWS. When we decided to add configuration we added register_configuration_parameter So if we are adding a way to config some object that can switch between path like objects with different flavours, I rather not create another way to config the library.
@maresb I hope I managed to explain my self and not confused you more we are all idealists :smiley:
@liormizr thanks so much for the very detailed response! This is very helpful. I think I have my confusion sorted now.
I see now that I should have overridden _S3Flavour.make_uri instead of PureS3Path.as_uri. I also see that _S3Flavour inherits from _PosixFlavour so that make_uri is defined as "file://" + urllib.parse.quote_from_bytes(bytes(path)).
Regarding expected behavior and surprises, I think my confusion arose from conflating two issues which are probably better handled separately:
- Figure out what is the "correct" URI representation of a S3Path, and figure out how to change (or configure or not change) the current behavior to avoid surprising users.
- For my use case, I need the option to entirely disable
fspathso that other packages won't inadvertently circumvent my ACLs.
As far as I can tell, an S3StrictPath will not solve my problem; my strict mode should be decoupled from the choice of URI type to return. (I would need to disable fspath for all choices of URI behavior, so turning strict mode into a URI behavior misses the point.)
I propose the following:
- We drop further discussion of
strict_fspath_modeandS3StrictPathin this issue. - I open a separate issue for
strict_fspath_mode, because I now believe that it's a separate issue with a separate implementation. - Now that I understand better your philosophy on configurations, and for the strict mode implementation I'll study
register_configuration_parametersome more and try to propose a new PR to make you happy.
Does this plan sound good?
As for naming the new class, maybe you could call it ExperimentalS3UriPath so that nobody confuses it for a stable feature. Then you could decide on the proper behavior of S3Path.as_uri without worrying that anyone becomes dependent on ExperimentalS3UriPath in the meantime.
Hey @maresb, @liormizr,
Could you guys summarize your final thoughts into a quick list or something? I can try and work on a PR for this if you're interested. I'd like the interoperability with Pandas, I'm just a bit confused on where we landed :D
Thanks for the great library, -Seth
Hi @four43, thanks so much for volunteering!
It's been a while and I can't bother at the moment to reread the whole conversation. But I think the real problem is to determine what is the proper behavior in the first place.
IMO, Pandas behaves badly. I'd much rather that Pandas add proper pathlib support so that we don't have to deal with this fspath nonsense. Personally, I want the option to raise an exception when this happens since I relying on some S3Path ACL stuff. However, most people don't have the ACL issues which I do, and for them it makes more sense to play nicely and just return the s3:// URL.
Unfortunately my request to raise an exception somewhat derailed and cluttered this discussion, and I apologize for that.
As for the discussion regarding S3PathFactory, I really don't understand any of that, or the motivation. Maybe it is now unnecessary since I will submit my "strict mode" implementation separately?
Is there anything wrong with simply changing the behavior of S3Path to return s3://bucket/sample.csv (no URL encoding necessary)?
Do you mean str(S3Path.from_uri("s3://bucket/sample.csv")) == "s3://bucket/sample.csv" ?
I will admit I find the additional step of using from_uri() instead of just the constructor kind of annoying in practice. ¯\_(ツ)_/¯
Hmm... I don't see any reason offhand why the S3Path constructor shouldn't detect and handle the s3:// prefix. But I'm not really immersed in the details at the moment.
Hi @four43 thank you for volunteering :-) Let's start from the beginning, is there an issue with Pandas that you see? Is this #59 solves your issue?