Pure path classes

Open jayqi opened this issue 5 years ago • 7 comments

pathlib has this notion of pure paths, PurePath (PurePosixPath, PureWindowsPath), vs. concrete paths, Path (PosixPath, WindowsPath).

Path classes are divided between pure paths, which provide purely computational operations without I/O, and concrete paths, which inherit from pure paths but also provide I/O operations.

Pure paths are useful in some special cases; for example:

  1. If you want to manipulate Windows paths on a Unix machine (or vice versa). You cannot instantiate a WindowsPath when running on Unix, but you can instantiate PureWindowsPath.
  2. You want to make sure that your code only manipulates paths without actually accessing the OS. In this case, instantiating one of the pure classes may be useful since those simply don’t have any OS-accessing operations.
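As a quick illustration of the two cases above, here is a minimal sketch using only the standard library. The example paths are made up.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Case 1: a Windows-flavoured path can be built and inspected on any OS,
# because pure paths never touch the file system.
win = PureWindowsPath(r"C:\Users\example\data\file.csv")
print(win.drive)              # drive component of the Windows path
print(win.name)               # final path component
print(win.parent.as_posix())  # parent directory, rendered with forward slashes

# Case 2: pure paths simply do not have the I/O methods of concrete paths.
posix = PurePosixPath("/home/example") / "data" / "file.csv"
print(posix.suffix)
print(hasattr(posix, "exists"))  # no I/O methods on pure paths
```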

Would it be useful for cloudpathlib to have pure paths that let you manipulate paths for a cloud provider without needing to authenticate?

jayqi avatar Aug 19 '20 17:08 jayqi

Here's my take on this. I will say that it is likely colored by trying to instantiate a WindowsPath multiple times on POSIX systems, having that fail, and then having to change things to PureWindowsPath, which I found annoying.

let you manipulate paths for a cloud provider without needing to authenticate

There may be a limited number of scenarios where this could be nice, but 90% of the time I don't think folks are going to need it. I'd rather support that for clients without an explicit "Pure" class.

I like our current setup for a few reasons:

  • We don't have to explain the differences between them to end users and support separate APIs for them.
  • You should be able to do pure path things with the default backend for any of the providers. S3 (and I think also Azure?) don't need to hit the server for anything with what we do in __init__. (I guess we will have to take this into consideration
  • Trying to do something concrete without having the client properly configured will give you an error that probably makes sense (e.g., Access Denied).
  • I've got a general preference for having as few abstractions as possible.
  • If there's a scenario where a dev wants to guarantee the cloud provider won't be hit with any requests, we can provide a way to do this and an example in the docs.

I have a hunch that there are deep, dark corners of operating systems and file systems that mean even instantiating a WindowsPath on a POSIX system could spell disaster. So, it probably made sense for pathlib, but I'm not convinced this additional abstraction provides enough of a benefit to users of cloudpathlib to make it worth the complexity.

pjbull avatar Aug 22 '20 23:08 pjbull

Okay, agreed that this doesn't seem clearly necessary. Luckily, I think if we find from usage that this is valuable in the future, it should be a fairly straightforward change that wouldn't change anything about the interfaces of existing path classes.

jayqi avatar Aug 23 '20 01:08 jayqi

Closing for now unless we hear more of a need from consumers of the package.

pjbull avatar Aug 25 '20 04:08 pjbull

Just chiming in with a vote in favour of this feature - the ability to symbolically manipulate cloud-like paths without assuming we actually want to access data at those paths would be useful.

You should be able to do pure path things with the default backend for any of the providers.

Here you're encouraging using a subset of methods as if they were on a PureCloudPath class, but without any marking delineating those methods or guarantee that they are pure. That kind of indicates that having such a separation would be useful.

If there's a scenario where a dev wants to guarantee the cloud provider won't be hit with any requests, we can provide a way to do this and an example in the docs.

How might this work?

(All of that said, the current choice is of course a totally reasonable way to limit scope 🙂 )

TomNicholas avatar Aug 26 '25 15:08 TomNicholas

What's the primary benefit you're looking for? Is it not installing dependencies? Is it guaranteeing no network calls? Is it handling arbitrary kinds of paths?

I'd like to avoid the complexity we see in the pathlib class hierarchy. Would raising a runtime error on any attempt to access the network work in your scenario?

For example, I could see doing an implementation like this, which would just work to do simple manipulations without any dependencies installed and raise on any interaction with the client layer.

from cloudpathlib import CloudPath
from cloudpathlib.cloudpath import CloudImplementation
from cloudpathlib.client import Client


class DummyClient(Client):
    _error_message = "PureCloudPath does not support calls that require a client."
    
    def __init__(self):
        pass

    def __del__(self):
        # override as no-op
        pass

    def __getattr__(self, item):
        raise NotImplementedError(self._error_message)

    def _download_file(self, cloud_path, local_path):
        raise NotImplementedError(self._error_message)
    
    def _exists(self, cloud_path):
        raise NotImplementedError(self._error_message)
    
    def _generate_presigned_url(self, cloud_path, expire_seconds=3600):
        raise NotImplementedError(self._error_message)
    
    def _get_public_url(self, cloud_path):
        raise NotImplementedError(self._error_message)
    
    def _list_dir(self, cloud_path, recursive):
        raise NotImplementedError(self._error_message)
    
    def _move_file(self, src, dst, remove_src=True):
        raise NotImplementedError(self._error_message)
    
    def _remove(self, path, missing_ok=True):
        raise NotImplementedError(self._error_message)
    
    def _upload_file(self, local_path, cloud_path):
        raise NotImplementedError(self._error_message)
    

class PureCloudPath(CloudPath):
    cloud_prefix = ""  # instantiated on init and never changed
    
    def __init__(self, path: str):
        self.cloud_prefix = path.split("://", 1)[0] + "://"
        self._client = DummyClient()
        super().__init__(path)

    @property
    def drive(self):
        return self.parts[1]  # first after anchor
    
    def mkdir(self, parents: bool = False, exist_ok: bool = False):
        return self.client.mkdir(self, parents, exist_ok)
    
    def touch(self, exist_ok: bool = True):
        return self.client.touch(self, exist_ok)
    

pure_cloud_meta = CloudImplementation()
pure_cloud_meta.name = "pure"
pure_cloud_meta._client_class = DummyClient
pure_cloud_meta._path_class = PureCloudPath

PureCloudPath._cloud_meta = pure_cloud_meta

DummyClient.CloudPath = PureCloudPath
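A standalone sketch of the raising pattern used by DummyClient above, runnable without cloudpathlib installed. The class RaisingStub and its method names are hypothetical and exist only for illustration. It shows two things: `__getattr__` is consulted only when normal attribute lookup fails, which is presumably why the client's own methods are also overridden explicitly, and raising NotImplementedError rather than AttributeError means that `hasattr` raises too instead of returning False.

```python
class RaisingStub:
    """Hypothetical stand-in for a client; not part of cloudpathlib."""

    _error_message = "This stub does not support calls that require a client."

    def __getattr__(self, item):
        # Called only when normal attribute lookup has already failed.
        raise NotImplementedError(f"{self._error_message} (attribute: {item!r})")

    def describe(self):
        # Found by normal lookup, so __getattr__ is never consulted.
        return "defined on the class"


stub = RaisingStub()
print(stub.describe())

# Both a plain attribute access and a hasattr() probe end up raising,
# because hasattr() only swallows AttributeError.
for probe in (lambda: stub.mkdir, lambda: hasattr(stub, "touch")):
    try:
        probe()
    except NotImplementedError as err:
        print(f"raised: {err}")
```

If code elsewhere relies on `hasattr` or `getattr` with a default, that behaviour is worth keeping in mind when choosing the exception type.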

pjbull avatar Aug 27 '25 22:08 pjbull

Is it not installing dependencies? Is it guaranteeing no network calls?

Both ideally. The context is a codebase that stores and manipulates bucket paths and prefixes across a range of object storage platforms (including ones outside of S3/GCS/Azure). We're not actually trying to change or read anything in these buckets - that's done by a totally separate layer.

I'm just looking for a lightweight solution to the problem of cloud-like path validation / manipulation without the baggage of actually using the network for anything. I can't find any library that does that - the closest is cloudpathlib, except for the lack of a PureCloudPath.
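To make the request concrete, here is a rough sketch of the kind of dependency-free manipulation being described, using only the standard library. BucketPath and its methods are hypothetical names, not part of cloudpathlib or any other package. It also assumes POSIX-style normalization of keys is acceptable, which is not true for every object store: PurePosixPath collapses repeated slashes, while a store may treat "a//b" and "a/b" as different keys.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class BucketPath:
    """Hypothetical helper: a scheme, a bucket, and a key. Never does I/O."""

    scheme: str
    bucket: str
    key: PurePosixPath

    @classmethod
    def parse(cls, uri: str) -> "BucketPath":
        # Split "<scheme>://<bucket>/<key>" by hand so that characters such
        # as "?" or "#" inside the key are kept as part of the key.
        scheme, sep, rest = uri.partition("://")
        bucket, _, key = rest.partition("/")
        if not sep or not scheme or not bucket:
            raise ValueError(f"expected '<scheme>://<bucket>/<key>', got {uri!r}")
        return cls(scheme, bucket, PurePosixPath(key))

    def __truediv__(self, other: str) -> "BucketPath":
        return BucketPath(self.scheme, self.bucket, self.key / other)

    @property
    def parent(self) -> "BucketPath":
        return BucketPath(self.scheme, self.bucket, self.key.parent)

    def __str__(self) -> str:
        key = self.key.as_posix()
        # An empty key is represented by PurePosixPath as ".".
        suffix = "" if key == "." else f"/{key}"
        return f"{self.scheme}://{self.bucket}{suffix}"


prefix = BucketPath.parse("s3://my-bucket/raw/2020")
print(prefix / "file.csv")
print(prefix.parent)
```

Because the scheme is kept as an opaque string, the same helper works for platforms outside S3/GCS/Azure, at the cost of knowing nothing provider-specific.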

For example, I could see doing an implementation like this, which would just work to do simple manipulations without any dependencies installed and raise on any interaction with the client layer.

This is potentially useful, thank you!

TomNicholas avatar Aug 29 '25 17:08 TomNicholas

I'm willing to reopen and consider adding an implementation like the one above if that would work for most use cases.

cloud-like path validation

I will note that we don't do any strict validation of cloudpaths (e.g., restrictions on specific characters in bucket names); we just assume this is all handled server-side or in the client SDK.
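For callers who do want a client-side check, here is a minimal sketch for a single provider. The function name is hypothetical and not part of cloudpathlib. The pattern covers only a simplified subset of the S3 general-purpose bucket naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; alphanumeric first and last character; no consecutive dots), and other providers have different rules.

```python
import re

# Simplified subset of S3 bucket naming rules; an assumption for illustration,
# not an exhaustive or authoritative validator.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def looks_like_s3_bucket_name(name: str) -> bool:
    """Return True if `name` passes the simplified checks described above."""
    return bool(_BUCKET_RE.fullmatch(name)) and ".." not in name


for candidate in ("my-bucket", "My_Bucket", "ab", "a..b"):
    print(candidate, looks_like_s3_bucket_name(candidate))
```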

pjbull avatar Aug 30 '25 00:08 pjbull