open(cloud_path) acts on cached local file instead of cloud file
Unfortunately, the following code does not behave as expected:

with open(cloud_path, "w") as fp:
    fp.write("foo")

(as opposed to using cloud_path.open("w"), which does work as expected).
What will happen is that this will only modify the local cached copy of the file, without uploading it to the cloud. Further attempts to interact with the cloud path will then correctly raise OverwriteNewerLocalError exceptions. Because cloud paths have an __fspath__ method (#72) that returns the local file path, the open function will only interact with that local file path.
Users can still get the correct behavior by using the open method. However, I've seen a lot of people who use regular pathlib, especially those who are still new to it, call open(path, "w") on their Path objects, so I expect this may be a common pattern for users of cloudpathlib as well.
The gist of the problem here is that our os.PathLike support fails in write situations.

Because CloudPath implements __fspath__ and therefore the os.PathLike protocol, it can be accepted by anything that supports an os.PathLike input, which will call __fspath__ and do whatever it likes with the returned filepath. This is fine and dandy for read operations: the local cache file's path is returned, the caller reads that file, and we have the ability to refresh the cache before that happens.
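To make the read path concrete, here is a minimal sketch (bucket name hypothetical) of what a PathLike-aware caller effectively does:

import os
from cloudpathlib import CloudPath

cloud_path = CloudPath("s3://hypothetical-bucket/data.csv")

# os.fspath() invokes __fspath__, which refreshes the cache and then
# returns the local cache file's path as a plain string.
local_str = os.fspath(cloud_path)

# From here on, the caller only ever sees the local file, which is
# fine for reads because the cache was just refreshed.
with open(local_str) as f:
    contents = f.read()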
However, the problem happens when the caller writes. We're able to refresh the cache within __fspath__ before returning the local cache's path, but after we return, we're out of that method's scope and can't do anything else. The caller will write to the local cache, but we won't be able to sync the cache back up to the cloud.
This may be impossible to fix. Once __fspath__ returns, we won't be able to do anything. We'd really like __fspath__ to behave like a context manager so we could run some code afterwards, but unfortunately we can't actually do that, because the os.PathLike protocol requires __fspath__ to return bytes or str. We'd need the Python standard library to implement a protocol that supports a context manager, and then they'd have to convince all of the other libraries to support the new context-manager interface. os.PathLike has been around since Python 3.6, and there are still lots of places that don't support it yet and only accept filepaths as strings.
If it's the case that this is impossible to fix, then we have two options:

1. Keep our os.PathLike support but try to detect when it's being used in a write case so we can raise an exception.
@pjbull suggested that we could maybe use the inspect module to trace the call stack. We could potentially try to detect whether __fspath__ is being called by the built-in open function and then check whether mode is set to one of the writable ones (a rough sketch of this idea follows after the list).
It's not clear to me that we can come up with a solution guaranteed to catch all cases, though. If a user calls __fspath__, stores the returned path as a string, and then writes to that file later, that would be harder to reliably detect. I'm also not sure whether the built-in open is the only way to write to files in Python.
2. Remove __fspath__ and don't implement the os.PathLike interface.

That means users can't pass cloud paths into functions that expect os.PathLike inputs, even for reading.
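To make option (1) a bit more concrete, here is a rough, heuristic sketch of the stack-inspection idea mentioned above. Note that the built-in open is implemented in C, so its arguments never appear as a Python frame; a check like this would only catch Python-level callers that bind a mode variable:

import inspect

_WRITE_CHARS = set("wax+")

def _looks_like_write_usage() -> bool:
    # Walk up the Python call stack (skipping this function and
    # __fspath__ itself) and look for any frame with a `mode` local
    # that contains a write flag. Purely best-effort.
    for frame_info in inspect.stack()[2:]:
        mode = frame_info.frame.f_locals.get("mode")
        if isinstance(mode, str) and _WRITE_CHARS & set(mode):
            return True
    return False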
I think (1) is probably worth doing. We may also want to add a section to the docs on writing to the cloud, so that we can be very explicit about what we recommend, when you might see errors, and how they can be fixed, probably along with a library-level setting for making this a warning or an error.
Here are two other ideas while we're brainstorming:
3. Patch open

I would not do this by default, but I could see having a CLOUDPATHLIB_OVERRIDE_OPEN env var that is documented and could be set. If it is set at import time, we override the open builtin (and maybe also os.open?) with our own version that does the write-mode check (it only adds one check for read scenarios, so it shouldn't be a huge perf hit). If we are in write mode, we supply our own object that will upload on __exit__/.close(). (A rough sketch follows after this list.)
I guess the primary use case here is if you have a large or complex code base that you want to pass CloudPath objects into, but you don't want to change all of the open calls in the existing code.
4. Warn/raise on __del__ if the cache file has changed

We track the modified time of the cache file at the object level. In __del__, if the on-disk modified time is greater than the one on the object, we can warn or raise. (A sketch of this also follows after the list.) I could also see an env var here that would enable uploading to the cloud in __del__ if the local cache is newer, but that seems high risk for foot-shooting.
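For illustration, here is a rough sketch of what the option (3) override might look like. Only CLOUDPATHLIB_OVERRIDE_OPEN comes from the proposal above; the rest is made up for the sketch, and it leans on the fact that CloudPath.open already returns a file object that syncs back to the cloud on close:

import builtins
import os

_original_open = builtins.open

def _cloudpath_aware_open(file, mode="r", *args, **kwargs):
    from cloudpathlib import CloudPath

    # Route CloudPath arguments through CloudPath.open, which handles
    # uploading the cache on close; everything else takes the fast
    # path through the original builtin.
    if isinstance(file, CloudPath):
        return file.open(mode, *args, **kwargs)
    return _original_open(file, mode, *args, **kwargs)

if os.environ.get("CLOUDPATHLIB_OVERRIDE_OPEN"):
    builtins.open = _cloudpath_aware_open

And a fragment of what the option (4) __del__ check could look like (attribute names invented for illustration; _local is the cache path, as in the real class):

import warnings

class CloudPathDelCheckFragment:
    def _record_cache_mtime(self):
        # Call whenever cloudpathlib itself touches the cache file.
        self._tracked_mtime = self._local.stat().st_mtime

    def __del__(self):
        # If the cache changed after we last touched it, the change was
        # never uploaded to the cloud; warn rather than lose it silently.
        tracked = getattr(self, "_tracked_mtime", None)
        if tracked is not None and self._local.exists() and self._local.stat().st_mtime > tracked:
            warnings.warn(
                f"Cache file for {self} was modified but never uploaded."
            )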
Hi! I'm encountering this issue in a context similar to https://github.com/drivendataorg/cloudpathlib/issues/240, with pandas. This issue causes a problem with pandas.DataFrame.to_*(path), where you can't write to the same file more than once :/
Example:

import pandas as pd
import cloudpathlib

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
cloud_path = cloudpathlib.GSPath('gs://bucket/df.parquet')

df.to_parquet(cloud_path)  # works fine
df.to_parquet(cloud_path)  # raises OverwriteNewerLocalError
---------------------------------------------------------------------------
OverwriteNewerLocalError Traceback (most recent call last)
Input In [22], in <cell line: 1>()
----> 1 df.to_parquet(cloud_path)
File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/util/_decorators.py:207, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
205 else:
206 kwargs[new_arg_name] = new_arg_value
--> 207 return func(*args, **kwargs)
File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/core/frame.py:2835, in DataFrame.to_parquet(self, path, engine, compression, index, partition_cols, storage_options, **kwargs)
2749 """
2750 Write a DataFrame to the binary parquet format.
2751
(...)
2831 >>> content = f.read()
2832 """
2833 from pandas.io.parquet import to_parquet
-> 2835 return to_parquet(
2836 self,
2837 path,
2838 engine,
2839 compression=compression,
2840 index=index,
2841 partition_cols=partition_cols,
2842 storage_options=storage_options,
2843 **kwargs,
2844 )
File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/io/parquet.py:420, in to_parquet(df, path, engine, compression, index, storage_options, partition_cols, **kwargs)
416 impl = get_engine(engine)
418 path_or_buf: FilePath | WriteBuffer[bytes] = io.BytesIO() if path is None else path
--> 420 impl.write(
421 df,
422 path_or_buf,
423 compression=compression,
424 index=index,
425 partition_cols=partition_cols,
426 storage_options=storage_options,
427 **kwargs,
428 )
430 if path is None:
431 assert isinstance(path_or_buf, io.BytesIO)
File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/io/parquet.py:176, in PyArrowImpl.write(self, df, path, compression, index, storage_options, partition_cols, **kwargs)
172 from_pandas_kwargs["preserve_index"] = index
174 table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
--> 176 path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
177 path,
178 kwargs.pop("filesystem", None),
179 storage_options=storage_options,
180 mode="wb",
181 is_dir=partition_cols is not None,
182 )
183 try:
184 if partition_cols is not None:
185 # writes to multiple files under the given path
File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/io/parquet.py:80, in _get_path_or_handle(path, fs, storage_options, mode, is_dir)
70 def _get_path_or_handle(
71 path: FilePath | ReadBuffer[bytes] | WriteBuffer[bytes],
72 fs: Any,
(...)
77 FilePath | ReadBuffer[bytes] | WriteBuffer[bytes], IOHandles[bytes] | None, Any
78 ]:
79 """File handling for PyArrow."""
---> 80 path_or_handle = stringify_path(path)
81 if is_fsspec_url(path_or_handle) and fs is None:
82 fsspec = import_optional_dependency("fsspec")
File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/io/common.py:225, in stringify_path(filepath_or_buffer, convert_file_like)
222 return cast(BaseBufferT, filepath_or_buffer)
224 if isinstance(filepath_or_buffer, os.PathLike):
--> 225 filepath_or_buffer = filepath_or_buffer.__fspath__()
226 return _expand_user(filepath_or_buffer)
File /opt/conda/envs/deconv/lib/python3.10/site-packages/cloudpathlib/cloudpath.py:233, in CloudPath.__fspath__(self)
231 def __fspath__(self):
232 if self.is_file():
--> 233 self._refresh_cache(force_overwrite_from_cloud=False)
234 return str(self._local)
File /opt/conda/envs/deconv/lib/python3.10/site-packages/cloudpathlib/cloudpath.py:844, in CloudPath._refresh_cache(self, force_overwrite_from_cloud)
841 # if local newer but not dirty, it was updated
842 # by a separate process; do not overwrite unless forced to
843 if self._local.stat().st_mtime > stats.st_mtime:
--> 844 raise OverwriteNewerLocalError(
845 f"Local file ({self._local}) for cloud path ({self}) is newer on disk, but "
846 f"is being requested for download from cloud. Either (1) push your changes to the cloud, "
847 f"(2) remove the local file, or (3) pass `force_overwrite_from_cloud=True` to "
848 f"overwrite."
849 )
OverwriteNewerLocalError: Local file (/tmp/tmp1bto4kma/<hidden>/df.parquet) for cloud path (gs://<hidden>/df.parquet) is newer on disk, but is being requested for download from cloud. Either (1) push your changes to the cloud, (2) remove the local file, or (3) pass `force_overwrite_from_cloud=True` to overwrite.
Hi @grisaitis. This is indeed another case of the problem documented by this issue: cloud paths do not support being passed directly to third-party functions (like pd.to_parquet(cloud_path)) that generically do writes. Unfortunately, we are not close to being able to support it. Most of the discussion related to this issue involves giving users a clearer error to indicate that this is not supported.
@jayqi thank you so much for the reply. Should I be calling str(cloud_path) or cloud_path.as_uri()?
@grisaitis there are ~two~ three directions of workarounds you can take:
Update: See best option (0) in the following comment
(1): Introduce explicit local file paths and then use cloudpathlib to write/upload to the cloud. Some example pseudo-code:

cloud_path = CloudPath(...)
local_path = Path(...)

df.to_parquet(local_path)
cloud_path.upload_from(local_path)

This introduces local paths that you need to define and manage. It is also not the most efficient approach, because it creates another copy of the data that uses up disk space, along with the associated extra reads and writes.
(2): Use as_uri() to leverage other cloud filesystem libraries, like fsspec. This only works if the function you're passing cloud_path.as_uri() into has the relevant third-party support and if you have those dependencies installed. It works for pandas because pandas supports fsspec. What happens in this case is that you're using cloudpathlib to manipulate paths (with a pathlib-like syntax) but using an fsspec backend (s3fs, gcsfs, etc.), which is an entirely independent library, for the actual write to the cloud.
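For the pandas example above, option (2) looks something like this (bucket name hypothetical; requires fsspec and gcsfs to be installed):

import pandas as pd
from cloudpathlib import GSPath

df = pd.DataFrame(data={"col1": [1, 2], "col2": [3, 4]})
cloud_path = GSPath("gs://bucket/df.parquet")

# pandas sees a plain "gs://..." string and hands the write off to
# fsspec/gcsfs; cloudpathlib's cache is never involved, so repeated
# writes to the same path work.
df.to_parquet(cloud_path.as_uri())
df.to_parquet(cloud_path.as_uri())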
Good news! @pjbull pointed out to me another option, which is probably the best for cases that support it.
(0): For write functions that support file objects/handles as inputs, you can use cloud_path.open("w"). pandas does indeed support this. The syntax here is a little clunky, but it has the advantage over the other two options of not involving an extra copy of the file, and it uses cloudpathlib alone on the backend.
See example:

from cloudpathlib import CloudPath
import pandas as pd

cloud_path = CloudPath("s3://cloudpathlib-test-bucket/test.csv")
cloud_path.exists()
#> False

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
with cloud_path.open("w") as fp:
    df.to_csv(fp)

cloud_path.exists()
#> True
cloud_path.read_text()
#> ',x,y\n0,1,4\n1,2,5\n2,3,6\n'
Thanks! Since I want to avoid having any extra clutter around just calling df.to_*(path), I might try out https://github.com/fsspec/universal_pathlib as well.
You could start an asynchronous file watcher (e.g., in a thread) that detects changes whenever __fspath__ is called. It would have to know when the changes come from cloudpathlib vs. externally, though.
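For what it's worth, a polling sketch of that idea (all names here are hypothetical):

import threading
import time
from pathlib import Path

def watch_cache(local_path: Path, on_change, interval: float = 0.5) -> None:
    # Poll the cache file's mtime in a daemon thread and call on_change
    # when it increases. A real version would still need some way to
    # tell cloudpathlib's own writes apart from external ones.
    def _poll():
        last = local_path.stat().st_mtime if local_path.exists() else 0.0
        while True:
            time.sleep(interval)
            if local_path.exists():
                mtime = local_path.stat().st_mtime
                if mtime > last:
                    last = mtime
                    on_change(local_path)

    threading.Thread(target=_poll, daemon=True).start()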
Related issue in the standard library: https://github.com/python/cpython/issues/99818
Another workaround I've found is to use the context manager to open the CloudPath and do the write operation as normal. I'm working with pyarrow.parquet, and the write_table method only accepts a str or pyarrow.NativeFile. This is the example I've ended up with:

from cloudpathlib import CloudPath
from pyarrow import parquet

# `table` is a pyarrow.Table built elsewhere
target_filepath = CloudPath("gs://some_bucket/target_file.parquet")
target_filepath.exists()  # False

with target_filepath.open("w"):
    parquet.write_table(table, target_filepath)

target_filepath.exists()  # True
Whoa, I don't think I've considered that usage pattern. Thanks @FullMetalMeowchemist! It's really interesting that that works, given that we didn't (I don't think, anyways) do it intentionally. I think it makes sense, given that CloudPath.open patches the returned context manager to upload the local file to the cloud.
@pjbull I'm wondering if we do something like: on calls to fspath, check whether we have a buffer context manager active, and error if not? Then we could potentially enforce the pattern of

with cloud_path.open():
    whatever_write(cloud_path)

and error if fspath is called without being inside the context manager.
I guess it may still be undesirably cumbersome to require a context manager for read cases, though, and this doesn't help with detecting the difference between read and write usage.
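For discussion's sake, here is a minimal standalone sketch of that guard. GuardedPath is a toy stand-in, not the real CloudPath, and the upload step is only indicated by a comment:

import contextlib

class GuardedPath:
    def __init__(self, local_path: str):
        self._local = local_path
        self._active_handles = 0

    @contextlib.contextmanager
    def open(self, mode="r"):
        # Count nested open() contexts so __fspath__ can tell whether
        # it is being called inside one.
        self._active_handles += 1
        try:
            with open(self._local, mode) as fp:
                yield fp
            # the real CloudPath.open would upload the cache to the
            # cloud here after a write
        finally:
            self._active_handles -= 1

    def __fspath__(self):
        if self._active_handles == 0:
            raise RuntimeError(
                "fspath used outside of an open() context; a write "
                "would silently stay in the local cache"
            )
        return self._local

With this, the pattern with path.open("w"): whatever_write(path) succeeds, while a bare open(path, "w") fails fast at __fspath__ instead of silently writing only to the cache.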