pandas-path
pandas-path copied to clipboard
Using pandas-path on the right side of `/` with cloudpathlib
I'm trying to create a dataframe with two column, one that has the original filepath and one that has the destination filepath so that I can iterate over the dataframe and do something like
for row in df.itertuples():
row.original_filepath.copy(row.final_filepath)
These paths will be s3 paths.
However, when I go to set up the filepaths with chaining, the second slash in s3:// gets dropped, making the result an invalid s3path
> S3_DATA_DIR / "raw"
S3Path('s3://mybucket/data/raw')
> str(S3_DATA_DIR / "raw")
's3://mybucket/data/raw'
> str(S3_DATA_DIR / "raw") / df.audio_string.path
0 s3:/mybucket/data/r...
1 s3:/mybucket/data/r...
What is df.audio_string.path? Are both s3 paths and does this happen just using cloudpathlib paths?
This comment may be helpful in showing why that happens: https://github.com/drivendataorg/cloudpathlib/issues/469#issuecomment-2328966157
@ejm714 pandas-path doesn't support cloudpathlib (edit:) out-of-the-box, and the behavior you're observing is correct behavior for regular paths. When it converts the singleton string "s3://mybucket/data/raw", it resolves the repeated slash into a single one, which is expected behavior for Posix paths.
from pathlib import Path
Path("s3://mybucket/data/raw")
#> PosixPath('s3:/mybucket/data/raw')
Also, see here for cloupathlib support: https://github.com/drivendataorg/pandas-path?tab=readme-ov-file#custom-path-accessors
What is df.audio_string.path?
df.audio_string is a series with dtype object. I was trying to join an s3path with a series and following the instructions in limitation 3 because we have a path object on the left hand side of a join
The same behavior happens if S3_DATA_DIR is a CloudPath instead of S3Path
I don't see a way that the custom path accessor helps here but let me know if I'm missing something. I think a join with an s3 path on the left side is just an unsupported case
I think this is a cloudpathlib bug where we we should raise NotImplemented rather than TypeError so __rtruediv__ gets tried as well see Python docs for explanation.
import pandas as pd
from pandas_path import path
from cloudpathlib import CloudPath
CloudPath("s3://bucket/path") / pd.Series(["a", "b", "c"]).path
#> Traceback (most recent call last):
#> File "<string>", line 1, in <module>
#> File "/Users/bull/miniconda3/envs/sandbox/lib/python3.11/site-packages/cloudpathlib/cloudpath.py", line 891, in __truediv__
#> raise TypeError(f"Can only join path {repr(self)} with strings or posix paths.")
#> TypeError: Can only join path S3Path('s3://bucket/path') with strings or posix paths.
Ironic because the bug in Limitation 3 has been fixed for many years as of Python 3.8 and we made a mistake of a similar conceptual vein in cloudpathlib.
I would expect the below to work portably (if cloudpathlib was working correctly). @pjbull's example would still break on a Windows computer.
edited per below comment to use PurePosixPath instead of PosixPath
from pathlib import PurePosixPath
import pandas as pd
from pandas_path import path, register_path_accessor
from cloudpathlib import S3Path
register_path_accessor("pure_posix_path", PurePosixPath)
S3Path("s3://bucket/dir") / pd.Series(['a', 'b', 'c']).pure_posix_path
#> Traceback (most recent call last):
#> File "<string>", line 1, in <module>
#> File "/Users/jqi/Downloads/pp-test/.venv/lib/python3.12/site-packages/cloudpathlib/cloudpath.py", line 891, in __truediv__
#> raise TypeError(f"Can only join path {repr(self)} with strings or posix paths.")
#> TypeError: Can only join path S3Path('s3://bucket/dir') with strings or posix paths.
Good point. I believe that would also break on windows so should be PurePosixPath.