pandas-path icon indicating copy to clipboard operation
pandas-path copied to clipboard

Using pandas-path on the right side of `/` with cloudpathlib

Open ejm714 opened this issue 1 year ago • 8 comments

I'm trying to create a dataframe with two column, one that has the original filepath and one that has the destination filepath so that I can iterate over the dataframe and do something like

for row in df.itertuples():
   row.original_filepath.copy(row.final_filepath)

These paths will be s3 paths.

However, when I go to set up the filepaths with chaining, the second slash in s3:// gets dropped, making the result an invalid s3path

> S3_DATA_DIR / "raw" 
S3Path('s3://mybucket/data/raw')

> str(S3_DATA_DIR / "raw")
's3://mybucket/data/raw'

> str(S3_DATA_DIR / "raw") / df.audio_string.path
0        s3:/mybucket/data/r...
1        s3:/mybucket/data/r...

ejm714 avatar Oct 21 '24 19:10 ejm714

What is df.audio_string.path? Are both s3 paths and does this happen just using cloudpathlib paths?

This comment may be helpful in showing why that happens: https://github.com/drivendataorg/cloudpathlib/issues/469#issuecomment-2328966157

pjbull avatar Oct 21 '24 19:10 pjbull

@ejm714 pandas-path doesn't support cloudpathlib (edit:) out-of-the-box, and the behavior you're observing is correct behavior for regular paths. When it converts the singleton string "s3://mybucket/data/raw", it resolves the repeated slash into a single one, which is expected behavior for Posix paths.

from pathlib import Path

Path("s3://mybucket/data/raw")
#> PosixPath('s3:/mybucket/data/raw')

jayqi avatar Oct 21 '24 19:10 jayqi

Also, see here for cloupathlib support: https://github.com/drivendataorg/pandas-path?tab=readme-ov-file#custom-path-accessors

pjbull avatar Oct 21 '24 19:10 pjbull

What is df.audio_string.path?

df.audio_string is a series with dtype object. I was trying to join an s3path with a series and following the instructions in limitation 3 because we have a path object on the left hand side of a join

The same behavior happens if S3_DATA_DIR is a CloudPath instead of S3Path

I don't see a way that the custom path accessor helps here but let me know if I'm missing something. I think a join with an s3 path on the left side is just an unsupported case

ejm714 avatar Oct 21 '24 19:10 ejm714

I think this is a cloudpathlib bug where we we should raise NotImplemented rather than TypeError so __rtruediv__ gets tried as well see Python docs for explanation.

import pandas as pd
from pandas_path import path
from cloudpathlib import CloudPath

CloudPath("s3://bucket/path") / pd.Series(["a", "b", "c"]).path
#> Traceback (most recent call last):
#>   File "<string>", line 1, in <module>
#>   File "/Users/bull/miniconda3/envs/sandbox/lib/python3.11/site-packages/cloudpathlib/cloudpath.py", line 891, in __truediv__
#>     raise TypeError(f"Can only join path {repr(self)} with strings or posix paths.")
#> TypeError: Can only join path S3Path('s3://bucket/path') with strings or posix paths.

pjbull avatar Oct 21 '24 21:10 pjbull

Ironic because the bug in Limitation 3 has been fixed for many years as of Python 3.8 and we made a mistake of a similar conceptual vein in cloudpathlib.

jayqi avatar Oct 21 '24 21:10 jayqi

I would expect the below to work portably (if cloudpathlib was working correctly). @pjbull's example would still break on a Windows computer.

edited per below comment to use PurePosixPath instead of PosixPath

from pathlib import PurePosixPath

import pandas as pd
from pandas_path import path, register_path_accessor
from cloudpathlib import S3Path

register_path_accessor("pure_posix_path", PurePosixPath)

S3Path("s3://bucket/dir") / pd.Series(['a', 'b', 'c']).pure_posix_path
#> Traceback (most recent call last):
#>   File "<string>", line 1, in <module>
#>   File "/Users/jqi/Downloads/pp-test/.venv/lib/python3.12/site-packages/cloudpathlib/cloudpath.py", line 891, in __truediv__
#>     raise TypeError(f"Can only join path {repr(self)} with strings or posix paths.")
#> TypeError: Can only join path S3Path('s3://bucket/dir') with strings or posix paths.

jayqi avatar Oct 21 '24 21:10 jayqi

Good point. I believe that would also break on windows so should be PurePosixPath.

pjbull avatar Oct 21 '24 22:10 pjbull