
Cannot write parquet in an S3 path that includes white spaces when using ray


When using the ray engine, writing a parquet file to an S3 path that contains spaces fails with an error, e.g. ArrowInvalid: Expected a local filesystem path, got a URI: 's3://mybucket/path with space/table/a=a/'. This is particularly annoying because the same error occurs when a partition column contains values with spaces.
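The error message suggests the raw path is being handed to a URI parser, and a space is not a valid URI character. A minimal stdlib sketch (the bucket and path here are the hypothetical ones from the report) shows what a well-formed, percent-encoded version of such a path looks like:

```python
from urllib.parse import quote, urlparse

# Raw S3 path containing a space -- not a valid URI as-is.
raw = "s3://mybucket/path with space/table/a=a/"

# Percent-encode only the path portion; keep '/' as the separator
# and '=' so the hive-style partition key (a=a) stays readable.
parsed = urlparse(raw)
encoded = f"{parsed.scheme}://{parsed.netloc}{quote(parsed.path, safe='/=')}"
print(encoded)  # s3://mybucket/path%20with%20space/table/a=a/
```

This is only an illustration of why the unencoded path is rejected; it is not a suggested call path inside awswrangler.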

How to Reproduce

import pandas as pd

import awswrangler as wr

wr.engine.set("ray")

data = pd.DataFrame({"a": ["a", "b", "c"], "b": ["a - 1", "b - 1", "c - 1"]})

wr.s3.to_parquet(
    df=data,
    path="s3://mybucket/path with space/table",
    dataset=True,
    mode="overwrite_partitions",
    partition_cols=["a"],
    database="mydb",
    table="test_arrow",
)
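Until this is fixed upstream, one possible workaround is to strip or replace whitespace in partition-column values before writing, so the generated S3 keys contain no spaces. The helper below is a hypothetical sketch, not part of the awswrangler API:

```python
import pandas as pd


def sanitize_partition_values(df, partition_cols, replacement="_"):
    """Replace spaces in partition-column values so generated S3 keys
    (e.g. .../b=a - 1/) contain no whitespace. Workaround sketch only."""
    out = df.copy()
    for col in partition_cols:
        out[col] = out[col].astype(str).str.replace(" ", replacement)
    return out


data = pd.DataFrame({"a": ["a", "b", "c"], "b": ["a - 1", "b - 1", "c - 1"]})
clean = sanitize_partition_values(data, partition_cols=["b"])
print(clean["b"].tolist())  # ['a_-_1', 'b_-_1', 'c_-_1']
```

Note this changes the stored partition values, so it only helps when the exact original strings are not required downstream; it also does nothing about spaces in the base path itself.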

Expected behavior

The write should be successful, as with the python engine.

OS

Windows

Python version

3.10

AWS SDK for pandas version

3.3.0

Additional context

pyarrow 13.0.0, ray 2.3.0

andreaschiappacasse (Apr 23 '24 16:04)

This looks like a problem with PyArrow's filesystem resolution. I have opened https://github.com/apache/arrow/issues/41365 with them.

jaidisido (Apr 24 '24 09:04)

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.

github-actions[bot] (Jun 23 '24 12:06)