aws-sdk-pandas
Cannot write parquet in an S3 path that includes white spaces when using ray
When using the ray engine, writing a parquet file to an S3 path that includes spaces fails with an error, e.g.:
ArrowInvalid: Expected a local filesystem path, got a URI: 's3://mybucket/path with space/table/a=a/'
This is particularly annoying since the same error occurs when a partition column contains values with spaces.
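As a rough illustration of why spaces in the path are problematic (this is not awswrangler code): RFC 3986 does not allow unencoded spaces in URIs, so URI-based filesystem resolution can reject such a path. Percent-encoding with the standard library shows what a valid form would look like:

```python
from urllib.parse import quote

# The path from the error message above; the space makes it an invalid URI.
raw = "s3://mybucket/path with space/table/a=a/"

# Percent-encode everything except the scheme/path delimiters and '='.
encoded = quote(raw, safe=":/=")
print(encoded)  # s3://mybucket/path%20with%20space/table/a=a/
```

Whether the fix belongs in awswrangler, ray, or PyArrow's URI handling is exactly what is under discussion below.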
How to Reproduce
import awswrangler as wr
wr.engine.set("ray")
import pandas as pd
data = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['a - 1', 'b - 1', 'c - 1']}, columns=['a', 'b'])
wr.s3.to_parquet(
    df=data,
    path="s3://mybucket/path with space/table",
    dataset=True,
    mode="overwrite_partitions",
    partition_cols=['a'],
    database='mydb',
    table="test_arrow",
)
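For the partition-value case, one possible interim workaround is to replace spaces in the partition column before writing, so the generated `a=.../` prefixes contain none. `sanitize_partition_value` below is a hypothetical helper for illustration, not part of the awswrangler API:

```python
import pandas as pd

def sanitize_partition_value(value: str) -> str:
    """Replace spaces so the resulting S3 partition path contains none."""
    return value.replace(" ", "_")

data = pd.DataFrame({'a': ['a 1', 'b 2'], 'b': [1, 2]})
# Sanitize the partition column before passing the frame to wr.s3.to_parquet.
data['a'] = data['a'].map(sanitize_partition_value)
print(data['a'].tolist())  # ['a_1', 'b_2']
```

This changes the stored partition values, so it is only acceptable when the original spacing does not need to round-trip.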
Expected behavior
The write should succeed, as it does with the python engine.
OS
Windows
Python version
3.10
AWS SDK for pandas version
3.3.0
Additional context
pyarrow 13.0.0, ray 2.3.0
This seems to be a problem with PyArrow's filesystem resolution. I have opened https://github.com/apache/arrow/issues/41365 with them.
Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.