fastparquet
TypeError when writing to S3 with partition_cols
The issue can be reproduced as follows:
import pandas as pd
df = pd.DataFrame([
[1, 'DE', 2.3],
[2, 'BE', 4.5],
[3, 'DE', 7.6],
[4, 'DE', 4.8]
], columns=['id', 'country', 'value'])
df.to_parquet('s3://<my-s3-bucket>/<my-directory>', compression='gzip', index=False, engine='fastparquet', partition_cols=['country'])
When doing the same write operation without the partition_cols argument, it works fine. The error stacktrace is the following:
Traceback (most recent call last):
File "<my-python-file>.py", line 20, in <module>
engine='fastparquet')
File "pandas/util/_decorators.py", line 214, in wrapper
return func(*args, **kwargs)
File "pandas/core/frame.py", line 2116, in to_parquet
**kwargs,
File "pandas/io/parquet.py", line 264, in to_parquet
**kwargs,
File "pandas/io/parquet.py", line 185, in write
**kwargs,
File "fastparquet/writer.py", line 895, in write
fn = join_path(filename, '_metadata')
File "fastparquet/util.py", line 330, in join_path
if path[0][0] == '/':
TypeError: 'S3File' object is not subscriptable
The code assumes that path[0] is a string, but it is an S3File object. For an S3File object, the path string can be accessed through .path, so the check should look like path[0].path[0].
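To illustrate the suggestion above, here is a minimal sketch of a check that accepts either a string or an object exposing .path. It is not the actual fastparquet code: FakeS3File, as_path_str and has_leading_slash are hypothetical names, and the only assumption about S3File is the .path attribute mentioned above.

```python
class FakeS3File:
    """Hypothetical stand-in for s3fs's S3File; only .path is modelled."""

    def __init__(self, path):
        self.path = path


def as_path_str(obj):
    """Return a plain path string for a str or for an object exposing .path."""
    if isinstance(obj, str):
        return obj
    path = getattr(obj, "path", None)
    if isinstance(path, str):
        return path
    raise TypeError(
        "expected a path string or an object with .path, got "
        + type(obj).__name__
    )


def has_leading_slash(first):
    # Same intent as the failing `path[0][0] == '/'` check, but the
    # argument is normalized to a string before it is inspected.
    return as_path_str(first).startswith("/")
```

With this kind of normalization, has_leading_slash(FakeS3File("bucket/dir")) evaluates the check on the underlying path string instead of raising a TypeError.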
The package versions are:
fastparquet==0.4.0
packaging==20.3
pandas==1.0.1
s3fs==0.4.2
Can you please cross-post on pandas? fastparquet certainly does handle doing this, so apparently the call is being made incorrectly, but I'm not sure exactly how.
(cc https://github.com/pandas-dev/pandas/issues/33452 )
So pandas seems to assume that the first argument to the api.write function can be either a path or a buffer. In the case of an S3 file, it passes an S3File object (a buffer), not the file path string. Here is the function that does this (https://github.com/pandas-dev/pandas/blob/master/pandas/io/s3.py#L23). I think this behavior is intended though.
However, the write function in fastparquet expects a filename (https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py#L764). The write_simple function works fine with both a file path and a file object (https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py#L735). But the rest of the logic in the write function relies on the argument being a string.
Ideally, I suppose pandas should pass write an argument that is always the same type of object with the same interface (so even when it's just a string, it should be wrapped by some class). This way the write function in fastparquet would not have to handle paths and buffers differently. I assume a change like this in pandas would likely break other parts of that code, since the get_filepath_or_buffer function is used quite a lot in pandas (https://github.com/pandas-dev/pandas/search?p=1&q=get_filepath_or_buffer&unscoped_q=get_filepath_or_buffer).
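As a rough sketch of the "always the same type of object" idea, a wrapper could look something like the following. WriteTarget, wrap and child are hypothetical names invented for illustration; neither pandas nor fastparquet provides this class, and the .path lookup assumes the buffer exposes its location the way S3File does.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class WriteTarget:
    """Hypothetical uniform wrapper around a path string or an open buffer."""

    path: Optional[str]
    buffer: Optional[Any] = None

    @classmethod
    def wrap(cls, path_or_buffer):
        # Plain strings are wrapped as-is; for buffers, keep the object and
        # record its .path if it has one (None otherwise).
        if isinstance(path_or_buffer, str):
            return cls(path=path_or_buffer)
        return cls(
            path=getattr(path_or_buffer, "path", None),
            buffer=path_or_buffer,
        )

    def child(self, name):
        # Build the location of a file inside the target directory,
        # for example the '_metadata' file of a partitioned dataset.
        if self.path is None:
            raise ValueError("target has no path; cannot derive child files")
        return self.path.rstrip("/") + "/" + name
```

A writer that receives a WriteTarget could then call target.child('_metadata') regardless of whether the caller started from a string or from a buffer, and fail with a clear error when the buffer carries no path at all.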
I believe this should now be fixed in at least pandas master (and probably in a release too).
@martindurant cool, thanks. I will check, and if it's fixed I'll close the issue.