
TypeError when writing to S3 with partition_cols

Open tammymendt opened this issue 4 years ago • 4 comments

The issue can be reproduced as follows:

import pandas as pd

df = pd.DataFrame([
    [1, 'DE', 2.3],
    [2, 'BE', 4.5],
    [3, 'DE', 7.6],
    [4, 'DE', 4.8]
], columns=['id', 'country', 'value'])

df.to_parquet('s3://<my-s3-bucket>/<my-directory>', compression='gzip', index=False, engine='fastparquet', partition_cols=['country'])

The same write operation works fine without the partition_cols argument. The stack trace is:

Traceback (most recent call last):
  File "<my-python-file>.py", line 20, in <module>
    engine='fastparquet')
  File "pandas/util/_decorators.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "pandas/core/frame.py", line 2116, in to_parquet
    **kwargs,
  File "pandas/io/parquet.py", line 264, in to_parquet
    **kwargs,
  File "pandas/io/parquet.py", line 185, in write
    **kwargs,
  File "fastparquet/writer.py", line 895, in write
    fn = join_path(filename, '_metadata')
  File "fastparquet/util.py", line 330, in join_path
    if path[0][0] == '/':
TypeError: 'S3File' object is not subscriptable

The code assumes that path[0] is a string, but here it is an S3File object. For an S3File object, the path string can be accessed through .path, so the check should look like this: path[0].path[0].
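The failure mode can be reproduced without S3 at all. The following is only a minimal sketch: FakeS3File is a hypothetical stand-in for the real S3File, and starts_with_slash is an illustrative helper, not fastparquet's join_path. It assumes, as described above, that the file object exposes its path string as .path.

```python
class FakeS3File:
    """Hypothetical stand-in for an S3File: it carries a .path string
    but, like any plain object, does not support indexing."""

    def __init__(self, path):
        self.path = path


def starts_with_slash(target):
    """Check the leading character of a str path or of an object's .path."""
    # Assumption: file-like objects expose the path string as `.path`.
    path_str = target if isinstance(target, str) else target.path
    return path_str[0] == "/"


buffer = FakeS3File("my-bucket/my-directory")

try:
    buffer[0]  # what path[0][0] amounts to when path[0] is a file object
except TypeError as exc:
    error_message = str(exc)

is_absolute_buffer = starts_with_slash(buffer)
is_absolute_string = starts_with_slash("/tmp/out.parq")
```

Indexing the object directly raises the same kind of TypeError as in the traceback, while reading .path first lets the string check go through for both kinds of input.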

The package versions are:

fastparquet==0.4.0
packaging==20.3
pandas==1.0.1
s3fs==0.4.2

tammymendt avatar May 18 '20 13:05 tammymendt

Can you please cross-post on pandas? fastparquet certainly does handle doing this, so apparently the call is being made incorrectly, but I'm not sure exactly how.

(cc https://github.com/pandas-dev/pandas/issues/33452 )

martindurant avatar May 18 '20 18:05 martindurant

So pandas seems to assume that the first argument to the api.write function can either be a path or a buffer. In the case of an S3 file, it passes an S3File object (buffer), not the string of the filepath. Here is the function that does this (https://github.com/pandas-dev/pandas/blob/master/pandas/io/s3.py#L23). I think this behavior is intended though.

However, the write function in fastparquet expects a filename (https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py#L764). The write_simple function works fine with both a filepath and a File object (https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py#L735). But the rest of the logic in the write function relies on the argument being a string.

Ideally, I suppose pandas should always pass write an argument of the same type with the same interface (so even when it's just a string, it would be wrapped in some class). That way the write function in fastparquet would not have to handle paths and buffers differently. I assume a change like this would likely break other parts of pandas, since the get_filepath_or_buffer function is used quite a lot there (https://github.com/pandas-dev/pandas/search?p=1&q=get_filepath_or_buffer&unscoped_q=get_filepath_or_buffer).
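To make the wrapping idea concrete, here is a rough sketch of what such a uniform argument could look like. WriteTarget, as_write_target and metadata_path are hypothetical names made up for illustration; nothing like this exists in pandas or fastparquet as far as this thread shows, and the sketch assumes buffers expose their location as .path.

```python
from dataclasses import dataclass
from typing import IO, Optional


@dataclass
class WriteTarget:
    """Hypothetical uniform argument: always has a string path,
    optionally an already-open file handle."""
    path: str
    handle: Optional[IO] = None


def as_write_target(path_or_buffer):
    """Wrap either a str path or a file-like object with a .path attribute."""
    if isinstance(path_or_buffer, str):
        return WriteTarget(path=path_or_buffer)
    # Assumption: buffers expose their location as `.path`, as S3File does.
    return WriteTarget(path=path_or_buffer.path, handle=path_or_buffer)


def metadata_path(target):
    """String-only logic, e.g. locating the _metadata file of a dataset."""
    return target.path.rstrip("/") + "/_metadata"
```

With a wrapper like this, code that only needs the path string (such as building the _metadata location) never has to know whether the caller started from a string or from an open file object.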

tammymendt avatar May 19 '20 08:05 tammymendt

I believe this should now be fixed in at least pandas master (but probably released too).

martindurant avatar Sep 08 '20 13:09 martindurant

@martindurant cool, thanks. I will check, and if it's fixed I'll close the issue.

tammymendt avatar Sep 16 '20 07:09 tammymendt