fastparquet

write to buffer support

Open strongbugman opened this issue 1 year ago • 2 comments

For some reason, I need to write parquet file content to a buffer (io.BytesIO), but it seems this package always closes the file object after writing. For example:

data = [{"x": 1, "y": 2}, {"x": 2, "y": 3}]
df = pd.DataFrame.from_records(data)
buffer = io.BytesIO()

df.to_parquet(buffer, engine="fastparquet")
print(buffer.closed)
buffer.getvalue() 

The result looks like this:

True
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [46], in <cell line: 7>()
      5 df.to_parquet(buffer, engine="fastparquet")
      6 print(buffer.closed)
----> 7 buffer.getvalue()

ValueError: I/O operation on closed file.

I think it would be better if fastparquet offered an option to control when the file object is closed.

For anyone who runs into this situation: you can use another package, such as pyarrow.

strongbugman avatar Jun 06 '23 01:06 strongbugman

I agree, it is reasonable that fastparquet should not close a file-like object that has been passed in. It should not be hard to code. Would you like to have a go?
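A minimal sketch of that idea, not fastparquet's actual code: a writer closes only the handles it opened itself and leaves caller-supplied file objects open. The `write_payload` helper and its parameters are hypothetical names used purely for illustration.

```python
import io
import os


def write_payload(target, payload: bytes) -> None:
    """Write payload to a path or to an already-open binary file object.

    Hypothetical helper for illustration; not part of fastparquet.
    """
    if isinstance(target, (str, os.PathLike)):
        # We opened the file here, so we are responsible for closing it.
        with open(target, "wb") as f:
            f.write(payload)
    else:
        # The caller owns this object: write and flush, but never close it.
        target.write(payload)
        target.flush()


buffer = io.BytesIO()
write_payload(buffer, b"PAR1")
print(buffer.closed)      # False
print(buffer.getvalue())  # b'PAR1'
```

The design choice is ownership: whoever opens a handle closes it, so passing a buffer never invalidates it for the caller.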

On the other hand, it's also pretty easy to make a file-like object that cannot be closed:

import io

class UnclosableBytesIO(io.BytesIO):
    def close(self):
        # No-op, so the buffer stays readable after the writer calls close()
        pass

buffer = UnclosableBytesIO()
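To show the effect without depending on pandas or fastparquet, here is a small self-contained sketch. The `writer_that_closes` function is a hypothetical stand-in for any library call that closes the file object it was given; it is not a fastparquet API.

```python
import io


class UnclosableBytesIO(io.BytesIO):
    """BytesIO whose close() is a no-op, so the contents stay readable."""

    def close(self):
        pass


def writer_that_closes(f):
    # Hypothetical stand-in for a library call that closes its target.
    f.write(b"PAR1")
    f.close()


plain = io.BytesIO()
writer_that_closes(plain)
print(plain.closed)       # True

buffer = UnclosableBytesIO()
writer_that_closes(buffer)
print(buffer.closed)      # False
print(buffer.getvalue())  # b'PAR1'

# When finished, the underlying memory can still be released explicitly
# by calling the base-class method directly.
io.BytesIO.close(buffer)
print(buffer.closed)      # True
```

With pandas, the same object would be passed as `df.to_parquet(buffer, engine="fastparquet")`, after which `buffer.getvalue()` should remain usable; that call is not run here.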

martindurant avatar Jun 08 '23 17:06 martindurant

This indeed seems like inconsistent behavior across engines:

>>> import io
>>> import pandas as pd
>>>
>>> data = [{"x": 1, "y": 2}, {"x": 2, "y": 3}]
>>> df = pd.DataFrame.from_records(data)
>>> 
>>> buffer = io.BytesIO()
>>> df.to_parquet(buffer, engine="fastparquet")
>>> print(buffer.closed)
True
>>> buffer.getvalue() 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file.

vs

>>> buffer2 = io.BytesIO()
>>> df.to_parquet(buffer2) # using the default pyarrow
>>> print(buffer2.closed)
False
>>> buffer2.getvalue() 
b'PAR1\x15\x04\x15 ...

Thanks for the workaround @martindurant.

avyfain avatar Oct 01 '23 19:10 avyfain