fastparquet
write to buffer support
I need to write Parquet file content to an in-memory buffer (io.BytesIO), but it seems that this package always closes the file object after writing. For example:
import io
import pandas as pd

data = [{"x": 1, "y": 2}, {"x": 2, "y": 3}]
df = pd.DataFrame.from_records(data)
buffer = io.BytesIO()
df.to_parquet(buffer, engine="fastparquet")
print(buffer.closed)
buffer.getvalue()
The output looks like this:
True
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [46], in <cell line: 7>()
5 df.to_parquet(buffer, engine="fastparquet")
6 print(buffer.closed)
----> 7 buffer.getvalue()
ValueError: I/O operation on closed file.
I think it would be better if fastparquet offered an option to control when the file object is closed.
For anyone who runs into this situation: you can use another engine, such as pyarrow.
I agree, it is reasonable that fastparquet should not close a file-like object that has been passed in. It should not be hard to code - would you like to have a go?
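As a rough illustration of that idea, here is a minimal sketch of the usual ownership rule: close a handle only if the function opened it itself. The name `write_payload` is hypothetical and this is not fastparquet's actual code; it only shows the pattern using the standard library.

```python
import io
import os


def write_payload(target, payload: bytes) -> None:
    """Write payload to a path or to an already-open binary file object.

    Hypothetical helper: illustrates the ownership rule only.
    """
    # A handle is ours to close only if we opened it from a path.
    owns_handle = isinstance(target, (str, os.PathLike))
    f = open(target, "wb") if owns_handle else target
    try:
        f.write(payload)
    finally:
        if owns_handle:
            f.close()


buffer = io.BytesIO()
write_payload(buffer, b"PAR1")
print(buffer.closed)      # False: the caller's buffer is left open
print(buffer.getvalue())  # b'PAR1'
```

With this rule, a caller who passes a path gets the file closed for them, while a caller who passes a buffer keeps control of its lifetime.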
On the other hand, it's also pretty easy to make a file-like object that cannot be closed:
class UnclosableBytesIO(io.BytesIO):
    def close(self):
        pass

buffer = UnclosableBytesIO()
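To see why this workaround helps, here is a small self-contained sketch using only the standard library. The function `write_and_close` is a hypothetical stand-in for any writer that always closes its target; it is not a fastparquet function.

```python
import io


class UnclosableBytesIO(io.BytesIO):
    """BytesIO whose close() is a no-op, so a writer cannot invalidate it."""

    def close(self):
        pass


def write_and_close(f, payload: bytes) -> None:
    # Hypothetical stand-in for a writer that always closes its target.
    f.write(payload)
    f.close()


plain = io.BytesIO()
write_and_close(plain, b"PAR1")
print(plain.closed)     # True: getvalue() would now raise ValueError

safe = UnclosableBytesIO()
write_and_close(safe, b"PAR1")
print(safe.closed)      # False: close() did nothing
print(safe.getvalue())  # b'PAR1'
```

Because `close()` is overridden, the buffer stays open until it is garbage collected; calling `io.BytesIO.close(safe)` explicitly releases it once the contents have been read.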
This indeed seems like inconsistent behavior across engines:
>>> import io
>>> import pandas as pd
>>>
>>> data = [{"x": 1, "y": 2}, {"x": 2, "y": 3}]
>>> df = pd.DataFrame.from_records(data)
>>>
>>> buffer = io.BytesIO()
>>> df.to_parquet(buffer, engine="fastparquet")
>>> print(buffer.closed)
True
>>> buffer.getvalue()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file.
vs
>>> buffer2 = io.BytesIO()
>>> df.to_parquet(buffer2) # using the default pyarrow
>>> print(buffer2.closed)
False
>>> buffer2.getvalue()
b'PAR1\x15\x04\x15 ...
Thanks for the workaround @martindurant.