python: Implement more "file-like" methods?
While integrating the Apache OpenDAL Python bindings with PyArrow to read Parquet files, I ran into a limitation: the OpenDAL File object doesn't implement all of the expected "file-like" methods.
To work around this, I had to create a custom wrapper class that implements the missing methods. Here is a simplified example of the workaround:
import io

import opendal


class FileWrapper(io.IOBase):
    def __init__(self, file: opendal.File) -> None:
        self._file = file

    def read(self, __size: int | None = None) -> bytes:
        return self._file.read(__size)

    def seek(self, offset: int, whence: int = 0) -> int:
        return self._file.seek(offset, whence)

    def readable(self) -> bool:  # missing method
        return True
And its usage with PyArrow:
import pathlib

import pyarrow.parquet

op = opendal.Operator(
    "fs",
    root=str(pathlib.Path(__file__).parent / "data"),
)
with op.open("example.parquet", mode="rb") as fp:
    table = pyarrow.parquet.read_table(FileWrapper(fp))
    print(table)
Without the wrapper, I get a ValueError: I/O operation on closed file.
I put quotation marks around "file-like" because I couldn't find a very clear and precise definition of what a "file-like" object should look like (even less clear for async context). In this particular case, I tried to go with io.IOBase abstract base class for all I/O classes and implement any remaining methods I may need.
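To make the role of io.IOBase a bit more concrete, here is a minimal sketch of the defaults a bare subclass inherits. The class name BareFile is hypothetical and used only for illustration; the point is that readable(), writable() and seekable() all report False until they are overridden, which is why the wrapper above has to override readable().

```python
import io


class BareFile(io.IOBase):
    """Hypothetical subclass that overrides nothing, to inspect the defaults."""


bare = BareFile()

# Capability probes inherited from io.IOBase all default to False.
defaults = {
    "readable": bare.readable(),
    "writable": bare.writable(),
    "seekable": bare.seekable(),
}

# The `closed` attribute is provided by io.IOBase as well.
is_closed_before = bare.closed
bare.close()
is_closed_after = bare.closed
```

Consumers that probe these methods before reading or writing will see whatever the wrapper reports, so each capability the wrapped file actually supports needs an explicit override.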
- Reference Discord thread: https://discord.com/channels/1081052318650339399/1081052319342407715/threads/1211616322857734154
Thanks for the proposal. In fact, I was going to open a similar issue before I saw this :)
More specifically, I'm trying to integrate opendal with polars/pandas. With some effort, we have been able to create a DataFrame from an opendal.File. Writing a DataFrame to a File, on the other hand, was trickier than I thought.
Take pandas as an example:
- The tell method is called before write, but opendal.File's tell method only works in rb mode. This conflicts with the write method, which requires wb mode.
- It requires the flush method and the closed property on the file-like object.
> The tell method is called before write, but opendal.File's tell method only works in rb mode. This conflicts with the write method, which requires wb mode.
It's worth noting that opendal doesn't yet support seek on the writer, although I'm considering adding basic support for it.
For pandas, does it perform random writes, such as seeking to position A, writing, then seeking to position B? Or does it simply write continuously and occasionally call tell to find out how much data has been written?
> It requires the flush method and the closed property on the file-like object.
This should work for now.
> For pandas, does it perform random writes, such as seeking to position A, writing, then seeking to position B? Or does it simply write continuously and occasionally call tell to find out how much data has been written?

I took a deeper look at the related code. For write operations, it expects the file-like object to have a seekable method. If the method is missing, it defaults to seekable = True and gets the position via tell when initialized. The csv format is really just a continuous write, so it doesn't require seek. For more complex data formats, I think random writes cannot be ruled out, so it's best to support seek on the writer in the future.
I added a seekable method that always returns False and a forwarding write method to the FileWrapper class proposed in this issue's description, and I can confirm the DataFrame can indeed be written to a new csv file opened via opendal.
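A rough sketch of what that write-side wrapper might look like is shown below. The names WritableFileWrapper and InMemorySink are hypothetical: InMemorySink is only a stand-in so the example runs without opendal installed, and in real use the wrapped object would be the file returned by op.open(..., mode="wb"). The writable override is an addition of this sketch, not something mentioned in the comment above.

```python
import io


class WritableFileWrapper(io.IOBase):
    """Hypothetical write-side counterpart of the FileWrapper in the description."""

    def __init__(self, file) -> None:
        # `file` is any object exposing write(bytes),
        # for example an opendal.File opened with mode="wb" (assumption).
        self._file = file

    def writable(self) -> bool:
        # Added for this sketch so that callers probing writable() see True.
        return True

    def seekable(self) -> bool:
        # Always False, so callers have no reason to probe the position via tell().
        return False

    def write(self, data: bytes):
        # Forward the call unchanged to the wrapped file.
        return self._file.write(data)


class InMemorySink:
    """Stand-in for the wrapped file; collects every chunk passed to write()."""

    def __init__(self) -> None:
        self.chunks = []

    def write(self, data: bytes) -> None:
        self.chunks.append(bytes(data))


sink = InMemorySink()
wrapper = WritableFileWrapper(sink)
wrapper.write(b"a,b\n1,2\n")
```

Because the wrapper reports seekable() as False, this only suits formats that are written sequentially, such as csv.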
> This should work for now.

Are you saying that both of these are already supported in the current code base? Or that we can support them now?
> If it is some more complex data format, the possibility of random writes is not excluded, I think. So it's best to support seek on writer in the future.

Yes, there are some issues with services that don't support seeking while writing, such as S3.
My understanding is that we can implement seek, but only in very limited scenarios, and support may vary across platforms.
> So it's best to support seek on writer in the future.

Yep, we can implement it for fs at least.
> I can confirm the DataFrame can indeed be written to a new csv file opened via opendal.

Good to know, let's make it work first.
> Are you saying that both of these are supported in the current code base? Or we can support them now.

Yep, the flush and close APIs are already available but not yet implemented.
Oh, I just thought of another problem: our implementation of File::write does not seem to comply with the requirements in https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects. It needs to explicitly return the number of bytes written.
Update: Fortunately, this is the only change we need to make to support writing the polars DataFrame to opendal.
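Until write returns a count natively, the contract can be approximated on the wrapper side. The following is only a sketch under stated assumptions: CountingWriteWrapper and NoneReturningSink are hypothetical names, the sink stands in for a file whose write returns None, and the fallback assumes the whole buffer was written whenever no count is reported.

```python
import io


class CountingWriteWrapper(io.IOBase):
    """Hypothetical wrapper that makes write() always return a byte count."""

    def __init__(self, file) -> None:
        # `file` is any object exposing write(bytes) (assumption).
        self._file = file

    def writable(self) -> bool:
        return True

    def write(self, data: bytes) -> int:
        written = self._file.write(data)
        # Assumption: a None result means the whole buffer was accepted.
        return len(data) if written is None else written


class NoneReturningSink:
    """Stand-in for a file whose write() returns None."""

    def __init__(self) -> None:
        self.buffer = bytearray()

    def write(self, data: bytes) -> None:
        self.buffer.extend(data)


sink = NoneReturningSink()
count = CountingWriteWrapper(sink).write(b"hello")

# For comparison, the standard library's in-memory binary stream returns the count itself.
reference_count = io.BytesIO().write(b"hello")
```

This keeps callers that rely on the return value working, at the cost of hiding partial writes if the wrapped file ever performs them silently.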
Thank you very much!
File has been implemented, feel free to raise new issues.