
python: Implement more "file-like" methods?

Open 3ok opened this issue 1 year ago • 6 comments

While integrating the Apache OpenDAL Python bindings with PyArrow to read Parquet files, I ran into a limitation: the OpenDAL File object does not implement all of the expected "file-like" methods.

To work around this, I had to create a custom wrapper class that implements the missing methods. Here is a simplified example of the workaround:

import io
import opendal

class FileWrapper(io.IOBase):
    def __init__(self, file: opendal.File) -> None:
        self._file = file
    
    def read(self, __size: int | None = None) -> bytes:
        return self._file.read(__size)
    
    def seek(self, offset: int, whence: int = 0) -> int:
        return self._file.seek(offset, whence)
    
    def readable(self) -> bool:  # missing method
        return True

and its usage within PyArrow

import pathlib

import opendal
import pyarrow.parquet

op = opendal.Operator(
    "fs",
    root=str(pathlib.Path(__file__).parent / "data"),
)

with op.open("example.parquet", mode="rb") as fp:
    table = pyarrow.parquet.read_table(FileWrapper(fp))
    print(table)

Without the wrapper, I get a ValueError: I/O operation on closed file.

I put quotation marks around "file-like" because I couldn't find a very clear and precise definition of what a "file-like" object should look like (even less clear for async context). In this particular case, I tried to go with io.IOBase abstract base class for all I/O classes and implement any remaining methods I may need.
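Since there is no single definition of "file-like", one rough way to see what a consumer might find missing is to compare an object against a list of commonly used names. The sketch below is only illustrative: the `EXPECTED` tuple is my own selection of `io.IOBase`-style names, not an official contract, and `ReadOnlyFile` is a hypothetical stand-in rather than `opendal.File`.

```python
import io

# Illustrative selection of attribute names; NOT an official "file-like" contract.
EXPECTED = (
    "read", "write", "seek", "tell", "flush", "close",
    "readable", "writable", "seekable", "closed",
)


def missing_file_attrs(obj, expected=EXPECTED):
    """Return the names from `expected` that `obj` does not provide."""
    return [name for name in expected if not hasattr(obj, name)]


class ReadOnlyFile:
    """Hypothetical stand-in that only offers read and seek."""

    def read(self, size=-1):
        return b""

    def seek(self, offset, whence=0):
        return 0


# io.BytesIO provides every name in EXPECTED, the stand-in does not.
print(missing_file_attrs(io.BytesIO()))
print(missing_file_attrs(ReadOnlyFile()))
```

Running the same check against a real `opendal.File` would show which of these names a wrapper still has to supply.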

  • Reference Discord thread: https://discord.com/channels/1081052318650339399/1081052319342407715/threads/1211616322857734154

3ok avatar Mar 14 '24 20:03 3ok

Thanks for the proposal. In fact, I was going to open a similar issue before I saw this :)

More specifically, I'm working on integrating opendal with polars/pandas. With some effort, we have been able to create a DataFrame from opendal.File. Writing a DataFrame to a File, on the other hand, was trickier than I thought.

Take pandas as an example:

  • The tell method is called before write, but opendal.File's tell method only works in rb mode. This is in conflict with the write method, which does require the wb mode.
  • It requires the flush method and the closed property on the file-like object.

reswqa avatar Mar 15 '24 04:03 reswqa

  • The tell method is called before write, but opendal.File's tell method only works in rb mode. This is in conflict with the write method, which does require the wb mode.

It's significant that opendal doesn't yet support seek on writer, although I'm considering adding basic support for it.

For pandas, does it perform random writes such as seeking position A, writing, then seeking position B? Or does it simply write continuously and occasionally call tell to know how much data has been written?
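If the second case holds (continuous writes plus an occasional tell), the position can be tracked without any seek support on the writer. This is a minimal sketch under that assumption. `PositionTrackingWriter` is a hypothetical name, `io.BytesIO` stands in for the underlying writer, and the sketch assumes every write call consumes the whole buffer.

```python
import io


class PositionTrackingWriter:
    """Append-only writer that answers tell() by counting written bytes."""

    def __init__(self, raw):
        self._raw = raw   # any object with a write(bytes) method
        self._pos = 0     # bytes written through this wrapper so far

    def write(self, data: bytes) -> int:
        # Assumption: the underlying writer accepts the whole buffer.
        self._raw.write(data)
        self._pos += len(data)
        return len(data)

    def tell(self) -> int:
        return self._pos

    def seekable(self) -> bool:
        # Only sequential writes are supported.
        return False


sink = io.BytesIO()  # stand-in for a file opened in write mode
writer = PositionTrackingWriter(sink)
writer.write(b"a,b\n")
writer.write(b"1,2\n")
print(writer.tell())
```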

It requires the flush method and the closed property on the file-like object.

This should work for now.

Xuanwo avatar Mar 15 '24 04:03 Xuanwo

For pandas, does it perform random writes such as seeking position A, writing, then seeking position B? Or does it simply write continuously and occasionally call tell to know how much data has been written?

I took a deeper look at the related code. For write operations, it expects the file-like object to have a seekable method. If it doesn't, it defaults to seekable = True and gets the position via tell during initialization. The csv format is really just a continuous write, so it doesn't require seek. If it is some more complex data format, the possibility of random writes is not excluded, I think. So it's best to support seek on writer in the future.

I added a seekable method that always returns False and a forwarding write method to the FileWrapper class proposed in this issue's description, and I can confirm the DataFrame can indeed be written to a new csv file opened via opendal.
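For reference, the change described above might look roughly like the sketch below. It is not the exact code I ran: `WritableFileWrapper` is an illustrative name, and `io.BytesIO` stands in for an `opendal.File` opened in wb mode so the snippet runs without opendal installed.

```python
import io


class WritableFileWrapper(io.IOBase):
    """Write-side variant of FileWrapper: forwards write, reports non-seekable."""

    def __init__(self, file) -> None:
        self._file = file  # in real use, an opendal.File opened with mode="wb"

    def write(self, data: bytes):
        # Forward to the wrapped file and pass its return value through.
        return self._file.write(data)

    def writable(self) -> bool:
        return True

    def seekable(self) -> bool:
        # Always False, so callers that check seekable() skip tell/seek.
        return False


sink = io.BytesIO()  # stand-in for the opendal file
wrapper = WritableFileWrapper(sink)
wrapper.write(b"col_a,col_b\n1,2\n")
print(sink.getvalue())
```

With pandas, the wrapper would be passed where a path or buffer is expected, for example as the first argument of `DataFrame.to_csv` with a binary `mode`.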

This should work for now.

Are you saying that both of these are supported in the current code base? Or that we can support them now?

reswqa avatar Mar 15 '24 06:03 reswqa

If it is some more complex data format, the possibility of random writes is not excluded, I think. So it's best to support seek on writer in the future.

Yes, here are some issues related to services that don't support seek writing, such as S3.

I understand that we can implement seek, but only in very limited scenarios. It may vary across different platforms.

So it's best to support seek on writer in the future.

Yep, we can implement for fs at least.

I can confirm the DataFrame can indeed be written to a new csv file opened via opendal.

Good to know, let's make it work first.

Are you saying that both of these are supported in the current code base? Or that we can support them now?

Yep, flush and close API are already available but not implemented.

Xuanwo avatar Mar 15 '24 06:03 Xuanwo

Oh, I just thought of another problem: our implementation of File::write does not seem to conform to the requirement in https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects. It needs to explicitly return the number of bytes written.
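Until that is fixed in the binding, a wrapper can normalize the return value on the Python side. A minimal sketch, assuming the underlying write consumes the whole buffer and returns None; `ByteCountingWriter` and `NoneReturningSink` are hypothetical names used only for illustration.

```python
class ByteCountingWriter:
    """Wrap a writer so write() always returns the number of bytes written."""

    def __init__(self, raw):
        self._raw = raw

    def write(self, data: bytes) -> int:
        written = self._raw.write(data)
        # Fall back to len(data) when the underlying write returns None.
        return len(data) if written is None else written


class NoneReturningSink:
    """Hypothetical stand-in for a writer whose write() returns None."""

    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(bytes(data))  # implicitly returns None


writer = ByteCountingWriter(NoneReturningSink())
print(writer.write(b"hello"))
```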


reswqa avatar Mar 15 '24 06:03 reswqa

Update: Fortunately, this is the only change we need to make to support writing the polars DataFrame to opendal.

Thank you very much!

Xuanwo avatar Mar 15 '24 06:03 Xuanwo

File has been implemented, feel free to raise new issues.

Xuanwo avatar Jul 01 '24 11:07 Xuanwo