opendal

bug: in some unusual cases python binding slower than ossfs

Open sysu-yunz opened this issue 1 year ago • 3 comments

Describe the bug

Related to #4901. While investigating that issue, we found that in some cases OpenDAL's Python binding is slower than ossfs when writing a Polars DataFrame to a Parquet file through the Alibaba Cloud OSS public endpoint.

Steps to Reproduce

  • Code to reproduce:
import opendal
import polars as pl
import ossfs
import time
from io import BytesIO
import pyarrow.dataset as ds

oss_config = {
    "endpoint": "oss-cn-shenzhen.aliyuncs.com",
    "key": "my-key",
    "secret": "my-secret",
    "bucket": "opendal-test",
}


class myOperator:
    def __init__(self) -> None:
        self.op = opendal.Operator(
            "oss",
            bucket=oss_config["bucket"],
            endpoint=oss_config["endpoint"],
            access_key_id=oss_config["key"],
            access_key_secret=oss_config["secret"],
        )

    def write(
        self,
        df: pl.DataFrame,
        file_name: str,
    ) -> None:
        # Open and write the DataFrame to the file.
        with self.op.open(file_name, mode="wb") as file:
            # df.write_parquet(file)  # raises a warning: "Polars found a filename. Ensure you pass a path to the file instead of a python file object when possible for best performance."
            buffer = BytesIO()
            df.write_parquet(buffer)
            file.write(buffer.getvalue())

    def read(
        self,
        file_name: str,
    ) -> None:
        # Open and read the DataFrame from the file.
        with self.op.open(file_name, mode="rb") as file:
            read_df = pl.read_parquet(file)  # raises TypeError: 'opendal.File' object is not iterable
            print(f"read_df: {read_df.head(5)}")


class myOSSFileSystem:
    def __init__(self) -> None:
        self.fs = ossfs.OSSFileSystem(
            bucket=oss_config["bucket"],
            endpoint=oss_config["endpoint"],
            key=oss_config["key"],
            secret=oss_config["secret"],
            default_cache_type="none",
        )

    def write(
        self,
        df: pl.DataFrame,
        file_name: str,
    ) -> None:
        # Open and write the DataFrame to the file.
        with self.fs.open(f"{oss_config['bucket']}/{file_name}", mode="wb") as file:
            # df.write_parquet(file)
            buffer = BytesIO()
            df.write_parquet(buffer)
            file.write(buffer.getvalue())

    def read(
        self,
        file_name: str,
    ) -> None:
        # Open and read the DataFrame from the file.
        with self.fs.open(f"{oss_config['bucket']}/{file_name}", mode="rb") as file:
            read_df = pl.read_parquet(file)
            print(f"read_df: {read_df.tail(5)}")

    def read_ds(
        self,
        file_name: str,
    ) -> None:
        dataset = ds.dataset(
            source=f"{oss_config['bucket']}/{file_name}", format="parquet", filesystem=self.fs
        )
        pl.scan_pyarrow_dataset(dataset).collect()


df = pl.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [5, 4, 3, 2, 1],
    }
)

# df = pl.read_parquet("file_large.parquet")


# Compare the time cost of writing with opendal and ossfs:
# write 10 times with each, print the cost of every run, then print the average.

total_start = time.time()
for i in range(10):
    start = time.time()
    op = myOperator()
    op.write(df, "file_op.parquet")
    print(f"opendal write cost: {time.time() - start}")
print(f"average opendal write cost: {(time.time() - total_start) / 10}")

total_start2 = time.time()
for i in range(10):
    start2 = time.time()
    fs = myOSSFileSystem()
    fs.write(df, "file_fs.parquet")
    print(f"ossfs write cost: {time.time() - start2}")
print(f"average ossfs write cost: {(time.time() - total_start2) / 10}")

  • Environment and dependencies:

Use the OSS public endpoint over the public network. The internal endpoint seems to behave normally.

opendal==0.45.6
ossfs==2023.12.0

Python 3.12.4
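Note that the script above constructs a new client inside every timed iteration, so each measurement includes client setup as well as the write. A minimal sketch of a harness that times setup once and each write separately is shown below. `time_writes` and its parameters are hypothetical names introduced here for illustration and are not part of opendal or ossfs; the stand-in client only sleeps so that the sketch runs without network access.

```python
import statistics
import time
from typing import Callable, TypeVar

C = TypeVar("C")


def time_writes(
    make_client: Callable[[], C],
    write_once: Callable[[C], None],
    runs: int = 10,
) -> dict:
    """Time client construction once, then time each write on its own."""
    t0 = time.perf_counter()
    client = make_client()
    setup_seconds = time.perf_counter() - t0

    write_seconds = []
    for _ in range(runs):
        t0 = time.perf_counter()
        write_once(client)
        write_seconds.append(time.perf_counter() - t0)

    # Report the first write separately: it may include one-off costs
    # that later writes on the same client do not pay.
    return {
        "setup": setup_seconds,
        "first": write_seconds[0],
        "median": statistics.median(write_seconds),
        "max": max(write_seconds),
    }


if __name__ == "__main__":
    # Stand-in client and writer (assumptions, not real storage calls).
    result = time_writes(lambda: object(), lambda client: time.sleep(0.001), runs=5)
    print(result)
```

With the classes from the script above, this could be called as `time_writes(myOperator, lambda c: c.write(df, "file_op.parquet"))` and likewise for `myOSSFileSystem`, which would show whether the gap comes from setup, from the first write, or from every write.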

Expected Behavior

The time cost should be about the same for both.

Additional Context

It is possible that this is not a bug, or that ossfs (or the official oss2 SDK) applies some optimizations of its own. Further investigation is advisable.
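One way to start that investigation, offered only as a sketch, is to attribute wall-clock time to individual phases of a write (for example Parquet serialization, the `file.write` call, and closing the file) rather than timing the whole call. `PhaseTimer` below is a hypothetical helper built on the standard library; the phase names and stand-in workloads are assumptions and do not reflect how opendal or ossfs behave internally.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock seconds per named phase."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add to the running total so repeated phases are summed.
            self.totals[name] += time.perf_counter() - start


if __name__ == "__main__":
    timer = PhaseTimer()
    with timer.phase("serialize"):
        payload = bytes(1024)  # stand-in for df.write_parquet(buffer)
    with timer.phase("upload"):
        time.sleep(0.001)  # stand-in for file.write(buffer.getvalue())
    for name, seconds in timer.totals.items():
        print(f"{name}: {seconds:.6f}s")
```

Wrapping the same phases in both `myOperator.write` and `myOSSFileSystem.write` would show which step accounts for the difference on the public endpoint.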

Are you willing to submit a PR to fix this bug?

  • [X] Yes, I would like to submit a PR.

sysu-yunz • Jul 16 '24 12:07

Hi, @sysu-yunz, are you interested in this?

Xuanwo • Sep 09 '24 14:09

> Hi, @sysu-yunz, are you interested in this?

I am interested in contributing, but I must admit that I am not very familiar with Rust. I would like to understand more about the bug and the context in which it occurs before committing to a fix. Additionally, I have been quite busy lately, which may limit my availability to dive into this issue right away.

If anyone else is interested in collaborating on this issue, I would be happy to work together to find a solution.

sysu-yunz • Sep 09 '24 14:09

> I am interested in contributing, but I must admit that I am not very familiar with Rust. I would like to understand more about the bug and the context in which it occurs before committing to a fix.

Thank you for that!

> If anyone else is interested in collaborating on this issue, I would be happy to work together to find a solution.

cc @Zheaoli who is interested.

Xuanwo • Sep 09 '24 15:09

This is not related to opendal and there is no action to take so far, so let's close.

Xuanwo • Jan 09 '25 09:01