bug: in some unusual cases python binding slower than ossfs
Describe the bug
Related to #4901 During investigation of the above issue, we found that OpenDAL's Python binding will be slower than ossfs when writing Polars dataframe to a parquet file when using Alibaba Cloud OSS public endpoint in some cases.
Steps to Reproduce
- Code to reproduce:
import opendal
import polars as pl
import ossfs
import time
from io import BytesIO
import pyarrow.dataset as ds
oss_config = {
"endpoint": "oss-cn-shenzhen.aliyuncs.com",
"key": "my-key",
"secret": "my-secret",
"bucket": "opendal-test",
}
class myOperator:
def __init__(self) -> None:
self.op = opendal.Operator(
"oss",
bucket=oss_config["bucket"],
endpoint=oss_config["endpoint"],
access_key_id=oss_config["key"],
access_key_secret=oss_config["secret"],
)
def write(
self,
df: pl.DataFrame,
file_name: str,
) -> None:
# Open and write the DataFrame to the file.
with self.op.open(file_name, mode="wb") as file:
# df.write_parquet(file) # raise warning: Polars found a filename. Ensure you pass a path to the file instead of a python file object when possible for best performance.
buffer = BytesIO()
df.write_parquet(buffer)
file.write(buffer.getvalue())
def read(
self,
file_name: str,
) -> None:
# Open and read the DataFrame from the file.
with self.op.open(file_name, mode="rb") as file:
read_df = pl.read_parquet(file) # raise error: TypeError: 'opendal.File' object is not iterable
print(f"read_df: {read_df.head(5)}")
class myOSSFileSystem:
def __init__(self) -> None:
self.fs = ossfs.OSSFileSystem(
bucket=oss_config["bucket"],
endpoint=oss_config["endpoint"],
key=oss_config["key"],
secret=oss_config["secret"],
default_cache_type="none",
)
def write(
self,
df: pl.DataFrame,
file_name: str,
) -> None:
# Open and write the DataFrame to the file.
with self.fs.open(f"{oss_config['bucket']}/{file_name}", mode="wb") as file:
# df.write_parquet(file)
buffer = BytesIO()
df.write_parquet(buffer)
file.write(buffer.getvalue())
def read(
self,
file_name: str,
) -> None:
# Open and read the DataFrame from the file.
with self.fs.open(f"{oss_config['bucket']}/{file_name}", mode="rb") as file:
read_df = pl.read_parquet(file)
print(f"read_df: {read_df.tail(5)}")
def read_ds(
self,
file_name: str,
) -> None:
dataset = ds.dataset(
source=f"{oss_config['bucket']}/{file_name}", format="parquet", filesystem=self.fs
)
pl.scan_pyarrow_dataset(dataset).collect()
df = pl.DataFrame(
{
"a": [1, 2, 3, 4, 5],
"b": [5, 4, 3, 2, 1],
}
)
# df = pl.read_parquet("file_large.parquet")
# compare time cost of writing with opendal and ossfs
# write 10 times and calculate the average time cost print the result
total_start = time.time()
for i in range(10):
start = time.time()
op = myOperator()
op.write(df, "file_op.parquet")
print(f"opendal write cost: {time.time() - start}")
print(f"average opendal write cost: {(time.time() - total_start) / 10}")
total_start2 = time.time()
for i in range(10):
start2 = time.time()
fs = myOSSFileSystem()
fs.write(df, "file_fs.parquet")
print(f"ossfs write cost: {time.time() - start2}")
print(f"average ossfs write cost: {(time.time() - total_start2) / 10}")
- Environment and dependencies:
Use OSS public endpoint and public network, internal endpoint seems normal.
opendal==0.45.6
ossfs==2023.12.0
Python 3.12.4
Expected Behavior
Time cost about the same.
Additional Context
It is possible that this is not a bug or that ossfs(or official oss2 SDK) did some tricks, advisable to conduct further investigation.
Are you willing to submit a PR to fix this bug?
- [X] Yes, I would like to submit a PR.
Hi, @sysu-yunz, are you interested in this?
Hi, @sysu-yunz, are you interested in this?
I am interested in contributing, but I must admit that I am not very familiar with Rust. I would like to understand more about the bug and the context in which it occurs before committing to a fix. Additionally, I have been quite busy lately, which may limit my availability to dive into this issue right away.
If anyone else is interested in collaborating on this issue, I would be happy to work together to find a solution.
I am interested in contributing, but I must admit that I am not very familiar with Rust. I would like to understand more about the bug and the context in which it occurs before committing to a fix.
Thank you for that!
If anyone else is interested in collaborating on this issue, I would be happy to work together to find a solution.
cc @Zheaoli who is interested.
Not related to opendal and no actions to take so far, let's close.