[BUG] Truncated dataframe when reading parquet from s3fs
Describe the bug
When using read_parquet to retrieve data from an on-prem object store, the resulting dataframe is truncated. In my example, only 29999 rows appear even though the data actually contains 73049 rows (as confirmed by pandas).
By passing the option use_python_file_object=False to read_parquet, the issue goes away.
The cudf version also emits a series of warnings that don't appear when using the same code through vanilla pandas.
/opt/conda/envs/rapids/lib/python3.9/site-packages/fsspec/caching.py:503: UserWarning: Read is outside the known file parts: (0, 4). IO/caching performance may be poor!
warnings.warn(
/opt/conda/envs/rapids/lib/python3.9/site-packages/fsspec/caching.py:503: UserWarning: Read is outside the known file parts: (778649, 778657). IO/caching performance may be poor!
warnings.warn(
/opt/conda/envs/rapids/lib/python3.9/site-packages/fsspec/caching.py:503: UserWarning: Read is outside the known file parts: (773547, 778649). IO/caching performance may be poor!
warnings.warn(
Steps/Code to reproduce bug
import socket

# Use pandas on the local host, cudf elsewhere (GPU node)
if socket.gethostname() == "localdf":
    import pandas as pd
else:
    import cudf as pd

# ENDPOINT_URL and BUCKETPATH point at the on-prem object store
storage_opts = {'client_kwargs': {'endpoint_url': ENDPOINT_URL}}
date_df = pd.read_parquet("s3://" + BUCKETPATH + 'date_dim.dat/', storage_options=storage_opts)
print(date_df.info())
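For reference, here is a minimal sketch (reusing the same ENDPOINT_URL and BUCKETPATH placeholders, and assuming both pandas and cudf are installed in the environment) that makes the row-count discrepancy explicit:

import pandas
import cudf

storage_opts = {'client_kwargs': {'endpoint_url': ENDPOINT_URL}}
path = "s3://" + BUCKETPATH + 'date_dim.dat/'

# pandas returns the full dataset, while cudf returns a truncated frame
print("pandas rows:", len(pandas.read_parquet(path, storage_options=storage_opts)))
print("cudf rows:", len(cudf.read_parquet(path, storage_options=storage_opts)))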
Expected behavior
I expect the full dataframe to be loaded.
Environment overview
Using docker image:
FROM rapidsai/rapidsai-core:22.06-cuda11.5-base-ubuntu20.04-py3.9
I tried s3fs versions 0.4.2 and 2022.3.0 with the same results, along with fsspec 2022.5.0.
Hi @joshuarobinson, I encountered the same issue, but I was using dask-cudf. Did you find a solution, or do you have more information on this?
Hi @elado-alon, the following change worked as a workaround for me:
import cudf as pd
read_args = {'use_python_file_object': False}
catalog_sales_df = pd.read_parquet("s3://" + BUCKETPATH + 'catalog_sales.dat/', **read_args, storage_options=storage_opts)
cc @rjzamora
I will try to reproduce this today. Note that use_python_file_object=False is probably fine if you are reading the entire file, but it may be quite slow if you are performing partial IO (selecting specific columns and/or row-groups).
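For context, partial IO means passing a column (or row-group) selection to read_parquet, for example (a hypothetical sketch reusing the placeholders from the report above; the column names are made up):

import cudf

# Only the requested columns need to be fetched from the object store;
# with use_python_file_object=False the whole file may be read instead.
subset_df = cudf.read_parquet(
    "s3://" + BUCKETPATH + 'date_dim.dat/',
    columns=['d_date_sk', 'd_date'],  # hypothetical column names
    storage_options=storage_opts,
)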
I was able to reproduce the truncation bug with cudf (using the 'date_dim.dat' dataset shared with me offline by @joshuarobinson). I will try to figure out what is going wrong here.
I was not able to reproduce the IO/caching warning with rapidsai/rapidsai-core:22.06-cuda11.5-base-ubuntu20.04-py3.9 (or with a more recent version of cudf and fsspec). I suspect the IO/caching warning is just a result of an old fsspec and/or s3fs package, but I'll be happy to dig deeper if the warning can be reproduced with a specific cudf/fsspec/s3fs combination that I can test.
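For anyone who still sees the warning, a quick way to report the relevant package versions (a minimal sketch):

import cudf, fsspec, s3fs

print("cudf:", cudf.__version__)
print("fsspec:", fsspec.__version__)
print("s3fs:", s3fs.__version__)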
Note that I was not able to reproduce the truncation bug or the IO/caching warning with dask_cudf:
In [3]: len(dask_cudf.read_parquet(path))
Out[3]: 73049
In [4]: len(cudf.read_parquet(path))
Out[4]: 29999
@elado-alon - Are you observing the truncation bug, the IO/caching warning, or both? Can you share the specific fsspec/s3fs versions you are using?
Update: The truncated-dataframe result is indeed a cudf bug, and I submitted a fix in #11655
@rjzamora
I had both the truncation bug and the IO/caching warning. The versions are s3fs==2022.7.1 and fsspec==2022.7.1.
@joshuarobinson and @elado-alon - https://github.com/rapidsai/cudf/pull/11655 was just merged into branch-22.10. Please let me know if you still see an IO/caching warning. If not, we should be able to close this issue.
Please feel free to re-open if the issue is not solved. Thank you @rjzamora for your contribution.