Occasional `RuntimeError` in `cudf.read_parquet` with kvikio backend with remote data

Open TomAugspurger opened this issue 11 months ago • 1 comments

I'll occasionally see a RuntimeError when using cudf.read_parquet to read a parquet file from S3.

I'll grab a full traceback next time I see one, but here's part of one:

    #   File "/home/ubuntu/miniforge3/envs/kvikio-env/lib/python3.12/site-packages/cudf/io/parquet.py", line 1280, in _read_parquet
    #     tbl_w_meta = plc.io.parquet.read_parquet(options)
    #                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    #   File "parquet.pyx", line 309, in pylibcudf.io.parquet.read_parquet
    #   File "parquet.pyx", line 324, in pylibcudf.io.parquet.read_parquet
    # RuntimeError: CUDF failure at:/opt/conda/conda-bld/work/cpp/src/io/parquet/reader_impl_preprocess.cu:590: Parquet header parsing failed with code(s) 0x5. With unsupported encodings found:

I've also seen

    # RuntimeError: CUDF failure at:/opt/conda/conda-bld/work/cpp/src/io/parquet/reader_impl_preprocess.cu:314: Parquet header parsing failed with code(s) while counting page headers 0x5

At first glance, this looks like an incomplete read from blob storage. Perhaps kvikio or cudf did a .read(nbytes) but fewer than nbytes bytes were returned?
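To illustrate the short-read hypothesis, here is a minimal sketch of a read loop that keeps calling read until the requested number of bytes has arrived. It is not how kvikio or cudf read data internally; `ShortReader` and `read_exactly` are hypothetical names used only for this example, and `ShortReader` merely simulates a stream that returns fewer bytes than requested.

```python
import io


class ShortReader:
    """Hypothetical stand-in for a remote stream that may return short reads."""

    def __init__(self, data: bytes, max_chunk: int):
        self._buf = io.BytesIO(data)
        self._max_chunk = max_chunk

    def read(self, nbytes: int) -> bytes:
        # Return at most max_chunk bytes, even if more were requested.
        return self._buf.read(min(nbytes, self._max_chunk))


def read_exactly(stream, nbytes: int) -> bytes:
    """Call stream.read() repeatedly until exactly nbytes have been collected."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            # The stream ended early: fail loudly instead of returning
            # a truncated buffer that a parser would later choke on.
            raise EOFError(f"expected {nbytes} bytes, got {nbytes - remaining}")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


data = read_exactly(ShortReader(b"abcdefghij", max_chunk=3), 10)
```

If a caller skipped the loop and trusted a single read, the buffer would be only partially filled, which is the kind of corruption a Parquet header parser might report as an unsupported encoding.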

I'll try to get a more reproducible example and some more debug output.

TomAugspurger, Jan 27 '25 15:01

Here's another one from a .read() call (not using cudf, but presumably cudf eventually calls read). It looks more like a DNS issue:

  File "/home/rapids/remote-io-benchmark/remote_io_benchmark/s3.py", line 595, in read_one
    rf.read(buf)
^^^^^^^^^^^
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/site-packages/kvikio/remote_file.py", line 182, in read
    return self.pread(buf, size, file_offset).get()
  ^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/site-packages/kvikio/cufile.py", line 54, in get
    return self._handle.get()
  ^^^^^^^^^^^^^^^^^
  File "future.pyx", line 33, in kvikio._lib.future.IOFuture.get
RuntimeError: curl_easy_perform() error near /opt/conda/conda-bld/work/cpp/src/remote_handle.cpp:353(Could not resolve host: kvikiobench-33622.s3.us-east-1.amazonaws.com)

(I realize now that this and #601 are closely related, since both will likely involve retries. I think #601 will rely on the HTTP status code once the HTTP request completes. This one might be more involved, since a DNS failure means the request never gets far enough to return a status code.)
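As a rough sketch of what a retry on this kind of failure could look like from the caller's side: the helper below retries a callable with exponential backoff when it raises a RuntimeError whose message looks transient. This is not kvikio's retry mechanism and not what #601 proposes; `retry_transient`, `is_transient`, and `TRANSIENT_MARKERS` are hypothetical names, and matching on message substrings (taken from the tracebacks in this thread) is a fragile assumption.

```python
import time

# Assumption: substrings copied from the curl error messages in this thread.
TRANSIENT_MARKERS = ("Could not resolve host", "SSL_read")


def is_transient(exc: Exception) -> bool:
    """Guess whether an exception looks like a transient network failure."""
    message = str(exc)
    return isinstance(exc, RuntimeError) and any(
        marker in message for marker in TRANSIENT_MARKERS
    )


def retry_transient(func, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying with exponential backoff on transient errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RuntimeError as exc:
            # Give up on the last attempt or on errors that do not look transient.
            if attempt == max_attempts or not is_transient(exc):
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

A call such as `retry_transient(lambda: rf.read(buf))` would wrap the failing read from the traceback above. Errors that do not match a marker are re-raised immediately, so genuine bugs are not hidden behind retries.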

And another one in read_parquet on a dask worker:

Exception: "RuntimeError('CUDF failure at: /opt/conda/conda-bld/work/cpp/src/io/parquet/reader_impl_chunking.cu:1044: Encountered malformed parquet page data (row count mismatch in page data)')"
Traceback: '  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/dask/dataframe/io/parquet/core.py", line 97, in __call__
    return read_parquet_part(
           ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/dask/dataframe/io/parquet/core.py", line 648, in read_parquet_part
    func(
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/dask_cudf/_legacy/io/parquet.py", line 280, in read_partition
    cls._read_paths(
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/dask_cudf/_legacy/io/parquet.py", line 124, in _read_paths
    raise err
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/dask_cudf/_legacy/io/parquet.py", line 94, in _read_paths
    df = cudf.read_parquet(
         ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/io/parquet.py", line 911, in read_parquet
    df = _parquet_to_frame(
         ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/io/parquet.py", line 1059, in _parquet_to_frame
    return _read_parquet(
           ^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/io/parquet.py", line 1280, in _read_parquet
    tbl_w_meta = plc.io.parquet.read_parquet(options)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "parquet.pyx", line 309, in pylibcudf.io.parquet.read_parquet
  File "parquet.pyx", line 324, in pylibcudf.io.parquet.read_parquet

Another one:

Traceback (most recent call last):
  File "/opt/conda/envs/remote-io-benchmark/bin/kvikiobench", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/rapids/remote-io-benchmark/remote_io_benchmark/benchmark.py", line 159, in main
    asyncio.run(amain())
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/asyncio/base_events.py", line 686, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/rapids/remote-io-benchmark/remote_io_benchmark/benchmark.py", line 152, in amain
    results.append(repeat(func, config=config, n=parsed.n_iter))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rapids/remote-io-benchmark/remote_io_benchmark/_framework.py", line 107, in repeat
    runs.append(func(config))
                ^^^^^^^^^^^^
  File "/home/rapids/remote-io-benchmark/remote_io_benchmark/s3.py", line 612, in time_many_large_binary_dask
    client.gather(futures)
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/site-packages/distributed/client.py", line 2566, in gather
    return self.sync(
           ^^^^^^^^^^
  File "/home/rapids/remote-io-benchmark/remote_io_benchmark/s3.py", line 595, in read_one
    rf.read(buf)
^^^^^^^^^^^
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/site-packages/kvikio/remote_file.py", line 182, in read
    return self.pread(buf, size, file_offset).get()
  ^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/site-packages/kvikio/cufile.py", line 54, in get
    return self._handle.get()
  ^^^^^^^^^^^^^^^^^
  File "future.pyx", line 33, in kvikio._lib.future.IOFuture.get
RuntimeError: curl_easy_perform() error near /opt/conda/conda-bld/work/cpp/src/remote_handle.cpp:353(OpenSSL SSL_read: OpenSSL/3.4.0: error:0A0000C6:SSL routines::packet length too long, errno 0)

TomAugspurger, Jan 28 '25 16:01

I have experienced similar failures that may have been due to gh-763. Of course, gh-763 only applies if multiple projects in the environment use kvikio, which seems slightly doubtful here, but it may still be worth a look (e.g. via LD_DEBUG=bindings to see whether libkvikio symbols get re-routed).
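One way to sift through the LD_DEBUG=bindings output is sketched below. It assumes the usual glibc line format, `binding file A [0] to B [0]: normal symbol 'name'` (with a backtick before the symbol), and keeps only bindings whose symbol name contains "kvikio". `kvikio_bindings` and `BINDING_RE` are hypothetical names for this example, and the format assumption should be checked against real output.

```python
import re

# Assumed glibc format:
#   <pid>: binding file SRC [0] to DST [0]: normal symbol `NAME' [VERSION]
BINDING_RE = re.compile(
    r"binding file (?P<src>\S+) \[\d+\] to (?P<dst>\S+) \[\d+\]: "
    r"\w+ symbol `(?P<symbol>[^']+)'"
)


def kvikio_bindings(ld_debug_output: str, needle: str = "kvikio"):
    """Return (source file, target file, symbol) for bindings matching needle."""
    results = []
    for line in ld_debug_output.splitlines():
        match = BINDING_RE.search(line)
        if match and needle in match.group("symbol"):
            results.append(
                (match.group("src"), match.group("dst"), match.group("symbol"))
            )
    return results
```

The dynamic loader writes this output to standard error by default, so something like `LD_DEBUG=bindings python script.py 2> bindings.log` (with `script.py` as a placeholder) captures it to a file whose contents can be passed to `kvikio_bindings`. Seeing the same kvikio symbol resolve to more than one target library would hint at the re-routing described above.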

EDIT: I suppose it's unlikely to have anything to do with this, if remote data is the important part here...

seberg, Jun 30 '25 06:06

The parquet parsing error is likely caused by a cuDF issue identified in https://github.com/rapidsai/cudf/issues/19586, where KvikIO's exception did not propagate to cuDF. This bug has been fixed by https://github.com/rapidsai/cudf/pull/19628. Now if there is any connection error on the KvikIO side, cuDF should be able to forward the correct exception message.

Does the "Could not resolve host" error occur only in the retry step? Can it be reproduced in the 25.10 nightly? If so, I can look into this.

kingcrimsontianyu, Sep 02 '25 03:09

These were sporadic, and I never got a concrete, reliable reproducer. I'll go ahead and close this, and we can reopen it if anyone still runs into it.

TomAugspurger, Sep 17 '25 15:09