Occasional `RuntimeError` in `cudf.read_parquet` with kvikio backend with remote data
I'll occasionally see a RuntimeError when using cudf.read_parquet to read a parquet file from S3.
I'll grab a full traceback next time I see one, but here's part of one:
```
File "/home/ubuntu/miniforge3/envs/kvikio-env/lib/python3.12/site-packages/cudf/io/parquet.py", line 1280, in _read_parquet
  tbl_w_meta = plc.io.parquet.read_parquet(options)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "parquet.pyx", line 309, in pylibcudf.io.parquet.read_parquet
File "parquet.pyx", line 324, in pylibcudf.io.parquet.read_parquet
RuntimeError: CUDF failure at:/opt/conda/conda-bld/work/cpp/src/io/parquet/reader_impl_preprocess.cu:590: Parquet header parsing failed with code(s) 0x5. With unsupported encodings found:
```
I've also seen:

```
RuntimeError: CUDF failure at:/opt/conda/conda-bld/work/cpp/src/io/parquet/reader_impl_preprocess.cu:314: Parquet header parsing failed with code(s) while counting page headers 0x5
```
At first glance, this looks like an incomplete read from blob storage. Perhaps kvikio or cudf did a `.read(nbytes)` but fewer than `nbytes` bytes were returned?
I'll try to get a more reproducible example and some more debug output.
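If a short read is in fact the cause, a defensive wrapper would surface it immediately instead of handing a partially filled buffer to the parquet parser. This is only a sketch: `read_exact` and the `pread` callable it takes are hypothetical stand-ins, assuming the positional read reports the number of bytes it actually delivered (kvikio's real `RemoteFile` API may behave differently).

```python
def read_exact(pread, buf, size, offset, max_tries=8):
    """Fill `buf` with exactly `size` bytes starting at file `offset`.

    `pread(view, size, offset)` is a hypothetical positional-read
    callable that fills `view[:n]` and returns `n`, the byte count
    actually read. Short reads are retried from where they left off;
    anything still missing after `max_tries` raises instead of being
    silently passed along.
    """
    mv = memoryview(buf)
    got = 0
    for _ in range(max_tries):
        n = pread(mv[got:], size - got, offset + got)
        if n <= 0:
            break  # EOF or error: stop retrying
        got += n
        if got == size:
            return got
    raise IOError(f"short read: wanted {size} bytes, got {got}")
```

Even if kvikio already guarantees full reads, a check like this would help distinguish "truncated bytes from S3" from "genuinely malformed parquet".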
Here's another one from a `.read()` (not using cudf, but cudf presumably calls `read` eventually). This one looks more like a DNS issue:
```
File "/home/rapids/remote-io-benchmark/remote_io_benchmark/s3.py", line 595, in read_one
  rf.read(buf)
  ^^^^^^^^^^^
File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/site-packages/kvikio/remote_file.py", line 182, in read
  return self.pread(buf, size, file_offset).get()
         ^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/site-packages/kvikio/cufile.py", line 54, in get
  return self._handle.get()
         ^^^^^^^^^^^^^^^^^
File "future.pyx", line 33, in kvikio._lib.future.IOFuture.get
RuntimeError: curl_easy_perform() error near /opt/conda/conda-bld/work/cpp/src/remote_handle.cpp:353(Could not resolve host: kvikiobench-33622.s3.us-east-1.amazonaws.com)
```
(I realize now that this and #601 are closely related, since both will likely involve retries. I think #601 can rely on the HTTP status code once the HTTP request completes; this one might be more involved.)
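Until a library-level retry lands, transient failures like the DNS one above can be worked around at the call site. A rough sketch, with the caveat that matching on the error-message text is a placeholder (a proper fix would key off the curl error code or HTTP status, which these `RuntimeError`s don't currently expose):

```python
import time

# Substrings of curl-level errors that are plausibly transient.
# Hypothetical list; tune for the failures actually observed.
TRANSIENT = ("Could not resolve host", "SSL_read", "timed out")

def with_retries(fn, *args, attempts=4, base_delay=0.5, **kwargs):
    """Call `fn`, retrying transient RuntimeErrors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn(*args, **kwargs)
        except RuntimeError as err:
            transient = any(t in str(err) for t in TRANSIENT)
            if attempt == attempts - 1 or not transient:
                raise
            time.sleep(base_delay * 2**attempt)
```

Usage would be e.g. `with_retries(rf.read, buf)` around the failing call.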
And another one, in `read_parquet` on a Dask worker:

```
Exception: RuntimeError('CUDF failure at: /opt/conda/conda-bld/work/cpp/src/io/parquet/reader_impl_chunking.cu:1044: Encountered malformed parquet page data (row count mismatch in page data)')
Traceback:
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/dask/dataframe/io/parquet/core.py", line 97, in __call__
    return read_parquet_part(
           ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/dask/dataframe/io/parquet/core.py", line 648, in read_parquet_part
    func(
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/dask_cudf/_legacy/io/parquet.py", line 280, in read_partition
    cls._read_paths(
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/dask_cudf/_legacy/io/parquet.py", line 124, in _read_paths
    raise err
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/dask_cudf/_legacy/io/parquet.py", line 94, in _read_paths
    df = cudf.read_parquet(
         ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/io/parquet.py", line 911, in read_parquet
    df = _parquet_to_frame(
         ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/io/parquet.py", line 1059, in _parquet_to_frame
    return _read_parquet(
           ^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/io/parquet.py", line 1280, in _read_parquet
    tbl_w_meta = plc.io.parquet.read_parquet(options)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "parquet.pyx", line 309, in pylibcudf.io.parquet.read_parquet
  File "parquet.pyx", line 324, in pylibcudf.io.parquet.read_parquet
```
Another one:

```
Traceback (most recent call last):
  File "/opt/conda/envs/remote-io-benchmark/bin/kvikiobench", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/rapids/remote-io-benchmark/remote_io_benchmark/benchmark.py", line 159, in main
    asyncio.run(amain())
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/asyncio/base_events.py", line 686, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/rapids/remote-io-benchmark/remote_io_benchmark/benchmark.py", line 152, in amain
    results.append(repeat(func, config=config, n=parsed.n_iter))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rapids/remote-io-benchmark/remote_io_benchmark/_framework.py", line 107, in repeat
    runs.append(func(config))
                ^^^^^^^^^^^^
  File "/home/rapids/remote-io-benchmark/remote_io_benchmark/s3.py", line 612, in time_many_large_binary_dask
    client.gather(futures)
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/site-packages/distributed/client.py", line 2566, in gather
    return self.sync(
           ^^^^^^^^^^
  File "/home/rapids/remote-io-benchmark/remote_io_benchmark/s3.py", line 595, in read_one
    rf.read(buf)
    ^^^^^^^^^^^
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/site-packages/kvikio/remote_file.py", line 182, in read
    return self.pread(buf, size, file_offset).get()
           ^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/site-packages/kvikio/cufile.py", line 54, in get
    return self._handle.get()
           ^^^^^^^^^^^^^^^^^
  File "future.pyx", line 33, in kvikio._lib.future.IOFuture.get
RuntimeError: curl_easy_perform() error near /opt/conda/conda-bld/work/cpp/src/remote_handle.cpp:353(OpenSSL SSL_read: OpenSSL/3.4.0: error:0A0000C6:SSL routines::packet length too long, errno 0)
```
I had experienced similar failures that may have been due to gh-763. Of course, gh-763 only makes sense if multiple projects using kvikio are loaded in the same process, which seems slightly doubtful here, but it may still be worth a look (e.g. via `LD_DEBUG=bindings` to see whether libkvikio symbols get re-routed).

EDIT: I suppose it's unlikely to have anything to do with this if remote data is the important part here...
The parquet parsing error is likely caused by a cuDF issue identified in https://github.com/rapidsai/cudf/issues/19586, where a KvikIO exception did not propagate to cuDF. That bug has been fixed by https://github.com/rapidsai/cudf/pull/19628. Now, if there is a connection error on the KvikIO side, cuDF should forward the correct exception message.
Does the `Could not resolve host` error occur only during the retry step? Can it be reproduced with the 25.10 nightly? If so, I can look into this.
These were sporadic and I never got a concrete, reliable reproducer. I'll go ahead and close this; we can reopen it if anyone still runs into it.