[BUG] Intermittent bug decompressing files in the parquet reader in 24.08.
This is a complex bug. It was initially seen running NDS benchmarks on H100 machines. We could not get it to reproduce on other machines, although later investigation leads us to believe this probably isn't actually H100-specific, but is more of a timing/race sort of issue that H100 perturbs into happening.
- During one of the NDS queries, we can get a fairly reproducible failure to decompress a parquet file. It sometimes manifests as an overt exception thrown by the reader ("Error during decompression"), but also seems to manifest as simply corrupt data, causing the benchmark to fail validation against the CPU.
- This reproduces in 24.08, but not 24.06. It also does not seem to reproduce from a clean local cudf build, only from a `libcudf.so` that comes out of the `spark-rapids-jni` environment.
- We managed to get a reproducible case with a handful of files that are coalesced by Spark and handed to cudf. See attached.
- Running a process with >= 2 threads that spin forever and randomly load from the selection of files on an H100 machine will produce the exception reasonably quickly, typically within < 1000 loads per thread (which only takes seconds; these files are tiny).
- These files are gzip compressed, which is somewhat out of the ordinary. After digging into this quite a bit, the hand-rolled `gpuinflate` implementation in cudf seemed like a likely candidate. It does some odd things passing messages between warps that are probably UB, and `compute-sanitizer` does complain about races in the kernel. I wasn't able to narrow it all the way down, though.
- Instead, I tried swapping `gpuinflate` out entirely and just using nvcomp for gzip. With one caveat, this makes the problem go away: the machine that repros the issue remains fine after hundreds of thousands of loads across multiple threads.
- The big caveat from above is that the nvcomp gzip decompressor seems to have explicit trouble with the file `dbgdump586530430.parquet`. It quietly fails to decode the first data page, but doesn't produce an error. This is likely a different issue entirely, and I've sent a repro to nvcomp. If I remove this file from the rotation and run the test, there are no issues.
So, while not 100% conclusive, it seems like the `gpuinflate()` implementation in cudf has some race/timing/synchronization issues that can occasionally cause failures, with H100 being a good way to trigger them.
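For reference, the stress loop described above (>= 2 threads spinning forever over random picks from the file set) can be sketched generically; `load` here is a hypothetical stand-in for whatever actually reads a file (e.g. a call into the cudf parquet reader):

```python
import concurrent.futures
import random

def hammer(load, paths, iterations=1000, threads=2):
    """Run `threads` workers that each perform `iterations` random loads
    from `paths`. The first exception raised by any worker propagates
    out of result(), which is how the decompression error surfaces."""
    def worker(seed):
        rng = random.Random(seed)
        for _ in range(iterations):
            load(rng.choice(paths))
        return iterations

    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as ex:
        futures = [ex.submit(worker, i) for i in range(threads)]
        return sum(f.result() for f in futures)
```

On the machine that repros, a loop like this typically hits the failure well within 1000 loads per thread.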
I put up a branch with my (very much quick-and-dirty) nvcomp integration. https://github.com/nvdbaranec/cudf/tree/nvcomp_2408_integration
For some reason, github isn't letting me upload the archive (it's only 97k). Ping me on slack if you'd like to see it.
Thank you @nvdbaranec for working on this new integration. Adding gzip support to the nvCOMP adaptor seems like a good option. Does Spark-RAPIDS use the `LIBCUDF_NVCOMP_POLICY` option of `STABLE`? https://docs.rapids.ai/api/cudf/nightly/user_guide/io/io/#nvcomp-integration
> Does Spark-RAPIDS use LIBCUDF_NVCOMP_POLICY option of STABLE?
Indirectly, yes. We don't set the `LIBCUDF_NVCOMP_POLICY` environment variable, and `STABLE` is the default.
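As a toy illustration of that fallback (not libcudf's actual implementation, which does this in C++), the resolution amounts to:

```python
import os

def resolve_nvcomp_policy() -> str:
    # Per the cudf docs, an unset LIBCUDF_NVCOMP_POLICY falls back to STABLE,
    # so a deployment that never sets the variable gets STABLE implicitly.
    return os.environ.get("LIBCUDF_NVCOMP_POLICY", "STABLE").upper()
```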
> The big caveat from above is that the nvcomp gzip decompressor seems to have explicit trouble with the file dbgdump586530430.parquet
@nvdbaranec does this failure still occur with nvCOMP 4.2.0.11 (or 5.0)?
> Adding gzip support to the nvCOMP adaptor seems like a good option.
@sameerz if Spark-RAPIDS confirms successful execution of NDS 3K with Parquet GZIP, then we could switch to using nvCOMP by default for these files.