
[BUG] CudfColumnSizeOverflowException in populating a column of input file name

Open jihoonson opened this issue 1 month ago • 6 comments

Describe the bug

GpuInputFileName.columnarEval populates a string column with a scalar value, the input file path, repeated for every row in the batch. Because the file path length varies, this can throw CudfColumnSizeOverflowException when inputFilePath.getLength() * batch.numRows() exceeds the cudf column size limit. The file path was about 200 bytes long when I hit this error in my testing.
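To make the overflow condition concrete, here is a minimal arithmetic sketch. It is not plugin code: the function names are hypothetical, and the limit of 2^31 - 1 bytes is an assumption based on cudf using 32-bit sizes; the actual limit enforced by the plugin may differ.

```python
# Assumption: the cudf column size limit is the largest signed 32-bit value.
# This constant is illustrative and is not read from cudf or the plugin.
ASSUMED_CUDF_COLUMN_SIZE_LIMIT = 2**31 - 1


def file_name_column_bytes(path_length_bytes, num_rows):
    """Bytes of character data needed to repeat the path once per row."""
    return path_length_bytes * num_rows


def would_overflow(path_length_bytes, num_rows,
                   limit=ASSUMED_CUDF_COLUMN_SIZE_LIMIT):
    """True if the repeated file path would exceed the assumed limit."""
    return file_name_column_bytes(path_length_bytes, num_rows) > limit


def max_rows_before_overflow(path_length_bytes,
                             limit=ASSUMED_CUDF_COLUMN_SIZE_LIMIT):
    """Largest row count whose file name column still fits in the limit."""
    if path_length_bytes <= 0:
        raise ValueError("path_length_bytes must be positive")
    return limit // path_length_bytes


if __name__ == "__main__":
    # With a 200-byte path, roughly 10.7 million rows already fill the column.
    print(max_rows_before_overflow(200))
    print(would_overflow(200, 20_000_000))
```

Under this assumed limit, a 200-byte path overflows once a batch holds more than 10,737,418 rows, which is why the error depends on both the path length and the batch row count.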

Steps/Code to reproduce bug

Place your data set in a directory whose path is very long, then read it with a query that uses input_file_name(). The data set should also be large enough that inputFilePath.getLength() * batch.numRows() exceeds the cudf column size limit.

Expected behavior

The plugin should be able to handle this automatically by adjusting the batch size based on the file path length.
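One way to picture the expected behavior is to cap the number of rows per output chunk by the file path length. The sketch below is only an illustration of that idea, not the plugin's implementation: split_row_ranges is a hypothetical helper, and the 2^31 - 1 byte limit is an assumption.

```python
# Assumption: the cudf column size limit is the largest signed 32-bit value.
ASSUMED_CUDF_COLUMN_SIZE_LIMIT = 2**31 - 1


def split_row_ranges(num_rows, path_length_bytes,
                     limit=ASSUMED_CUDF_COLUMN_SIZE_LIMIT):
    """Split [0, num_rows) into half-open (start, end) ranges such that
    path_length_bytes * (end - start) never exceeds the limit."""
    if num_rows < 0 or path_length_bytes < 0:
        raise ValueError("num_rows and path_length_bytes must be non-negative")
    if num_rows == 0:
        return []
    if path_length_bytes == 0:
        # An empty path adds no character data, so one range is enough.
        return [(0, num_rows)]
    if path_length_bytes > limit:
        raise ValueError("a single file path does not fit in the limit")
    # Largest number of rows whose repeated path still fits in one column.
    max_rows = limit // path_length_bytes
    return [(start, min(start + max_rows, num_rows))
            for start in range(0, num_rows, max_rows)]


if __name__ == "__main__":
    # A 25-million-row batch with a 200-byte path needs more than one chunk.
    for start, end in split_row_ranges(25_000_000, 200):
        print(start, end, (end - start) * 200)
```

Each resulting range can then be populated as its own column without crossing the assumed limit. A real fix would also need to account for the other columns in the batch, which this sketch ignores.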

jihoonson, Dec 02 '25 00:12