cudf
cudf copied to clipboard
[BUG] fix memory errors in cudf pytest
Describe the bug
Following python tests produced errors when run using compute-sanitizer --tool memcheck
102 errors.
Steps/Code to reproduce bug
compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_gpu_arrow_parser.py
compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_csv.py::test_na_filter_empty_fields
compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_sparse_df.py::test_fillna
compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_sparse_df.py::test_reading_arrow_sparse_data
compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_parquet.py::test_parquet_reader_basic[snappy0-1-columns0-cudf]
compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_parquet.py::test_parquet_reader_basic[snappy0-1-columns1-cudf]
compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_s3.py::test_read_parquet[True-None-None-32]
compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_orc.py::test_orc_reader_basic[TestOrcFile.testSnappy.orc-None-False-cudf]
compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_avro_reader_fastavro_integration.py::test_avro_compression[snappy-1]
Expected behavior There should be 0 errors from compute sanitizer
Additional context memcheck_errors.log attached Some of these errors may arise from dependent libraries such as nvcomp, arrow_cuda.
Reference: https://github.com/rapidsai/cudf/issues/904#issuecomment-1168230773
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
Some of these tests do not exist now. Some of the tests do not produce memory errors anymore.
Only the following test in the above list produces memory error.
time compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_csv.py::test_na_filter_empty_fields
@vuule do you know who would be a good person to dig into that error?
If there is only one error left then perhaps we are ready to work on enabling memcheck in nightlies for pytests?
I trace this down to this line: https://github.com/rapidsai/cudf/blob/9fbc249ce8be1bd515edb91fb768ecad870885aa/python/cudf/cudf/tests/test_csv.py#L2061
The output from the compute-sanitizer includes this:
cudf/tests/test_csv.py::test_na_filter_empty_fields ========= Invalid __global__ read of size 1 bytes
========= at 0xd20 in cudf::io::csv::gpu::data_type_detection(cudf::io::parse_options_view, cudf::device_span<const char, (unsigned long)18446744073709551615>, cudf::device_span<const unsigned char, (unsigned long)18446744073709551615>, cudf::device_span<const unsigned long, (unsigned long)18446744073709551615>, cudf::device_span<cudf::io::column_type_histogram, (unsigned long)18446744073709551615>)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x7fc216e05a0a is out of bounds
========= and is 3 bytes after the nearest allocation at 0x7fc216e05a00 of size 8 bytes
...
As the message says, the address that is out of bounds is right after 0x7fc216e05a00 which is the opts.trie_na device pointer from the first parameter in this call: https://github.com/rapidsai/cudf/blob/9fbc249ce8be1bd515edb91fb768ecad870885aa/cpp/src/io/csv/csv_gpu.cu#L170
The device_span at opts.trie_na reports a size of 1 and the sizeof(serial_trie_node)==4.
The following line is where the trie_na is passed to be read: https://github.com/rapidsai/cudf/blob/9fbc249ce8be1bd515edb91fb768ecad870885aa/cpp/src/io/csv/csv_gpu.cu#L202
The serialized_trie_contains() function is defined here: https://github.com/rapidsai/cudf/blob/9fbc249ce8be1bd515edb91fb768ecad870885aa/cpp/src/io/utilities/trie.cuh#L83-L84
And somewhere in here is the 1 byte read past the end of the buffer. I'm not able to follow the code there very well.
The trie_na device buffer is created here: https://github.com/rapidsai/cudf/blob/9fbc249ce8be1bd515edb91fb768ecad870885aa/cpp/src/io/csv/reader_impl.cu#L883
Thank you for digging into this @davidwendt This one is definitely for me to fix :) I should be able to work on this some time next week.
Opened https://github.com/rapidsai/cudf/pull/13011 with a fix. Targeted the PR to 23.06 as we are already in burndown, but could be convinced to merge into 23.04 :)
@vuule is this issue ready to close?