cudf icon indicating copy to clipboard operation
cudf copied to clipboard

[BUG] fix memory errors in cudf pytest

Open karthikeyann opened this issue 3 years ago • 1 comments
trafficstars

Describe the bug Following python tests produced errors when run using compute-sanitizer --tool memcheck 102 errors.

Steps/Code to reproduce bug

compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_gpu_arrow_parser.py
compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_csv.py::test_na_filter_empty_fields
compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_sparse_df.py::test_fillna
compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_sparse_df.py::test_reading_arrow_sparse_data
compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_parquet.py::test_parquet_reader_basic[snappy0-1-columns0-cudf]
compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_parquet.py::test_parquet_reader_basic[snappy0-1-columns1-cudf] 
compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_s3.py::test_read_parquet[True-None-None-32]
compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_orc.py::test_orc_reader_basic[TestOrcFile.testSnappy.orc-None-False-cudf]
compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_avro_reader_fastavro_integration.py::test_avro_compression[snappy-1]

Expected behavior There should be 0 errors from compute sanitizer

Additional context memcheck_errors.log attached Some of these errors may arise from dependent libraries such as nvcomp, arrow_cuda.

Reference: https://github.com/rapidsai/cudf/issues/904#issuecomment-1168230773

karthikeyann avatar Jul 04 '22 10:07 karthikeyann

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Aug 05 '22 17:08 github-actions[bot]

Some of these tests do not exist now. Some of the tests do not produce memory errors anymore. Only the following test in the above list produces memory error. time compute-sanitizer --tool memcheck pytest -v python/cudf/cudf/tests/test_csv.py::test_na_filter_empty_fields

karthikeyann avatar Mar 13 '23 13:03 karthikeyann

@vuule do you know who would be a good person to dig into that error?

If there is only one error left then perhaps we are ready to work on enabling memcheck in nightlies for pytests?

vyasr avatar Mar 15 '23 14:03 vyasr

I trace this down to this line: https://github.com/rapidsai/cudf/blob/9fbc249ce8be1bd515edb91fb768ecad870885aa/python/cudf/cudf/tests/test_csv.py#L2061

The output from the compute-sanitizer includes this:

cudf/tests/test_csv.py::test_na_filter_empty_fields ========= Invalid __global__ read of size 1 bytes
=========     at 0xd20 in cudf::io::csv::gpu::data_type_detection(cudf::io::parse_options_view, cudf::device_span<const char, (unsigned long)18446744073709551615>, cudf::device_span<const unsigned char, (unsigned long)18446744073709551615>, cudf::device_span<const unsigned long, (unsigned long)18446744073709551615>, cudf::device_span<cudf::io::column_type_histogram, (unsigned long)18446744073709551615>)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7fc216e05a0a is out of bounds
=========     and is 3 bytes after the nearest allocation at 0x7fc216e05a00 of size 8 bytes
...

As the message says, the address that is out of bounds is right after 0x7fc216e05a00 which is the opts.trie_na device pointer from the first parameter in this call: https://github.com/rapidsai/cudf/blob/9fbc249ce8be1bd515edb91fb768ecad870885aa/cpp/src/io/csv/csv_gpu.cu#L170 The device_span at opts.trie_na reports a size of 1 and the sizeof(serial_trie_node)==4.

The following line is where the trie_na is passed to be read: https://github.com/rapidsai/cudf/blob/9fbc249ce8be1bd515edb91fb768ecad870885aa/cpp/src/io/csv/csv_gpu.cu#L202

The serialized_trie_contains() function is defined here: https://github.com/rapidsai/cudf/blob/9fbc249ce8be1bd515edb91fb768ecad870885aa/cpp/src/io/utilities/trie.cuh#L83-L84

And somewhere in here is the 1 byte read past the end of the buffer. I'm not able to follow the code there very well. The trie_na device buffer is created here: https://github.com/rapidsai/cudf/blob/9fbc249ce8be1bd515edb91fb768ecad870885aa/cpp/src/io/csv/reader_impl.cu#L883

davidwendt avatar Mar 24 '23 16:03 davidwendt

Thank you for digging into this @davidwendt This one is definitely for me to fix :) I should be able to work on this some time next week.

vuule avatar Mar 24 '23 19:03 vuule

Opened https://github.com/rapidsai/cudf/pull/13011 with a fix. Targeted the PR to 23.06 as we are already in burndown, but could be convinced to merge into 23.04 :)

vuule avatar Mar 25 '23 07:03 vuule

@vuule is this issue ready to close?

GregoryKimball avatar Aug 18 '23 04:08 GregoryKimball