[BUG] double free or memory corruption when parsing some JSON
Describe the bug
We recently saw a SIGSEGV when trying to parse some customer data as JSON. As I stripped down and obfuscated the data, the failure switched to the process aborting with the error:
double free or corruption (!prev)
Aborted (core dumped)
The code to reproduce this is:
TEST_F(JsonReaderTest, JsonLinesCrash)
{
  std::string const fname = "from_json_1_10_simplified.json";
  cudf::io::json_reader_options options =
    cudf::io::json_reader_options::builder(cudf::io::source_info{fname})
      .lines(true)
      .recovery_mode(cudf::io::json_recovery_mode_t::RECOVER_WITH_NULL)
      .normalize_single_quotes(true)
      .normalize_whitespace(true)
      .mixed_types_as_string(true)
      .keep_quotes(true);
  auto result = cudf::io::read_json(options);
}
and the file is from_json_1_10_simplified.json
Just be aware that I have hit other issues with this data too: IllegalMemoryAccess errors, a hang where it looks like we are in an infinite loop, and exceptions saying that the data appears to be incorrectly formatted (top-level arrays mixed with structs appear to make this very unhappy).
I am happy to share more data/failures as needed. It just takes a lot of work to obfuscate it properly and still have an appropriate repro case. I suspect that they are almost all related to each other in some form.
Thank you for the repro and the dataset, @revans2. I'm investigating the segfault and will get back to you soon.
I ran the repro code under compute-sanitizer, but there are no CUDA errors.
I then ran with gdb and got the line of the segfault:
#16 operator()<int, cudf::io::json::device_json_column> (col=..., i=16, __closure=<synthetic pointer>) at /home/coder/cudf/cpp/src/io/json/json_column.cu:597
597 col.child_columns.clear(); // their references should be deleted too.
If this line is commented out, no segfault occurs.
Minimal repro with valgrind errors:
{ "Root": { "Key": [ { "EE": "A" } ] } }
{ "Root": { "Key": { } } }
{ "Root": { "Key": [{ "YY": 1}] } }
The issue is not in the clear() call itself, but in accessing a deleted reference to col. When clear() is called, the child columns are deleted, but references to them are still stored in columns. That is what leads to the segfault: the code tries to access a reference to a deleted object. This algorithm could be rewritten as a graph algorithm to avoid this issue.
Created a fix for the memory errors in #15798. The long-term fix should be a rewrite of the device tree creation algorithm.
Discussed with @shrshi offline about this bug and the need to refactor make_device_json_column() in json_column.cu#L477.
The complexity of make_device_json_column has grown with the addition of more features, and the current algorithm does not use any graph data structure; it uses vectors to keep track of ignored columns, mixed types, and pruned columns, and it relies on references. This code could be refactored so that new features are easier to add.
Here is the list of features currently supported by make_device_json_column, plus a few features that still need to be supported:
- Constructs the nested tree from the parent_col_id vector
- Handles array of arrays (pandas feature), e.g. [[1, 2, 3], [4, 5, 6]]
- Null literals are ignored (pandas)
- Mixed types support (spark)
  - The current implementation returns mixed types as strings. @revans2 clarified that mixed types need to return the requested type and ignore all other types (e.g. if a list is requested, all struct, string, and value tokens need to be ignored for the same column path).
- Request any column as string type, with the corresponding child columns ignored (no extra memory, no extra processing) (spark)
- Column pruning
- TODO: optimization to batch the init_to_zero calls with cub::DeviceMemcpy::Batched (consider batching the scan of offsets too)
- OPTIONAL: support map type (need more details about requirements from the spark team)