Fix: Map functions crash on out of bounds cases
Which issue does this PR close?
- Closes https://github.com/apache/datafusion/issues/16187
Rationale for this change
This PR fixes a panic crash in array indexing operations when accessing elements from map_values() results that contain structs with Null-typed fields.
What changes are included in this PR?
datafusion/functions-nested/src/extract.rs: Added null field detection and safe handling
Are these changes tested?
Yes - unit tests are added covering:
- Main crash scenario: Struct with
Null-typed fields - Mixed null/valid fields, out-of-bounds access
Are there any user-facing changes?
No, only behavior change.
@krishvishal are you still planning to wrap this PR up?
@comphead I've changed the implementation a bit to handle nulls properly. Previous fix just outputs NULL for queries like select [named_struct('a', 1, 'b', null)][1]; instead of {a: 1, b: NULL} .
I've also added tests to .array.slt file.
I think its ready.
Apologies for the delay.
Thanks @krishvishal the latest version becomes much more complicated compared to prev one. This can be a subject to check the performance.
What is the reason for adding the specific handler, some tests not passing?
@comphead I've had to add the handler because the previous fix caused wrong behavior.
For example the following query currently returns:
> select [named_struct('a', 1, 'b', null)][1];
+-----------------------------------------------------------------------+
| make_array(named_struct(Utf8("a"),Int64(1),Utf8("b"),NULL))[Int64(1)] |
+-----------------------------------------------------------------------+
| {a: 1, b: NULL} |
+-----------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.007 seconds.
This is correct. But my previous fix (commit: d9a699ff7b787bf75da8415431fccf8cd676fa3d) returned NULL. So to fix the behavior while also handling out of bounds indexes, I've had to add this handler.
@comphead I've pushed additional fixes you've asked. I think the PR is ready. Let me know if it is okay.
Thanks @krishvishal let me quickly double check if we can do it with less data massaging. Since this will happen on execution layer it would be called for every batch of data possibly hitting the performance
@comphead, can you please tell if there is something I can do to move this forward? Are there any relevant benches I could either run or adapt for this case?
@comphead, can you please tell if there is something I can do to move this forward? Are there any relevant benches I could either run or adapt for this case?
Sorry I didn't have a chance to look into that, I'm planning to do it this week
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Closing in favor of #16802