spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-53873][SQL] ExplodeBase.eval IterableOnce directly on input

Open asautins opened this issue 2 months ago • 3 comments

What changes were proposed in this pull request?

This PR proposes changing the value returned from ExplodeBase.eval from Array[InternalRow] to IterableOnce[IternalRow]. Array converts to an IterableOnce with an implicit conversion. This change proposes directly returning an IterableOnce[InternalRow] object that is a thin wrapper around input, avoiding Array[InternalRow] allocation and population.

By eliminating the intermediate array the method is changed from O(N) to O(1).

Why are the changes needed?

I wouldn't say they are needed, however the current implementation creates an intermediate array that is populated and then iterated. The creation and population of the array would be eliminated with this PR.

Does this PR introduce any user-facing change?

This PR has no user facing changes.

How was this patch tested?

Github Actions

Was this patch authored or co-authored using generative AI tooling?

The creation of this patch was assisted by Microsoft Copilot. The initial change was made using Agent mode.

asautins avatar Oct 11 '25 18:10 asautins

What's the purpose of attempting to avoid this implicit conversion? Is it perf? If so lets see some perf numbers. I'm a little cautious about the repeated calling to inputMap keyArray and valueArray in the maptype case, my gut says this might actually be worse.

I'm not sure I understand avoiding an implicit conversion.

The purpose of the change is to remove creating and populating of an array when returning an iterator would suffice. This changes the eval method from O(n) to O(1) as well as removes the array allocation.

In regard to implicit conversion are you referring to how the array is currently populated with the foreach method for both the ArrayData and MapData types? These methods are members of ArrayData and MapData respectively. Accessing the underlying keyArray and valueArray are done using the same get method as used in the foreach method. The same is done for accessing the ArrayData.

asautins avatar Nov 25 '25 17:11 asautins

The implicit conversion being avoided was from your PR description. So the goal is to remove an intermediary array allocation, and now that you've removed the repeated calls which were introduced in the first version (likely allocating more arrays than before) I can see this potentially being better. We still alloc an array for the input elems but we avoid an array for the eval result correct?

holdenk avatar Nov 25 '25 18:11 holdenk

The implicit conversion being avoided was from your PR description. So the goal is to remove an intermediary array allocation, and now that you've removed the repeated calls which were introduced in the first version (likely allocating more arrays than before) I can see this potentially being better. We still alloc an array for the input elems but we avoid an array for the eval result correct?

Yea, the description may not be the best. If I look at it sideways it says what I want, but clearly not that obvious.

But yes, the goal is to remove the creating and populating of the intermediate array.

In regard to allocating an array for the input elements, the input is processed with Generator.eval which returns an IterableOnce. So depending on the implementation of Generator.eval it may be an array or not. Is that what you are looking at?

asautins avatar Nov 25 '25 18:11 asautins