GH-39565: [C++] Do not concatenate chunked values of fixed-width types to run "array_take"
Rationale for this change
Concatenating a chunked array into a single array before running the array_take kernels is very inefficient and can lead to out-of-memory crashes. See also #25822.
What changes are included in this PR?
- Improvements in the dispatching logic of `TakeMetaFunction` ("take") to make "array_take" able to have a `chunked_exec` kernel for some types
- Implementation of kernels for "array_take" that can receive a `ChunkedArray` as `values` and produce an output without concatenating these chunks
Are these changes tested?
By existing tests. Some tests were added in previous PRs that introduced some of the infrastructure to support this.
- GitHub Issue: #39565
May I ask an unrelated question: when would we call assert and when would we call DCHECK, since I think they are likely to be the same?
> May I ask an unrelated question: when would we call assert and when would we call DCHECK, since I think they are likely to be the same?
We call assert in headers because we don't want to pay the cost of including logging.h everywhere. Think of assert as a lighter-weight debug check. But if you see an assert in a .cc file, tell me to change it to DCHECK*.
Hmm, some of the CI failures are definitely related to this PR (there are crashes when invoking a chunked array take). Could you take a look?
Ok, a first issue is that taking a chunked dictionary array crashes (are there no tests for this?):
```python
>>> a1 = pa.array(["foo", "bar", "foo"]).dictionary_encode()
>>> a2 = pa.array([None, "bar", "bar", "foo"]).dictionary_encode()
>>> a = pa.chunked_array([a1, a2])
>>> a.take([1,4])
/home/antoine/arrow/dev/cpp/src/arrow/array/array_dict.cc:84: Check failed: (data->dictionary) != (nullptr)
```
Another issue is this reproducer for the C/GLib failures:
```python
>>> i1 = pa.array([1,0], pa.int16())
>>> i2 = pa.array([2], pa.int16())
>>> indices = pa.chunked_array([i1, i2])
>>> pa.array([1,0,2], pa.int16()).take(indices)
<pyarrow.lib.ChunkedArray object at 0x7fa24b1473d0>
[
<Invalid array: Missing values buffer in non-empty fixed-width array>,
<Invalid array: Missing values buffer in non-empty fixed-width array>
]
```
@pitrou I improved the tests (#43291) and fixed one of the issues you reproduced. Tomorrow I will fix the one related to dictionary arrays (which the modified tests successfully reproduce).
The tests start failing after the `Take: Make "array_take" handle CA->C cases by populating VectorKernel::exec_chunked` commit.
(I pushed the test changes here as well, so this PR is expected to fail before I fix the DICTIONARY array issue.)
@pitrou this is ready for review again. Passing all tests (which were extended in a previous PR to more carefully test the implementation).
@github-actions crossbow submit -g cpp
It should be noted that this entails significant regressions on our micro-benchmarks (when only one thread is working and there's little memory pressure):
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (14)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
benchmark baseline contender change % counters
TakeChunkedChunkedStringRandomIndicesWithNulls/524288/1 1.324G items/sec 1.582G items/sec 19.459 {'family_index': 7, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedChunkedStringRandomIndicesWithNulls/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1766, 'null_percent': 100.0}
TakeChunkedChunkedStringRandomIndicesNoNulls/524288/1 225.526M items/sec 239.940M items/sec 6.391 {'family_index': 6, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedChunkedStringRandomIndicesNoNulls/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 300, 'null_percent': 100.0}
TakeChunkedChunkedStringRandomIndicesNoNulls/524288/1000 34.689M items/sec 36.379M items/sec 4.871 {'family_index': 6, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedChunkedStringRandomIndicesNoNulls/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 45, 'null_percent': 0.1}
TakeChunkedChunkedStringRandomIndicesNoNulls/524288/10 37.765M items/sec 39.242M items/sec 3.913 {'family_index': 6, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedChunkedStringRandomIndicesNoNulls/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 51, 'null_percent': 10.0}
TakeChunkedChunkedStringMonotonicIndices/524288/1 227.141M items/sec 234.760M items/sec 3.354 {'family_index': 8, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedChunkedStringMonotonicIndices/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 303, 'null_percent': 100.0}
TakeChunkedChunkedStringMonotonicIndices/524288/0 80.957M items/sec 82.842M items/sec 2.328 {'family_index': 8, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedChunkedStringMonotonicIndices/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 108, 'null_percent': 0.0}
TakeChunkedChunkedStringRandomIndicesNoNulls/524288/2 55.847M items/sec 57.044M items/sec 2.143 {'family_index': 6, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedChunkedStringRandomIndicesNoNulls/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 74, 'null_percent': 50.0}
TakeChunkedChunkedStringRandomIndicesNoNulls/524288/0 40.181M items/sec 40.949M items/sec 1.912 {'family_index': 6, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedChunkedStringRandomIndicesNoNulls/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 54, 'null_percent': 0.0}
TakeChunkedChunkedStringRandomIndicesWithNulls/524288/1000 33.967M items/sec 34.330M items/sec 1.067 {'family_index': 7, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedChunkedStringRandomIndicesWithNulls/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 45, 'null_percent': 0.1}
TakeChunkedChunkedStringRandomIndicesWithNulls/524288/0 40.286M items/sec 39.971M items/sec -0.782 {'family_index': 7, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedChunkedStringRandomIndicesWithNulls/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 53, 'null_percent': 0.0}
TakeChunkedChunkedStringMonotonicIndices/524288/2 88.903M items/sec 87.193M items/sec -1.924 {'family_index': 8, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedChunkedStringMonotonicIndices/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 117, 'null_percent': 50.0}
TakeChunkedChunkedStringMonotonicIndices/524288/10 71.953M items/sec 70.246M items/sec -2.372 {'family_index': 8, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedChunkedStringMonotonicIndices/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 96, 'null_percent': 10.0}
TakeChunkedChunkedStringMonotonicIndices/524288/1000 70.934M items/sec 69.020M items/sec -2.697 {'family_index': 8, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedChunkedStringMonotonicIndices/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 94, 'null_percent': 0.1}
TakeChunkedChunkedStringRandomIndicesWithNulls/524288/2 57.716M items/sec 55.633M items/sec -3.608 {'family_index': 7, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedChunkedStringRandomIndicesWithNulls/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 70, 'null_percent': 50.0}
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (61)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
benchmark baseline contender change % counters
TakeChunkedChunkedStringRandomIndicesWithNulls/524288/10 41.015M items/sec 37.718M items/sec -8.039 {'family_index': 7, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedChunkedStringRandomIndicesWithNulls/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 54, 'null_percent': 10.0}
TakeChunkedChunkedFSBMonotonicIndices/524288/0/9 286.271M items/sec 221.528M items/sec -22.616 {'family_index': 5, 'per_family_instance_index': 9, 'run_name': 'TakeChunkedChunkedFSBMonotonicIndices/524288/0/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 377, 'byte_width': 9.0, 'null_percent': 0.0}
TakeChunkedChunkedFSBMonotonicIndices/524288/1/9 225.478M items/sec 169.862M items/sec -24.666 {'family_index': 5, 'per_family_instance_index': 7, 'run_name': 'TakeChunkedChunkedFSBMonotonicIndices/524288/1/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 301, 'byte_width': 9.0, 'null_percent': 100.0}
TakeChunkedChunkedFSBMonotonicIndices/524288/2/9 125.842M items/sec 92.300M items/sec -26.654 {'family_index': 5, 'per_family_instance_index': 5, 'run_name': 'TakeChunkedChunkedFSBMonotonicIndices/524288/2/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 168, 'byte_width': 9.0, 'null_percent': 50.0}
TakeChunkedChunkedFSBMonotonicIndices/524288/1000/9 188.734M items/sec 136.580M items/sec -27.634 {'family_index': 5, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedChunkedFSBMonotonicIndices/524288/1000/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 252, 'byte_width': 9.0, 'null_percent': 0.1}
TakeChunkedChunkedFSBMonotonicIndices/524288/10/9 165.385M items/sec 119.599M items/sec -27.685 {'family_index': 5, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedChunkedFSBMonotonicIndices/524288/10/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 221, 'byte_width': 9.0, 'null_percent': 10.0}
TakeChunkedChunkedInt64MonotonicIndices/524288/2 172.836M items/sec 107.515M items/sec -37.794 {'family_index': 2, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedChunkedInt64MonotonicIndices/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 232, 'null_percent': 50.0}
TakeChunkedChunkedFSBMonotonicIndices/524288/2/8 173.764M items/sec 107.624M items/sec -38.063 {'family_index': 5, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedChunkedFSBMonotonicIndices/524288/2/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 231, 'byte_width': 8.0, 'null_percent': 50.0}
TakeChunkedChunkedInt64MonotonicIndices/524288/1 348.952M items/sec 214.875M items/sec -38.423 {'family_index': 2, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedChunkedInt64MonotonicIndices/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 459, 'null_percent': 100.0}
TakeChunkedChunkedFSBMonotonicIndices/524288/1/8 349.461M items/sec 214.302M items/sec -38.676 {'family_index': 5, 'per_family_instance_index': 6, 'run_name': 'TakeChunkedChunkedFSBMonotonicIndices/524288/1/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 469, 'byte_width': 8.0, 'null_percent': 100.0}
TakeChunkedFlatInt64MonotonicIndices/524288/2 186.121M items/sec 110.460M items/sec -40.652 {'family_index': 11, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedFlatInt64MonotonicIndices/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 249, 'null_percent': 50.0}
TakeChunkedFlatInt64MonotonicIndices/524288/1 402.685M items/sec 237.605M items/sec -40.995 {'family_index': 11, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedFlatInt64MonotonicIndices/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 537, 'null_percent': 100.0}
TakeChunkedChunkedInt64MonotonicIndices/524288/10 232.518M items/sec 134.620M items/sec -42.103 {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedChunkedInt64MonotonicIndices/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 312, 'null_percent': 10.0}
TakeChunkedChunkedInt64MonotonicIndices/524288/0 550.430M items/sec 318.038M items/sec -42.220 {'family_index': 2, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedChunkedInt64MonotonicIndices/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 733, 'null_percent': 0.0}
TakeChunkedChunkedFSBMonotonicIndices/524288/10/8 233.913M items/sec 134.929M items/sec -42.316 {'family_index': 5, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedChunkedFSBMonotonicIndices/524288/10/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 311, 'byte_width': 8.0, 'null_percent': 10.0}
TakeChunkedChunkedFSBMonotonicIndices/524288/0/8 550.806M items/sec 315.331M items/sec -42.751 {'family_index': 5, 'per_family_instance_index': 8, 'run_name': 'TakeChunkedChunkedFSBMonotonicIndices/524288/0/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 731, 'byte_width': 8.0, 'null_percent': 0.0}
TakeChunkedChunkedInt64MonotonicIndices/524288/1000 276.014M items/sec 154.981M items/sec -43.850 {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedChunkedInt64MonotonicIndices/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 366, 'null_percent': 0.1}
TakeChunkedChunkedFSBMonotonicIndices/524288/1000/8 275.488M items/sec 154.412M items/sec -43.950 {'family_index': 5, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedChunkedFSBMonotonicIndices/524288/1000/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 373, 'byte_width': 8.0, 'null_percent': 0.1}
TakeChunkedFlatInt64MonotonicIndices/524288/10 255.324M items/sec 136.158M items/sec -46.672 {'family_index': 11, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedFlatInt64MonotonicIndices/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 340, 'null_percent': 10.0}
TakeChunkedFlatInt64MonotonicIndices/524288/1000 309.002M items/sec 163.782M items/sec -46.996 {'family_index': 11, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedFlatInt64MonotonicIndices/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 412, 'null_percent': 0.1}
TakeChunkedFlatInt64MonotonicIndices/524288/0 689.590M items/sec 358.889M items/sec -47.956 {'family_index': 11, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedFlatInt64MonotonicIndices/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 918, 'null_percent': 0.0}
TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/2/9 72.765M items/sec 25.177M items/sec -65.400 {'family_index': 4, 'per_family_instance_index': 5, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/2/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 95, 'byte_width': 9.0, 'null_percent': 50.0}
TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/2/8 85.926M items/sec 26.774M items/sec -68.841 {'family_index': 4, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/2/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 115, 'byte_width': 8.0, 'null_percent': 50.0}
TakeChunkedChunkedInt64RandomIndicesWithNulls/524288/2 86.185M items/sec 26.795M items/sec -68.910 {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedChunkedInt64RandomIndicesWithNulls/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 114, 'null_percent': 50.0}
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/2 89.997M items/sec 26.168M items/sec -70.923 {'family_index': 10, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedFlatInt64RandomIndicesWithNulls/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 118, 'null_percent': 50.0}
TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/10/9 99.099M items/sec 27.917M items/sec -71.829 {'family_index': 4, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/10/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 133, 'byte_width': 9.0, 'null_percent': 10.0}
TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/2/9 100.736M items/sec 26.627M items/sec -73.568 {'family_index': 3, 'per_family_instance_index': 5, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/2/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 134, 'byte_width': 9.0, 'null_percent': 50.0}
TakeChunkedChunkedInt64RandomIndicesWithNulls/524288/10 125.265M items/sec 29.373M items/sec -76.551 {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedChunkedInt64RandomIndicesWithNulls/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 167, 'null_percent': 10.0}
TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/10/8 125.343M items/sec 29.371M items/sec -76.567 {'family_index': 4, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/10/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 167, 'byte_width': 8.0, 'null_percent': 10.0}
TakeChunkedChunkedInt64RandomIndicesNoNulls/524288/2 130.626M items/sec 28.264M items/sec -78.363 {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedChunkedInt64RandomIndicesNoNulls/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 175, 'null_percent': 50.0}
TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/10/9 138.137M items/sec 29.807M items/sec -78.422 {'family_index': 3, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/10/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 184, 'byte_width': 9.0, 'null_percent': 10.0}
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/10 133.355M items/sec 28.684M items/sec -78.491 {'family_index': 10, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedFlatInt64RandomIndicesWithNulls/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 178, 'null_percent': 10.0}
TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/2/8 130.778M items/sec 28.093M items/sec -78.519 {'family_index': 3, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/2/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 174, 'byte_width': 8.0, 'null_percent': 50.0}
TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/1000/9 144.875M items/sec 30.345M items/sec -79.054 {'family_index': 4, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/1000/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 194, 'byte_width': 9.0, 'null_percent': 0.1}
TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/1000/9 151.949M items/sec 30.617M items/sec -79.851 {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/1000/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 203, 'byte_width': 9.0, 'null_percent': 0.1}
TakeChunkedFlatInt64RandomIndicesNoNulls/524288/2 137.927M items/sec 27.554M items/sec -80.023 {'family_index': 9, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedFlatInt64RandomIndicesNoNulls/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 184, 'null_percent': 50.0}
TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/1000/8 174.761M items/sec 31.542M items/sec -81.951 {'family_index': 4, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/1000/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 235, 'byte_width': 8.0, 'null_percent': 0.1}
TakeChunkedChunkedInt64RandomIndicesWithNulls/524288/1000 175.242M items/sec 31.416M items/sec -82.073 {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedChunkedInt64RandomIndicesWithNulls/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 234, 'null_percent': 0.1}
TakeChunkedChunkedInt64RandomIndicesNoNulls/524288/10 177.783M items/sec 31.170M items/sec -82.467 {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedChunkedInt64RandomIndicesNoNulls/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 238, 'null_percent': 10.0}
TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/10/8 177.439M items/sec 30.984M items/sec -82.538 {'family_index': 3, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/10/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 236, 'byte_width': 8.0, 'null_percent': 10.0}
TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/1000/8 191.503M items/sec 31.727M items/sec -83.433 {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/1000/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 255, 'byte_width': 8.0, 'null_percent': 0.1}
TakeChunkedChunkedInt64RandomIndicesNoNulls/524288/1000 193.078M items/sec 31.946M items/sec -83.454 {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedChunkedInt64RandomIndicesNoNulls/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 256, 'null_percent': 0.1}
TakeChunkedFlatInt64RandomIndicesNoNulls/524288/10 189.467M items/sec 30.142M items/sec -84.091 {'family_index': 9, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedFlatInt64RandomIndicesNoNulls/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 252, 'null_percent': 10.0}
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1000 196.665M items/sec 30.772M items/sec -84.353 {'family_index': 10, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 263, 'null_percent': 0.1}
TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/1/9 222.674M items/sec 34.244M items/sec -84.622 {'family_index': 3, 'per_family_instance_index': 7, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/1/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 298, 'byte_width': 9.0, 'null_percent': 100.0}
TakeChunkedFlatInt64RandomIndicesNoNulls/524288/1000 206.747M items/sec 30.798M items/sec -85.104 {'family_index': 9, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedFlatInt64RandomIndicesNoNulls/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 275, 'null_percent': 0.1}
TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/0/9 246.430M items/sec 35.724M items/sec -85.504 {'family_index': 4, 'per_family_instance_index': 9, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/0/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 327, 'byte_width': 9.0, 'null_percent': 0.0}
TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/0/9 246.763M items/sec 35.540M items/sec -85.598 {'family_index': 3, 'per_family_instance_index': 9, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/0/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 330, 'byte_width': 9.0, 'null_percent': 0.0}
TakeChunkedChunkedInt64RandomIndicesNoNulls/524288/1 346.024M items/sec 35.537M items/sec -89.730 {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedChunkedInt64RandomIndicesNoNulls/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 460, 'null_percent': 100.0}
TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/1/8 346.674M items/sec 35.559M items/sec -89.743 {'family_index': 3, 'per_family_instance_index': 6, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/1/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 459, 'byte_width': 8.0, 'null_percent': 100.0}
TakeChunkedFlatInt64RandomIndicesNoNulls/524288/1 394.791M items/sec 34.899M items/sec -91.160 {'family_index': 9, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedFlatInt64RandomIndicesNoNulls/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 520, 'null_percent': 100.0}
TakeChunkedChunkedInt64RandomIndicesWithNulls/524288/0 465.112M items/sec 38.014M items/sec -91.827 {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedChunkedInt64RandomIndicesWithNulls/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 624, 'null_percent': 0.0}
TakeChunkedChunkedInt64RandomIndicesNoNulls/524288/0 468.038M items/sec 38.050M items/sec -91.870 {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedChunkedInt64RandomIndicesNoNulls/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 627, 'null_percent': 0.0}
TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/0/8 468.096M items/sec 38.036M items/sec -91.874 {'family_index': 4, 'per_family_instance_index': 8, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/0/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 622, 'byte_width': 8.0, 'null_percent': 0.0}
TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/0/8 468.613M items/sec 37.968M items/sec -91.898 {'family_index': 3, 'per_family_instance_index': 8, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesNoNulls/524288/0/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 625, 'byte_width': 8.0, 'null_percent': 0.0}
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/0 537.788M items/sec 36.774M items/sec -93.162 {'family_index': 10, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedFlatInt64RandomIndicesWithNulls/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 717, 'null_percent': 0.0}
TakeChunkedFlatInt64RandomIndicesNoNulls/524288/0 537.631M items/sec 36.663M items/sec -93.181 {'family_index': 9, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedFlatInt64RandomIndicesNoNulls/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 711, 'null_percent': 0.0}
TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/1/9 894.444M items/sec 40.420M items/sec -95.481 {'family_index': 4, 'per_family_instance_index': 7, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/1/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1165, 'byte_width': 9.0, 'null_percent': 100.0}
TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/1/8 1.087G items/sec 40.402M items/sec -96.283 {'family_index': 4, 'per_family_instance_index': 6, 'run_name': 'TakeChunkedChunkedFSBRandomIndicesWithNulls/524288/1/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1456, 'byte_width': 8.0, 'null_percent': 100.0}
TakeChunkedChunkedInt64RandomIndicesWithNulls/524288/1 1.100G items/sec 40.322M items/sec -96.333 {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedChunkedInt64RandomIndicesWithNulls/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1473, 'null_percent': 100.0}
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1 1.840G items/sec 39.773M items/sec -97.839 {'family_index': 10, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2450, 'null_percent': 100.0}
Note that there's something inherently suboptimal in this PR: we're trading the concatenation of the chunked values (essentially allocating a new values array) against the resolution of many chunked indices (essentially allocating two new indices arrays). This is only beneficial if the value width is quite large (say a 256-byte FSB) or the number of indices is much smaller than the number of values.
(the fact that our take benchmarks currently use the same number of indices as values doesn't help :-))
In the end, perhaps we want to guard this with a heuristic based on total byte size of values and length of indices.
> Note that there's something inherently suboptimal in this PR: we're trading the concatenation of the chunked values (essentially allocating a new values array) against the resolution of many chunked indices (essentially allocating two new indices arrays). This is only beneficial if the value width is quite large (say a 256-byte FSB) or the number of indices is much smaller than the number of values.
I don't remember getting bad results like this on an ARM Macbook. I will try benchmarking on x86 as well.
> (the fact that our take benchmarks currently use the same number of indices as values doesn't help :-))
Indeed. My assumption when going for the allocation of indices was that indices are cheaper to allocate than the concatenation of values.
> In the end, perhaps we want to guard this with a heuristic based on total byte size of values and length of indices.
Yeah, but this kind of branching is really bad for the confidence in unit tests. I will see what can be done.
> Yeah, but this kind of branching is really bad for the confidence in unit tests. I will see what can be done.
We can do that in a later PR and issue as well.
Revision: c0de2252c2ec3bd208e50bc88393dfb8163d8c5b
Submitted crossbow builds: ursacomputing/crossbow @ actions-ea181364bd
Something you could do in this PR, though, is to resolve the indices in chunks of say 1024. This would at least ease the allocation cost and memory pressure.
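A rough pure-Python sketch of that idea (a hypothetical helper, not Arrow code; the real logic would live in the C++ chunk-resolution path):

```python
import bisect

def resolve_in_blocks(chunk_lengths, logical_indices, block_size=1024):
    """Resolve logical indices into (chunk, in-chunk offset) pairs one block
    at a time, so the scratch index buffers stay small."""
    # Cumulative end offsets of each value chunk, e.g. [3, 5] for lengths [3, 2].
    ends = []
    total = 0
    for n in chunk_lengths:
        total += n
        ends.append(total)
    for start in range(0, len(logical_indices), block_size):
        block = logical_indices[start:start + block_size]
        resolved = []
        for i in block:
            chunk = bisect.bisect_right(ends, i)  # one binary search per index
            prev_end = ends[chunk - 1] if chunk > 0 else 0
            resolved.append((chunk, i - prev_end))
        yield resolved

# Two value chunks of lengths 3 and 2: logical index 4 is chunk 1, offset 1.
print(list(resolve_in_blocks([3, 2], [4, 0, 2], block_size=2)))
```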
I submitted PR #43772 to add benchmarks with a smaller selection factor (the current take benchmarks would have a selection factor of 1.0). We could merge that PR before yours, to check what the slowdown is in that case too.
I ran the new benchmarks (those with a small selection factor) and the results are more varied, see below:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (13)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
benchmark baseline contender change % counters
TakeChunkedFlatInt64FewMonotonicIndices/524288/0 3.487G items/sec 6.131G items/sec 75.818 {'family_index': 5, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedFlatInt64FewMonotonicIndices/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4676, 'null_percent': 0.0, 'selection_factor': 0.05}
TakeChunkedChunkedInt64FewMonotonicIndices/524288/0 3.397G items/sec 5.278G items/sec 55.353 {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedChunkedInt64FewMonotonicIndices/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4603, 'null_percent': 0.0, 'selection_factor': 0.05}
TakeChunkedFlatInt64FewMonotonicIndices/524288/1 3.019G items/sec 4.508G items/sec 49.291 {'family_index': 5, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedFlatInt64FewMonotonicIndices/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4000, 'null_percent': 100.0, 'selection_factor': 0.05}
TakeChunkedChunkedInt64FewMonotonicIndices/524288/1 2.802G items/sec 4.135G items/sec 47.589 {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedChunkedInt64FewMonotonicIndices/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3740, 'null_percent': 100.0, 'selection_factor': 0.05}
TakeChunkedFlatInt64FewMonotonicIndices/524288/1000 2.316G items/sec 2.910G items/sec 25.628 {'family_index': 5, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedFlatInt64FewMonotonicIndices/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3104, 'null_percent': 0.1, 'selection_factor': 0.05}
TakeChunkedChunkedInt64FewMonotonicIndices/524288/1000 2.254G items/sec 2.684G items/sec 19.075 {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedChunkedInt64FewMonotonicIndices/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3082, 'null_percent': 0.1, 'selection_factor': 0.05}
TakeChunkedFlatInt64FewMonotonicIndices/524288/10 2.257G items/sec 2.635G items/sec 16.733 {'family_index': 5, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedFlatInt64FewMonotonicIndices/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3052, 'null_percent': 10.0, 'selection_factor': 0.05}
TakeChunkedChunkedInt64FewMonotonicIndices/524288/10 2.179G items/sec 2.466G items/sec 13.175 {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedChunkedInt64FewMonotonicIndices/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2878, 'null_percent': 10.0, 'selection_factor': 0.05}
TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/1 5.541G items/sec 5.568G items/sec 0.490 {'family_index': 2, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7344, 'null_percent': 100.0, 'selection_factor': 0.05}
TakeChunkedChunkedStringFewMonotonicIndices/524288/1 2.753G items/sec 2.736G items/sec -0.625 {'family_index': 3, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedChunkedStringFewMonotonicIndices/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3674, 'null_percent': 100.0, 'selection_factor': 0.05}
TakeChunkedFlatInt64FewMonotonicIndices/524288/2 1.880G items/sec 1.852G items/sec -1.442 {'family_index': 5, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedFlatInt64FewMonotonicIndices/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2520, 'null_percent': 50.0, 'selection_factor': 0.05}
TakeChunkedChunkedInt64FewMonotonicIndices/524288/2 1.818G items/sec 1.766G items/sec -2.872 {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedChunkedInt64FewMonotonicIndices/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2403, 'null_percent': 50.0, 'selection_factor': 0.05}
TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/1000 269.164M items/sec 256.848M items/sec -4.576 {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 358, 'null_percent': 0.1, 'selection_factor': 0.05}
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (17)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
benchmark baseline contender change % counters
TakeChunkedChunkedStringFewMonotonicIndices/524288/2 854.187M items/sec 810.956M items/sec -5.061 {'family_index': 3, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedChunkedStringFewMonotonicIndices/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1124, 'null_percent': 50.0, 'selection_factor': 0.05}
TakeChunkedChunkedStringFewMonotonicIndices/524288/0 420.243M items/sec 393.347M items/sec -6.400 {'family_index': 3, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedChunkedStringFewMonotonicIndices/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 561, 'null_percent': 0.0, 'selection_factor': 0.05}
TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/2 751.346M items/sec 700.885M items/sec -6.716 {'family_index': 2, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 982, 'null_percent': 50.0, 'selection_factor': 0.05}
TakeChunkedChunkedStringFewMonotonicIndices/524288/1000 383.643M items/sec 353.206M items/sec -7.934 {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedChunkedStringFewMonotonicIndices/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 511, 'null_percent': 0.1, 'selection_factor': 0.05}
TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/0 308.601M items/sec 283.361M items/sec -8.179 {'family_index': 2, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 411, 'null_percent': 0.0, 'selection_factor': 0.05}
TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/10 327.708M items/sec 300.772M items/sec -8.220 {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedChunkedStringFewRandomIndicesWithNulls/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 440, 'null_percent': 10.0, 'selection_factor': 0.05}
TakeChunkedChunkedStringFewMonotonicIndices/524288/10 452.054M items/sec 393.971M items/sec -12.849 {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedChunkedStringFewMonotonicIndices/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 602, 'null_percent': 10.0, 'selection_factor': 0.05}
TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/2 1.276G items/sec 509.634M items/sec -60.060 {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1691, 'null_percent': 50.0, 'selection_factor': 0.05}
TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/2 1.323G items/sec 517.354M items/sec -60.894 {'family_index': 4, 'per_family_instance_index': 2, 'run_name': 'TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1750, 'null_percent': 50.0, 'selection_factor': 0.05}
TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/10 1.615G items/sec 549.705M items/sec -65.960 {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2171, 'null_percent': 10.0, 'selection_factor': 0.05}
TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/10 1.709G items/sec 553.717M items/sec -67.595 {'family_index': 4, 'per_family_instance_index': 1, 'run_name': 'TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2274, 'null_percent': 10.0, 'selection_factor': 0.05}
TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/1000 2.026G items/sec 587.265M items/sec -71.021 {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2670, 'null_percent': 0.1, 'selection_factor': 0.05}
TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/1000 2.078G items/sec 588.468M items/sec -71.680 {'family_index': 4, 'per_family_instance_index': 0, 'run_name': 'TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2771, 'null_percent': 0.1, 'selection_factor': 0.05}
TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/0 3.429G items/sec 717.438M items/sec -79.076 {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4639, 'null_percent': 0.0, 'selection_factor': 0.05}
TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/0 3.582G items/sec 721.296M items/sec -79.861 {'family_index': 4, 'per_family_instance_index': 4, 'run_name': 'TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4799, 'null_percent': 0.0, 'selection_factor': 0.05}
TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/1 3.873G items/sec 759.915M items/sec -80.381 {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedChunkedInt64FewRandomIndicesWithNulls/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5316, 'null_percent': 100.0, 'selection_factor': 0.05}
TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/1 4.173G items/sec 761.551M items/sec -81.751 {'family_index': 4, 'per_family_instance_index': 3, 'run_name': 'TakeChunkedFlatInt64FewRandomIndicesWithNulls/524288/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5385, 'null_percent': 100.0, 'selection_factor': 0.05}
Putting aside the String perf changes, which are minor and probably irrelevant, we can see that on monotonic indices this PR actually increases performance, while still hurting it on random indices.
> ...while still hurting it on random indices.
The binary searches for chunk index are a costly part of the process that is avoided when value chunks are concatenated.
I have ideas to make the search interleaved so that multiple binary searches can happen in parallel (i.e. likely sharing the same cache line accesses to make progress).
@pitrou wouldn't it make sense to leave the responsibility for concatenation to a layer above the kernels? Like a query optimizer? They are in a better position to make memory/time trade-offs than the context-less kernel.
The worst regression (-81%) has the kernel still at 4G items/sec.
TakeChunkedFlatInt64FewRandomIndicesWithNulls
4.173G items/sec 761.551M items/sec -81.751
I find it very inelegant to put these heuristics at the compute kernel level.
Imagine a pipeline trying to save on memory allocations by keeping the array chunked as much as possible, and then a simple filter operation requires allocating enough memory to keep it all in memory.
Another case would be a pipeline where the caller is consolidating a big contiguous array for more operations than just array_take. They should be the ones concatenating.
> @pitrou wouldn't it make sense to leave the responsibility for concatenation to a layer above the kernels? Like a query optimizer? They are in a better position to make memory/time trade-offs than the context-less kernel.
Ideally perhaps. In practice this assumes that 1) there is a query optimizer and 2) it has enough information about implementation details to make an informed decision.
In practice, take is probably often called directly in the context of PyArrow's sort methods. So this:
https://github.com/apache/arrow/blob/e62fbaafd129931b1c217fcaa1b4c254087ab289/python/pyarrow/table.pxi#L1139-L1143
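For reference, a hedged pyarrow sketch of that sort path (Table.sort_by is roughly sort_indices followed by take, so the indices cover every row, i.e. a selection factor of 1.0):

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"x": pa.chunked_array([[3, 1], [2]])})

# Sorting a table boils down to computing a permutation and then taking it,
# which exercises the chunked "take" path benchmarked above.
indices = pc.sort_indices(table, sort_keys=[("x", "ascending")])
print(table.take(indices))
```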
> The worst regression (-81%) has the kernel still at 4G items/sec.
I might be misreading, but this is the worst regression on the new benchmarks, right (those with a small selection factor)? On the benchmarks with a 1.0 selection factor (such as when sorting), the worst absolute results are around 25 Mitems/sec AFAICT. Or are those results obsolete?
> Imagine a pipeline trying to save on memory allocations by keeping the array chunked as much as possible, and then a simple filter operation requires allocating enough memory to keep it all in memory.
Well, currently "take" would always concatenate array chunks, so at least there is no regression in that regard.
Still, I understand the concern. We might want to expose an additional option to entirely disable concatenation when possible. But that might be overkill as well.
@pitrou what conditional checks should I add here to avoid regressions? I'm giving up on making the non-concatenation versions work well for integer arrays; I want to merge this PR sooner rather than later and then start working on the string array implementation, which is what will unlock the most user value in the first place.
Building on this (arguably simplified) analysis:
> we're trading the concatenation of the chunked values (essentially allocating a new values array) against the resolution of many chunked indices (essentially allocating two new indices arrays). This is only beneficial if the value width is quite large (say a 256-byte FSB) or the number of indices is much smaller than the number of values.
and assuming the following known values:
- `n_values`: length of the `values` input
- `n_indices`: length of the `indices` input (governing the output length)
- `value_width`: byte width of the individual values
Then a simple heuristic could be to concatenate iff `n_indices * 16 > n_values * value_width`. This wouldn't take into account the larger computational cost associated with chunked indexing, but at least it would disable the chunked resolution approach when it doesn't make sense at all.
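A minimal sketch of that heuristic as a standalone predicate (the name and the fixed 16-byte per-index cost are assumptions taken from the analysis above, not an actual Arrow API):

```python
def should_concatenate(n_indices: int, n_values: int, value_width: int) -> bool:
    """Hypothetical predicate: concatenate the value chunks iff materializing
    the resolved indices would cost more than copying the values once."""
    return n_indices * 16 > n_values * value_width

# 512Ki int64 values with an equal number of indices (selection factor 1.0):
print(should_concatenate(524288, 524288, 8))              # True -> concatenate
# Same values with only 5% of the indices:
print(should_concatenate(int(524288 * 0.05), 524288, 8))  # False -> stay chunked
```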
(btw, a moderate improvement could probably be achieved by using `CompressedChunkLocation`)