ARROW-9773: [C++] Implement Take kernel for ChunkedArray
https://issues.apache.org/jira/browse/ARROW-9773
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for baseline = e63a13aacbf67897202c8a56fccb3a86f624a96e and contender = a43fa07eda14d3f36a9dda593af03d5f1a26325d. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped :warning: Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Failed :arrow_down:1.19% :arrow_up:1.02%] test-mac-arm
[Skipped :warning: Only ['JavaScript', 'Python', 'R'] langs are supported on ursa-i9-9960x] ursa-i9-9960x
[Finished :arrow_down:0.53% :arrow_up:1.49%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] a43fa07e test-mac-arm
[Finished] a43fa07e ursa-thinkcentre-m75q
[Finished] e63a13aa ec2-t3-xlarge-us-east-2
[Failed] e63a13aa test-mac-arm
[Failed] e63a13aa ursa-i9-9960x
[Finished] e63a13aa ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
I started by working on the Take implementation for primitive values, so I could familiarize myself with how the Take kernels work. But based on the benchmark I added, it seems like I actually made performance much worse (except in the monotonic case)! I suspect this is because having to use ChunkResolver for every index is more expensive than just copying the values into a contiguous array. Does that sound reasonable? Or is there something obviously wrong?
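For context, a stripped-down sketch of the two strategies being compared (for an Int64 array, ignoring nulls and error handling; illustrative only, not the actual kernel code) looks roughly like this:

```cpp
// Rough sketch of the two strategies; Arrow headers omitted, as in the
// benchmark snippet later in this thread.
using arrow::internal::ChunkLocation;
using arrow::internal::ChunkResolver;

// Strategy A: resolve every index with ChunkResolver (a binary search over
// the chunk boundaries for each index).
void TakeViaResolver(const arrow::ChunkedArray& values, const int64_t* indices,
                     int64_t n_indices, int64_t* out) {
  ChunkResolver resolver(values.chunks());
  for (int64_t i = 0; i < n_indices; ++i) {
    ChunkLocation loc = resolver.Resolve(indices[i]);
    const int64_t* chunk_values =
        values.chunk(static_cast<int>(loc.chunk_index))->data()->GetValues<int64_t>(1);
    out[i] = chunk_values[loc.index_in_chunk];
  }
}

// Strategy B: concatenate the chunks once, so every lookup is a plain offset
// into one contiguous buffer.
arrow::Status TakeViaConcatenate(const arrow::ChunkedArray& values,
                                 const int64_t* indices, int64_t n_indices,
                                 int64_t* out) {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> flat,
                        arrow::Concatenate(values.chunks()));
  const int64_t* flat_values = flat->data()->GetValues<int64_t>(1);
  for (int64_t i = 0; i < n_indices; ++i) {
    out[i] = flat_values[indices[i]];
  }
  return arrow::Status::OK();
}
```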
Benchmark results
Baseline:
--------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------
TakeChunkedInt64RandomIndicesNoNulls/4194304/1000 19362199 ns 19358250 ns 36 items_per_second=216.668M/s null_percent=0.1 size=4.1943M
TakeChunkedInt64RandomIndicesNoNulls/4194304/10 19504339 ns 19494278 ns 36 items_per_second=215.156M/s null_percent=10 size=4.1943M
TakeChunkedInt64RandomIndicesNoNulls/4194304/2 34162071 ns 34146150 ns 20 items_per_second=122.834M/s null_percent=50 size=4.1943M
TakeChunkedInt64RandomIndicesNoNulls/4194304/1 10458465 ns 10455803 ns 66 items_per_second=401.146M/s null_percent=100 size=4.1943M
TakeChunkedInt64RandomIndicesNoNulls/4194304/0 12260952 ns 12258093 ns 54 items_per_second=342.166M/s null_percent=0 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/1000 19419778 ns 19412389 ns 36 items_per_second=216.063M/s null_percent=0.1 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/10 29953237 ns 29944261 ns 23 items_per_second=140.07M/s null_percent=10 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/2 51350571 ns 51330500 ns 14 items_per_second=81.7117M/s null_percent=50 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/1 3319791 ns 3318972 ns 214 items_per_second=1.26374G/s null_percent=100 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/0 12277404 ns 12275145 ns 55 items_per_second=341.691M/s null_percent=0 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/1000 24581060 ns 24574690 ns 29 items_per_second=170.676M/s null_percent=0.1 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/10 22506711 ns 22501129 ns 31 items_per_second=186.404M/s null_percent=10 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/2 20736080 ns 20730853 ns 34 items_per_second=202.322M/s null_percent=50 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/1 16202271 ns 16196349 ns 43 items_per_second=258.966M/s null_percent=100 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/0 15727504 ns 15721614 ns 44 items_per_second=266.786M/s null_percent=0 size=4.1943M
Proposed:
--------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------
TakeChunkedInt64RandomIndicesNoNulls/4194304/1000 142831500 ns 142791200 ns 5 items_per_second=29.3737M/s null_percent=0.1 size=4.1943M
TakeChunkedInt64RandomIndicesNoNulls/4194304/10 144134633 ns 144110400 ns 5 items_per_second=29.1048M/s null_percent=10 size=4.1943M
TakeChunkedInt64RandomIndicesNoNulls/4194304/2 125704833 ns 125667167 ns 6 items_per_second=33.3763M/s null_percent=50 size=4.1943M
TakeChunkedInt64RandomIndicesNoNulls/4194304/1 84408114 ns 84386875 ns 8 items_per_second=49.7033M/s null_percent=100 size=4.1943M
TakeChunkedInt64RandomIndicesNoNulls/4194304/0 88094063 ns 88072375 ns 8 items_per_second=47.6234M/s null_percent=0 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/1000 111903111 ns 111859500 ns 6 items_per_second=37.4962M/s null_percent=0.1 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/10 113359923 ns 113286667 ns 6 items_per_second=37.0238M/s null_percent=10 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/2 95110995 ns 95098625 ns 8 items_per_second=44.1048M/s null_percent=50 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/1 1613900 ns 1613515 ns 437 items_per_second=2.59948G/s null_percent=100 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/0 88383021 ns 88365750 ns 8 items_per_second=47.4653M/s null_percent=0 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/1000 23783853 ns 23776276 ns 29 items_per_second=176.407M/s null_percent=0.1 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/10 24145126 ns 24140310 ns 29 items_per_second=173.747M/s null_percent=10 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/2 23058231 ns 23046233 ns 30 items_per_second=181.995M/s null_percent=50 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/1 16306472 ns 16301465 ns 43 items_per_second=257.296M/s null_percent=100 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/0 15400245 ns 15398652 ns 46 items_per_second=272.381M/s null_percent=0 size=4.1943M
@wjones127 The numbers seem a bit low to be honest, but perhaps that's just me. I haven't looked at the implementation. @lidavidm What do you think?
> The numbers seem a bit low to be honest
You are correct on that. They were too low in both the test and the baseline, by about the same factor: I was creating too large a chunked array for the indices.
Benchmark results
Baseline:
--------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------
TakeChunkedInt64RandomIndicesNoNulls/4194304/1000 2770528 ns 2769802 ns 232 items_per_second=1.5143G/s null_percent=0.1 size=4.1943M
TakeChunkedInt64RandomIndicesNoNulls/4194304/10 2802730 ns 2802061 ns 246 items_per_second=1.49686G/s null_percent=10 size=4.1943M
TakeChunkedInt64RandomIndicesNoNulls/4194304/2 4277424 ns 4276390 ns 164 items_per_second=980.805M/s null_percent=50 size=4.1943M
TakeChunkedInt64RandomIndicesNoNulls/4194304/1 2143412 ns 2142790 ns 305 items_per_second=1.9574G/s null_percent=100 size=4.1943M
TakeChunkedInt64RandomIndicesNoNulls/4194304/0 1886756 ns 1886230 ns 374 items_per_second=2.22364G/s null_percent=0 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/1000 2799584 ns 2799301 ns 249 items_per_second=1.49834G/s null_percent=0.1 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/10 3865247 ns 3864123 ns 179 items_per_second=1085.45M/s null_percent=10 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/2 6030057 ns 6028330 ns 103 items_per_second=695.765M/s null_percent=50 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/1 1441460 ns 1440831 ns 498 items_per_second=2.91103G/s null_percent=100 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/0 1923944 ns 1923507 ns 371 items_per_second=2.18055G/s null_percent=0 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/1000 1827538 ns 1827005 ns 383 items_per_second=2.29573G/s null_percent=0.1 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/10 1806226 ns 1805824 ns 387 items_per_second=2.32265G/s null_percent=10 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/2 2600097 ns 2599818 ns 269 items_per_second=1.61331G/s null_percent=50 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/1 1069649 ns 1069057 ns 667 items_per_second=3.92337G/s null_percent=100 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/0 949020 ns 948810 ns 738 items_per_second=4.42059G/s null_percent=0 size=4.1943M
Proposed:
--------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------
TakeChunkedInt64RandomIndicesNoNulls/4194304/1000 12734298 ns 12731491 ns 53 items_per_second=329.443M/s null_percent=0.1 size=4.1943M
TakeChunkedInt64RandomIndicesNoNulls/4194304/10 13030870 ns 13027741 ns 54 items_per_second=321.952M/s null_percent=10 size=4.1943M
TakeChunkedInt64RandomIndicesNoNulls/4194304/2 11699983 ns 11697067 ns 60 items_per_second=358.577M/s null_percent=50 size=4.1943M
TakeChunkedInt64RandomIndicesNoNulls/4194304/1 8201179 ns 8200176 ns 85 items_per_second=511.489M/s null_percent=100 size=4.1943M
TakeChunkedInt64RandomIndicesNoNulls/4194304/0 8281752 ns 8280094 ns 85 items_per_second=506.553M/s null_percent=0 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/1000 10054480 ns 10052444 ns 54 items_per_second=417.242M/s null_percent=0.1 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/10 10256613 ns 10254956 ns 68 items_per_second=409.003M/s null_percent=10 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/2 8735435 ns 8734213 ns 80 items_per_second=480.215M/s null_percent=50 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/1 98487 ns 98479 ns 7191 items_per_second=42.5911G/s null_percent=100 size=4.1943M
TakeChunkedInt64RandomIndicesWithNulls/4194304/0 8187746 ns 8186131 ns 84 items_per_second=512.367M/s null_percent=0 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/1000 2008840 ns 2007886 ns 352 items_per_second=2.08892G/s null_percent=0.1 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/10 2467236 ns 2466807 ns 285 items_per_second=1.7003G/s null_percent=10 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/2 3366623 ns 3365923 ns 208 items_per_second=1.24611G/s null_percent=50 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/1 1295032 ns 1294838 ns 538 items_per_second=3.23925G/s null_percent=100 size=4.1943M
TakeChunkedInt64MonotonicIndices/4194304/0 1240969 ns 1240848 ns 566 items_per_second=3.38019G/s null_percent=0 size=4.1943M
After some more testing, it seems like concatenating the buffers and indexing into the result always wins over using ChunkResolver, in both the primitive and the string case.
From my quick test of Take on chunked string arrays:
- Concatenating: items_per_second=50.3014M/s
- ChunkResolver: items_per_second=6.9115M/s
Unless there is a more performant way, we might only add Take kernels specialized for ChunkedArrays of String / Binary / List types (and also Struct, since it then needs to handle rechunking of child arrays).
@edponce I know you looked at ChunkResolver recently in #12055. Do these results seem reasonable to you?
String ChunkedArray Take Benchmark Code
```cpp
// Trimmed benchmark body: `state`, `args`, `rand` (a random::RandomArrayGenerator),
// and the `concat_chunks` flag come from the surrounding benchmark fixture.
void BenchStringTest() {
  // Create a chunked string array
  int32_t string_min_length = 0, string_max_length = 32;
  const int64_t n_chunks = 10;
  const int64_t array_size = args.size / n_chunks;
  ArrayVector chunks;
  for (int64_t i = 0; i < n_chunks; ++i) {
    auto chunk = std::static_pointer_cast<StringArray>(
        rand.String(array_size, string_min_length, string_max_length, 0));
    chunks.push_back(chunk);
  }
  auto values = ChunkedArray(chunks);
  // Create random indices covering the whole chunked array
  auto indices =
      rand.Int32(values.length(), 0, static_cast<int32_t>(values.length() - 1), 0);
  for (auto _ : state) {
    // offset_builder is unused in this trimmed snippet; a real kernel would also
    // build the output offsets.
    TypedBufferBuilder<int32_t> offset_builder;
    TypedBufferBuilder<uint8_t> data_builder;
    // UnsafeAppend requires reserved capacity; string_max_length bytes per taken
    // element is a safe upper bound.
    ABORT_NOT_OK(data_builder.Reserve(indices->length() * string_max_length));
    const int32_t* indices_values = indices->data()->GetValues<int32_t>(1);
    if (concat_chunks) {
      // Concatenate the chunks into one contiguous array
      ASSIGN_OR_ABORT(std::shared_ptr<Array> values_combined,
                      Concatenate(values.chunks()));
      // For a StringArray, buffer 1 holds the int32 offsets and buffer 2 the
      // character data
      const int32_t* values_offsets = values_combined->data()->GetValues<int32_t>(1);
      const uint8_t* values_data = values_combined->data()->GetValues<uint8_t>(2);
      // For each index
      for (int64_t i = 0; i < indices->length(); ++i) {
        int32_t index = indices_values[i];
        // Get the offset and size
        int32_t offset = values_offsets[index];
        int32_t length = values_offsets[index + 1] - offset;
        // Append the bytes to the builder
        data_builder.UnsafeAppend(values_data + offset, length);
      }
    } else {
      using arrow::internal::ChunkLocation;
      using arrow::internal::ChunkResolver;
      ChunkResolver resolver(values.chunks());
      // Stage per-chunk buffer pointers up front
      std::vector<const uint8_t*> values_data(values.num_chunks());
      std::vector<const int32_t*> values_offsets(values.num_chunks());
      for (int i = 0; i < values.num_chunks(); ++i) {
        values_offsets[i] = values.chunks()[i]->data()->GetValues<int32_t>(1);
        values_data[i] = values.chunks()[i]->data()->GetValues<uint8_t>(2);
      }
      // For each index
      for (int64_t i = 0; i < indices->length(); ++i) {
        // Resolve the chunk and the position inside it
        ChunkLocation location = resolver.Resolve(indices_values[i]);
        // Get the offset and size
        int32_t offset = values_offsets[location.chunk_index][location.index_in_chunk];
        int32_t length =
            values_offsets[location.chunk_index][location.index_in_chunk + 1] - offset;
        // Append the bytes to the builder
        data_builder.UnsafeAppend(values_data[location.chunk_index] + offset, length);
      }
    }
  }
}
```
@wjones127 I think the benchmark is not doing a fair comparison, as there is no need to copy the data and offset pointers into the temporary std::vectors; the first loop is not necessary. Nevertheless, there are only 10 chunks, so I wouldn't expect a significant penalty from it, but better to measure than assume.
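Something like the following (untested sketch, reusing the names from the snippet above) is what I mean:

```cpp
// Untested sketch: read each chunk's buffers directly from the resolved
// location, with no staging vectors and no first loop.
for (int64_t i = 0; i < indices->length(); ++i) {
  ChunkLocation loc = resolver.Resolve(indices_values[i]);
  const auto& chunk_data = *values.chunk(static_cast<int>(loc.chunk_index))->data();
  const int32_t* offsets = chunk_data.GetValues<int32_t>(1);  // string offsets
  const uint8_t* data = chunk_data.GetValues<uint8_t>(2);     // string bytes
  int32_t offset = offsets[loc.index_in_chunk];
  int32_t length = offsets[loc.index_in_chunk + 1] - offset;
  data_builder.UnsafeAppend(data + offset, length);
}
```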
Also, IIRC the binary search in ChunkResolver performs best when "taking" values pertaining to the same chunk (e.g., large chunks) because the previous chunk is cached.
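Schematically, that cached lookup behaves like this (a simplified illustration with made-up names, not the actual arrow::internal::ChunkResolver code):

```cpp
// Simplified illustration of a resolver that caches the last-hit chunk; the
// real ChunkResolver differs in details, this only shows why repeated hits to
// the same chunk are cheap.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Location {
  int64_t chunk_index;
  int64_t index_in_chunk;
};

struct CachedResolver {
  std::vector<int64_t> offsets;      // cumulative chunk lengths, offsets[0] == 0
  mutable int64_t cached_chunk = 0;  // chunk that satisfied the last Resolve()

  Location Resolve(int64_t index) const {
    // Fast path: the index still falls in the previously resolved chunk.
    if (index >= offsets[cached_chunk] && index < offsets[cached_chunk + 1]) {
      return {cached_chunk, index - offsets[cached_chunk]};
    }
    // Slow path: binary search over the cumulative offsets, then update the cache.
    auto it = std::upper_bound(offsets.begin(), offsets.end(), index);
    cached_chunk = static_cast<int64_t>(it - offsets.begin()) - 1;
    return {cached_chunk, index - offsets[cached_chunk]};
  }
};
```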
Now, a flat array will definitely be quicker to access (a simple offset calculation) than a binary search across allocations that are not necessarily contiguous in memory and may even reside in different OS pages. I'd be curious how the benchmark compares with a larger number of chunks (10, 100, 1000), which is where the concatenation penalty becomes noticeable. Obviously, the sizes of the chunks also matter.
> I'd be curious how the benchmark compares with a larger number of chunks (10, 100, 1000), which is where the concatenation penalty becomes noticeable. Obviously, the sizes of the chunks also matter.
Reran with a variety of chunk counts and total array sizes. It seems like ChunkResolver is only better in the extreme case of very small chunks (e.g., 1000 chunks and a total size of 4,194 means each chunk has about 4 elements).
[chart: concat vs. ChunkResolver results across chunk counts and array sizes]
@wjones127 Are these numbers for the random or the monotonic use case?
@wjones127 Thanks for sharing these benchmarks. Are these results measured without the extra overhead of the temporary std::vector for the ChunkResolver case?
It is reasonable that the smaller the chunk size, the better the ChunkResolver cases perform, due to chunk caching and the higher probability of consecutively hitting the same chunk.
Performance is a tricky business because it depends on the metrics you are evaluating for. Both approaches have advantages and disadvantages. If you have a large dataset and are somewhat memory constrained, the concatenation approach may not be adequate due to the extra storage. ChunkResolver is the most general solution, with the least memory overhead and still reasonable performance. AFAIK, Arrow does not track memory statistics that would permit selecting which of these approaches to use. Maybe an option could let the client code decide, but that does not seem to follow Arrow's general design.
> @wjones127 Are these numbers for the random or the monotonic use case?
That's for random. Here it is including monotonic, which makes the picture more complex:
[chart: the same comparison for both random and monotonic indices]
So it seems like ChunkResolver is better in the monotonic case, but worse in the random case. My benchmark code is available here: https://github.com/wjones127/arrow/blob/370d0870c68627224aedcfb79cfd7ceb7d0dfa99/cpp/src/arrow/compute/kernels/vector_selection_benchmark.cc#L206-L277
> Are these results measured without the extra overhead of the temporary std::vector for the ChunkResolver case?
I hadn't removed it before; it is removed in the test whose results are shown above.
> ChunkResolver is the most general solution, with the least memory overhead and still reasonable performance.
In some cases it seems like it would be a serious regression, so I'm trying to figure out which cases those are, so we can avoid using ChunkResolver in them.
It's hard to say whether that extra memory usage is all that significant. Some extra memory usage will always happen within a compute function; this one is large because it operates on the entire chunked array rather than one chunk at a time. But with memory pools we can rapidly reuse memory, so I imagine, for example, that if we run Take() on a Table with multiple string columns, the memory used temporarily for the first column could be reused when processing the second.
From the results above, what information do we know before performing the Take operation that could allow us to select the appropriate strategy?
- The main factor driving the differences is the index access order (random vs. monotonic). I do not think we can identify a priori whether the take indices are monotonic or random; if we could, we could clearly select a strategy. Please correct me if I'm wrong here.
- The number of chunks and their sizes, which we can get from the chunked array itself.
Now let's try a very hand-wavy summary of some observations based on logical array size.
Random order
- 1K --> concat is ~2x faster
- 10K --> concat is ~4x faster
- 1M and 10M --> concat is ~1.5x faster
Monotonic order
- 1K and 10K --> concat is significantly faster for up to ~10 chunks; ChunkResolver is faster for 100 and 1K chunks
- 1M and 10M --> ChunkResolver is ~1.5x faster
Based on this, a general decision rule could be: if the indices are random, or the array size and number of chunks are not that large, use concat; otherwise use ChunkResolver. But if we can't identify the access order, then it does look like concat would be the better choice, assuming most uses of Take make random accesses.
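Very roughly, such a rule could look like the sketch below; the thresholds are made up from the observations above and would need tuning, so this is not a proposal for actual kernel code:

```cpp
#include <cstdint>

// Hand-wavy strategy selection matching the summary above; purely illustrative.
enum class TakeStrategy { kConcatenate, kChunkResolver };

TakeStrategy ChooseTakeStrategy(bool indices_are_monotonic, int64_t length,
                                int num_chunks) {
  // Random (or unknown) access order: concatenation won in every measured case.
  if (!indices_are_monotonic) return TakeStrategy::kConcatenate;
  // Monotonic access over a small array with few chunks: concatenation still wins.
  if (length <= 10000 && num_chunks <= 10) return TakeStrategy::kConcatenate;
  // Monotonic access over large arrays / many chunks: ChunkResolver won.
  return TakeStrategy::kChunkResolver;
}
```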
An alternative approach could be to add a configuration flag in Arrow that states "optimize for speed" or "optimize for storage", which could be used to select strategies throughout.