GH-41640: [Go] Implement BYTE_STREAM_SPLIT Parquet Encoding
Rationale for this change
This encoding is defined by the Parquet spec but does not currently have a Go implementation.
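For context, BYTE_STREAM_SPLIT scatters the bytes of fixed-width values into separate streams: byte 0 of every value is written first, then byte 1 of every value, and so on. For example, two FLOAT values with bytes a0 a1 a2 a3 and b0 b1 b2 b3 encode as a0 b0 a1 b1 a2 b2 a3 b3. The encoding does not shrink the data by itself, but it typically makes it much more compressible.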
What changes are included in this PR?
Implement BYTE_STREAM_SPLIT encoder/decoder for:
- FIXED_LEN_BYTE_ARRAY
- FLOAT
- DOUBLE
- INT32
- INT64
Are these changes tested?
Yes. See unit tests, file read conformance tests, and benchmarks.
Benchmark results on my machine
➜ go git:(impl-pq-bytestreamsplit) ✗ go test ./parquet/internal/encoding -bench=BenchmarkByteStreamSplit -benchmem
goos: darwin
goarch: arm64
pkg: github.com/apache/arrow/go/v17/parquet/internal/encoding
BenchmarkByteStreamSplitEncodingInt32/len_1024-14 524361 2168 ns/op 1889.22 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_2048-14 273886 4362 ns/op 1877.90 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_4096-14 138687 8592 ns/op 1906.84 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_8192-14 67165 16920 ns/op 1936.61 MB/s 1 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_16384-14 35726 33853 ns/op 1935.89 MB/s 4 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_32768-14 17965 66378 ns/op 1974.63 MB/s 15 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_65536-14 8434 135730 ns/op 1931.36 MB/s 32 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_1024-14 538546 2116 ns/op 1936.07 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_2048-14 274233 4219 ns/op 1941.61 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_4096-14 141768 8330 ns/op 1966.89 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_8192-14 72373 16608 ns/op 1973.04 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_16384-14 35965 33122 ns/op 1978.61 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_32768-14 17743 66605 ns/op 1967.90 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_65536-14 8268 135196 ns/op 1939.00 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_1024-14 1000000 1020 ns/op 2008.02 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_2048-14 628276 1941 ns/op 2109.96 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_4096-14 302511 3988 ns/op 2054.09 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_8192-14 150588 7886 ns/op 2077.55 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_16384-14 77336 15645 ns/op 2094.46 MB/s 1 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_32768-14 39252 30715 ns/op 2133.65 MB/s 3 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_65536-14 19698 60642 ns/op 2161.39 MB/s 14 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_1024-14 739568 1596 ns/op 1283.39 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_2048-14 387207 3168 ns/op 1293.03 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_4096-14 192223 6332 ns/op 1293.83 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_8192-14 94545 12618 ns/op 1298.49 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_16384-14 47689 25056 ns/op 1307.78 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_32768-14 22196 50484 ns/op 1298.16 MB/s 2 B/op 1 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_65536-14 10000 100918 ns/op 1298.80 MB/s 13 B/op 6 allocs/op
PASS
ok github.com/apache/arrow/go/v17/parquet/internal/encoding 39.946s
Are there any user-facing changes?
A new ByteStreamSplit encoding option is available. The godoc has been updated to reflect this.
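For illustration, opting into the new encoding when building writer properties might look like the sketch below. This is a hedged example: `WithEncoding` and `WithDictionaryDefault` mirror the package's existing writer-property options, and `parquet.Encodings.ByteStreamSplit` is assumed to be the constant this PR adds; check the godoc for the exact names.

```go
package main

import "github.com/apache/arrow/go/v17/parquet"

func byteStreamSplitProps() *parquet.WriterProperties {
	return parquet.NewWriterProperties(
		// Dictionary encoding takes precedence over the column encoding,
		// so disable it where BYTE_STREAM_SPLIT should apply.
		parquet.WithDictionaryDefault(false),
		// Assumed constant added by this PR.
		parquet.WithEncoding(parquet.Encodings.ByteStreamSplit),
	)
}
```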
- GitHub Issue: #41640
:warning: GitHub issue #41640 has been automatically assigned in GitHub to PR creator.
Feel free to ping me when the PR is ready.
I gave this some thought and realized that most type widths fall into a small set of values (2, 4, or 8 bytes). Since the width is the bound of the inner loop, I've unrolled the loops for those common cases. No functional changes are expected, but the new benchmarks are around 70% faster on encoding and 40% faster on decoding.
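As a rough illustration of the unrolling idea (not the PR's actual code), an encode loop specialized for a 4-byte width looks something like this:

```go
// Sketch only: byte-stream-split encode for a fixed 4-byte width, with the
// inner per-byte loop unrolled. Each output "stream" holds one byte position
// of every value; the streams are laid out back to back in out.
func encodeByteStreamSplitWidth4(out, in []byte) {
	numValues := len(in) / 4
	for i := 0; i < numValues; i++ {
		out[i] = in[i*4]               // stream 0: byte 0 of value i
		out[numValues+i] = in[i*4+1]   // stream 1: byte 1 of value i
		out[2*numValues+i] = in[i*4+2] // stream 2: byte 2 of value i
		out[3*numValues+i] = in[i*4+3] // stream 3: byte 3 of value i
	}
}
```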
➜ go git:(impl-pq-bytestreamsplit) ✗ go test ./parquet/internal/encoding -bench=BenchmarkByteStreamSplit -benchmem
goos: darwin
goarch: arm64
pkg: github.com/apache/arrow/go/v17/parquet/internal/encoding
BenchmarkByteStreamSplitEncodingInt32/len_1024-14 923618 1318 ns/op 3106.89 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_2048-14 476294 2588 ns/op 3165.25 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_4096-14 241237 5161 ns/op 3174.70 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_8192-14 118186 10246 ns/op 3198.09 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_16384-14 57086 21394 ns/op 3063.27 MB/s 3 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_32768-14 28582 42105 ns/op 3112.96 MB/s 9 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_65536-14 14355 83598 ns/op 3135.79 MB/s 37 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_1024-14 969301 1275 ns/op 3213.27 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_2048-14 491161 2507 ns/op 3268.20 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_4096-14 252733 4995 ns/op 3279.87 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_8192-14 122080 9982 ns/op 3282.73 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_16384-14 57710 19958 ns/op 3283.68 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_32768-14 29413 40801 ns/op 3212.44 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_65536-14 14929 80434 ns/op 3259.11 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_1024-14 483320 2569 ns/op 3188.92 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_2048-14 243577 5077 ns/op 3226.84 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_4096-14 121550 9968 ns/op 3287.47 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_8192-14 59612 20281 ns/op 3231.46 MB/s 2 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_16384-14 28456 42146 ns/op 3109.98 MB/s 9 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_32768-14 14380 82850 ns/op 3164.08 MB/s 37 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_65536-14 7472 171750 ns/op 3052.63 MB/s 71 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_1024-14 481159 2437 ns/op 3361.81 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_2048-14 249433 4876 ns/op 3360.37 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_4096-14 124795 9550 ns/op 3431.14 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_8192-14 51270 23235 ns/op 2820.57 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_16384-14 28407 41773 ns/op 3137.69 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_32768-14 14052 83892 ns/op 3124.76 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_65536-14 7069 172296 ns/op 3042.94 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_1024-14 1621041 741.6 ns/op 2761.67 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_2048-14 822153 1483 ns/op 2761.45 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_4096-14 406521 2950 ns/op 2777.36 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_8192-14 200846 6085 ns/op 2692.40 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_16384-14 103134 11823 ns/op 2771.46 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_32768-14 51595 23299 ns/op 2812.81 MB/s 2 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_65536-14 25966 46664 ns/op 2808.85 MB/s 10 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_1024-14 983942 1284 ns/op 1594.77 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_2048-14 471763 2553 ns/op 1604.39 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_4096-14 233458 5116 ns/op 1601.27 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_8192-14 115778 10291 ns/op 1592.07 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_16384-14 58516 20521 ns/op 1596.77 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_32768-14 29445 40853 ns/op 1604.17 MB/s 2 B/op 1 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_65536-14 14677 82440 ns/op 1589.91 MB/s 8 B/op 4 allocs/op
PASS
ok github.com/apache/arrow/go/v17/parquet/internal/encoding 62.918s
@zeroshade @mapleFU I've pushed up the requested changes and also have both batched reads and writes working by buffering the data for each page. See file_writer_test.go for confirmation of this behavior.
The performance is not quite as good as it was without the buffering due to additional allocations and copies. I tried to use buffers from a pool for the encoder, but used regular buffers for the decoder since it doesn't have a Release method in which I could return them to the pool.
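The encoder-side pooling is essentially the standard sync.Pool pattern; a minimal sketch under that assumption (identifiers here are illustrative, not the PR's actual ones):

```go
package encoding

import (
	"bytes"
	"sync"
)

// Encoders borrow a scratch buffer and give it back from their Release
// method; decoders have no Release hook, so they use a plain allocation.
var scratchBuffers = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func getScratch() *bytes.Buffer {
	buf := scratchBuffers.Get().(*bytes.Buffer)
	buf.Reset()
	return buf
}

// putScratch is what the encoder's Release would call.
func putScratch(buf *bytes.Buffer) {
	scratchBuffers.Put(buf)
}
```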
Let me know what you think.
@joellubi In the Arrow C++ implementation, encode is currently done like this patch; decode, however, is done in batches:
int DecodeRaw(uint8_t* out_buffer, int max_values) {
  const int values_to_decode = std::min(num_values_, max_values);
  ::arrow::util::internal::ByteStreamSplitDecode(data_, byte_width_, values_to_decode,
                                                 stride_, out_buffer);
  data_ += values_to_decode;
  num_values_ -= values_to_decode;
  len_ -= byte_width_ * values_to_decode;
  return values_to_decode;
}
Personally I think we can do it like this first.
I'm not so familiar with the performance side of the Go implementation; maybe Matt can give better advice.
@joellubi In the Arrow C++ implementation, encode is currently done like this patch; decode, however, is done in batches:
Thanks @mapleFU. I think it would be nice to keep the behavior aligned, but there is a slight difference in how the Go and C++ implementations batch reads.
In C++, the ReadValues method reads "up to batch_size values from the current data page".
In Go, the readBatch method "will read until it either reads in batchSize values or it hits the end of the column chunk, including reading multiple pages".
Since all values must be decoded within the window of a single page, it's safe to decode the whole page when SetData is called in Go, but a batch in general may span multiple pages. In C++, the values read in a single batch are limited to those remaining in the current page, so it's safe to decode in separate batches without crossing a page boundary.
but there is a slight difference in how the Go and C++ implementations batch reads.
Yes, but this doesn't change the "streaming" style of reading values, right? I think we can still keep decode as consuming small batches from the Decoder.
in Go, but a batch in general may span multiple pages.
Can we just Decode min(values-in-current-page, batchRequired) values in each page?
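A minimal sketch of that per-page batching (the reader surface here is hypothetical, not the actual column reader API):

```go
// pageReader is a hypothetical minimal surface for the sketch.
type pageReader interface {
	hasNext() bool
	valuesLeftInPage() int
	decodeInto(out []int32) // decode exactly len(out) values from the current page
	advancePage()           // load the next page (i.e. SetData on the decoder)
}

// readBatch caps each decode at the values remaining in the current page, so
// no single decode crosses a page boundary, while the loop may still walk
// multiple pages to satisfy batchSize.
func readBatch(r pageReader, out []int32, batchSize int) int {
	total := 0
	for total < batchSize && r.hasNext() {
		n := min(batchSize-total, r.valuesLeftInPage())
		r.decodeInto(out[total : total+n])
		total += n
		if r.valuesLeftInPage() == 0 {
			r.advancePage()
		}
	}
	return total
}
```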
I think we should at least try @mapleFU's idea to follow the C++ impl and decode via strides incrementally if possible and then compare the benchmarks (possibly make it a config option?)
Ultimately the trade-off here is: the current impl uses extra memory to fully decode the entire page when SetData is called, which makes partial decodes faster; the alternative uses less memory, but partial decodes are more expensive since we'd be jumping around to decode values per stride.
I'd be curious what the effect would be on performance in two scenarios:
- multiple reads of a smaller number to read a whole page
- reading a whole page + part of the next page
Is it worthwhile implementing it? Or should we look at it as a follow-up?
Thanks for the feedback @mapleFU and @zeroshade. I'm playing with batched decoding right now and will update with my findings regarding the performance impact.
@mapleFU @zeroshade I pushed up some changes to the decoders which align them more closely with the current C++ implementation. I also added a new benchmark for batched decoding. All benchmarks are updated in the PR description.
Overall, the batched approach improves performance slightly across the board for decoding. This is most likely because an intermediary buffer is no longer needed with this approach, and batches can be directly decoded into the output buffer. The new benchmark demonstrates that there's not much of a difference in performance between one-batch-per-page and many-batches-per-page decoding. There may be bigger differences for extremely small batch sizes but I did my best to pick a realistic number. Of course memory usage is less with the batched approach. We write directly into the output buffer and don't have to allocate pageSize bytes per column reader for decoding all at once.
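For reference, decoding a batch straight into the caller's buffer amounts to a strided gather across the byte streams; a rough sketch for a 4-byte width (again, not the PR's exact code):

```go
// stride is the number of values in the page (the length of each byte
// stream); offset is how many of the page's values earlier batches already
// consumed. Bytes land directly in out with no intermediate buffer.
func decodeByteStreamSplitWidth4(out, data []byte, numValues, stride, offset int) {
	for i := 0; i < numValues; i++ {
		pos := offset + i
		out[i*4] = data[pos]            // byte 0 from stream 0
		out[i*4+1] = data[stride+pos]   // byte 1 from stream 1
		out[i*4+2] = data[2*stride+pos] // byte 2 from stream 2
		out[i*4+3] = data[3*stride+pos] // byte 3 from stream 3
	}
}
```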
ByteStreamSplit Part LGTM
Overall, the batched approach improves performance slightly across the board for decoding. This is most likely because an intermediary buffer is no longer needed with this approach, and batches can be directly decoded into the output buffer. The new benchmark demonstrates that there's not much of a difference in performance between one-batch-per-page and many-batches-per-page decoding.
Nice to hear that
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 89fd5664b942f0cec1c51a4a17610aac3015d080.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.