GH-41640: [Go] Implement BYTE_STREAM_SPLIT Parquet Encoding
Rationale for this change
This encoding is defined by the Parquet spec but does not currently have a Go implementation.
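For context, BYTE_STREAM_SPLIT scatters the bytes of fixed-width values into separate streams: byte 0 of every value is written first, then byte 1 of every value, and so on. For example, two FLOAT values with bytes a0 a1 a2 a3 and b0 b1 b2 b3 encode as a0 b0 a1 b1 a2 b2 a3 b3. The encoding does not shrink the data by itself, but it typically makes it much more compressible.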
What changes are included in this PR?
Implement BYTE_STREAM_SPLIT encoder/decoder for:
- FIXED_LEN_BYTE_ARRAY
- FLOAT
- DOUBLE
- INT32
- INT64
Are these changes tested?
Yes. See unit tests, file read conformance tests, and benchmarks.
Benchmark results on my machine
➜ go git:(impl-pq-bytestreamsplit) ✗ go test ./parquet/internal/encoding -bench=BenchmarkByteStreamSplit -benchmem
goos: darwin
goarch: arm64
pkg: github.com/apache/arrow/go/v17/parquet/internal/encoding
BenchmarkByteStreamSplitEncodingInt32/len_1024-14 524361 2168 ns/op 1889.22 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_2048-14 273886 4362 ns/op 1877.90 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_4096-14 138687 8592 ns/op 1906.84 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_8192-14 67165 16920 ns/op 1936.61 MB/s 1 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_16384-14 35726 33853 ns/op 1935.89 MB/s 4 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_32768-14 17965 66378 ns/op 1974.63 MB/s 15 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_65536-14 8434 135730 ns/op 1931.36 MB/s 32 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_1024-14 538546 2116 ns/op 1936.07 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_2048-14 274233 4219 ns/op 1941.61 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_4096-14 141768 8330 ns/op 1966.89 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_8192-14 72373 16608 ns/op 1973.04 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_16384-14 35965 33122 ns/op 1978.61 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_32768-14 17743 66605 ns/op 1967.90 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_65536-14 8268 135196 ns/op 1939.00 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_1024-14 1000000 1020 ns/op 2008.02 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_2048-14 628276 1941 ns/op 2109.96 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_4096-14 302511 3988 ns/op 2054.09 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_8192-14 150588 7886 ns/op 2077.55 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_16384-14 77336 15645 ns/op 2094.46 MB/s 1 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_32768-14 39252 30715 ns/op 2133.65 MB/s 3 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_65536-14 19698 60642 ns/op 2161.39 MB/s 14 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_1024-14 739568 1596 ns/op 1283.39 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_2048-14 387207 3168 ns/op 1293.03 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_4096-14 192223 6332 ns/op 1293.83 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_8192-14 94545 12618 ns/op 1298.49 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_16384-14 47689 25056 ns/op 1307.78 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_32768-14 22196 50484 ns/op 1298.16 MB/s 2 B/op 1 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_65536-14 10000 100918 ns/op 1298.80 MB/s 13 B/op 6 allocs/op
PASS
ok github.com/apache/arrow/go/v17/parquet/internal/encoding 39.946s
Are there any user-facing changes?
A new ByteStreamSplit encoding option is available. The godoc has been updated to reflect this.
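For illustration, opting into the new encoding when building writer properties might look like the sketch below. This is a hedged example: `WithEncoding` and `WithDictionaryDefault` mirror the package's existing writer-property options, and `parquet.Encodings.ByteStreamSplit` is assumed to be the constant this PR adds; check the godoc for the exact names.

```go
package main

import "github.com/apache/arrow/go/v17/parquet"

func byteStreamSplitProps() *parquet.WriterProperties {
	return parquet.NewWriterProperties(
		// Dictionary encoding takes precedence over the column encoding,
		// so disable it where BYTE_STREAM_SPLIT should apply.
		parquet.WithDictionaryDefault(false),
		// Assumed constant added by this PR.
		parquet.WithEncoding(parquet.Encodings.ByteStreamSplit),
	)
}
```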
- GitHub Issue: #41640
:warning: GitHub issue #41640 has been automatically assigned in GitHub to PR creator.
Feel free to ping me when the PR is ready.
I gave this some thought and realized that most type widths fall into a small set of values (2, 4, or 8 bytes). Since the width is the bound of the inner loop, I've unrolled the loops for those common cases. No functional changes are expected, but the new benchmarks are around 70% faster on encoding and 40% faster on decoding.
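As a rough illustration of the unrolling idea (not the PR's actual code), an encode loop specialized for a 4-byte width looks something like this:

```go
// Sketch only: byte-stream-split encode for a fixed 4-byte width, with the
// inner per-byte loop unrolled. Each output "stream" holds one byte position
// of every value; the streams are laid out back to back in out.
func encodeByteStreamSplitWidth4(out, in []byte) {
	numValues := len(in) / 4
	for i := 0; i < numValues; i++ {
		out[i] = in[i*4]               // stream 0: byte 0 of value i
		out[numValues+i] = in[i*4+1]   // stream 1: byte 1 of value i
		out[2*numValues+i] = in[i*4+2] // stream 2: byte 2 of value i
		out[3*numValues+i] = in[i*4+3] // stream 3: byte 3 of value i
	}
}
```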
➜ go git:(impl-pq-bytestreamsplit) ✗ go test ./parquet/internal/encoding -bench=BenchmarkByteStreamSplit -benchmem
goos: darwin
goarch: arm64
pkg: github.com/apache/arrow/go/v17/parquet/internal/encoding
BenchmarkByteStreamSplitEncodingInt32/len_1024-14 923618 1318 ns/op 3106.89 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_2048-14 476294 2588 ns/op 3165.25 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_4096-14 241237 5161 ns/op 3174.70 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_8192-14 118186 10246 ns/op 3198.09 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_16384-14 57086 21394 ns/op 3063.27 MB/s 3 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_32768-14 28582 42105 ns/op 3112.96 MB/s 9 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt32/len_65536-14 14355 83598 ns/op 3135.79 MB/s 37 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_1024-14 969301 1275 ns/op 3213.27 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_2048-14 491161 2507 ns/op 3268.20 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_4096-14 252733 4995 ns/op 3279.87 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_8192-14 122080 9982 ns/op 3282.73 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_16384-14 57710 19958 ns/op 3283.68 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_32768-14 29413 40801 ns/op 3212.44 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt32/len_65536-14 14929 80434 ns/op 3259.11 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_1024-14 483320 2569 ns/op 3188.92 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_2048-14 243577 5077 ns/op 3226.84 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_4096-14 121550 9968 ns/op 3287.47 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_8192-14 59612 20281 ns/op 3231.46 MB/s 2 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_16384-14 28456 42146 ns/op 3109.98 MB/s 9 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_32768-14 14380 82850 ns/op 3164.08 MB/s 37 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingInt64/len_65536-14 7472 171750 ns/op 3052.63 MB/s 71 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_1024-14 481159 2437 ns/op 3361.81 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_2048-14 249433 4876 ns/op 3360.37 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_4096-14 124795 9550 ns/op 3431.14 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_8192-14 51270 23235 ns/op 2820.57 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_16384-14 28407 41773 ns/op 3137.69 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_32768-14 14052 83892 ns/op 3124.76 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingInt64/len_65536-14 7069 172296 ns/op 3042.94 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_1024-14 1621041 741.6 ns/op 2761.67 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_2048-14 822153 1483 ns/op 2761.45 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_4096-14 406521 2950 ns/op 2777.36 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_8192-14 200846 6085 ns/op 2692.40 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_16384-14 103134 11823 ns/op 2771.46 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_32768-14 51595 23299 ns/op 2812.81 MB/s 2 B/op 0 allocs/op
BenchmarkByteStreamSplitEncodingFixedLenByteArray/len_65536-14 25966 46664 ns/op 2808.85 MB/s 10 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_1024-14 983942 1284 ns/op 1594.77 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_2048-14 471763 2553 ns/op 1604.39 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_4096-14 233458 5116 ns/op 1601.27 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_8192-14 115778 10291 ns/op 1592.07 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_16384-14 58516 20521 ns/op 1596.77 MB/s 0 B/op 0 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_32768-14 29445 40853 ns/op 1604.17 MB/s 2 B/op 1 allocs/op
BenchmarkByteStreamSplitDecodingFixedLenByteArray/len_65536-14 14677 82440 ns/op 1589.91 MB/s 8 B/op 4 allocs/op
PASS
ok github.com/apache/arrow/go/v17/parquet/internal/encoding 62.918s
@zeroshade @mapleFU I've pushed up the requested changes and also have both batched reads and writes working by buffering the data for each page. See file_writer_test.go for confirmation of this behavior.
The performance is not quite as good as it was without the buffering due to additional allocations and copies. I tried to use buffers from a pool for the encoder, but used regular buffers for the decoder since it doesn't have a Release method in which I could return them to the pool.
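The encoder-side pooling is essentially the standard sync.Pool pattern; a minimal sketch under that assumption (identifiers here are illustrative, not the PR's actual ones):

```go
package encoding

import (
	"bytes"
	"sync"
)

// Encoders borrow a scratch buffer and give it back from their Release
// method; decoders have no Release hook, so they use a plain allocation.
var scratchBuffers = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func getScratch() *bytes.Buffer {
	buf := scratchBuffers.Get().(*bytes.Buffer)
	buf.Reset()
	return buf
}

// putScratch is what the encoder's Release would call.
func putScratch(buf *bytes.Buffer) {
	scratchBuffers.Put(buf)
}
```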
Let me know what you think.
@joellubi In the Arrow C++ implementation, encode is currently done like this patch; decode, however, is done in batches:
int DecodeRaw(uint8_t* out_buffer, int max_values) {
  const int values_to_decode = std::min(num_values_, max_values);
  ::arrow::util::internal::ByteStreamSplitDecode(data_, byte_width_, values_to_decode,
                                                 stride_, out_buffer);
  data_ += values_to_decode;
  num_values_ -= values_to_decode;
  len_ -= byte_width_ * values_to_decode;
  return values_to_decode;
}
Personally I think we can do it like this first.
I'm not so familiar with the performance side of the Go implementation; maybe Matt can give better advice.
@joellubi In the Arrow C++ implementation, encode is currently done like this patch; decode, however, is done in batches:
Thanks @mapleFU. I think it would be nice to keep the behavior aligned, but there is a slight difference in how the Go and C++ implementations batch reads.
In C++, the ReadValues method reads "up to batch_size values from the current data page".
In Go, the readBatch method "will read until it either reads in batchSize values or it hits the end of the column chunk, including reading multiple pages".
Since all values must be decoded within the window of a single page, it's safe to decode the whole page when SetData is called in Go, but a batch in general may span multiple pages. In C++, the values read in a single batch are limited to those remaining in the current page, so it's safe to decode in separate batches without crossing a page boundary.
but there is a slight difference in how the Go and C++ implementations batch reads.
Yes, but this doesn't change the "streaming" style of reading values, right? I think we can still keep decode as consuming small batches from the Decoder.
in Go, but a batch in general may span multiple pages.
Can we just Decode min(values-in-current-page, batchRequired) values in each page?
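A minimal sketch of that per-page batching (the reader surface here is hypothetical, not the actual column reader API):

```go
// pageReader is a hypothetical minimal surface for the sketch.
type pageReader interface {
	hasNext() bool
	valuesLeftInPage() int
	decodeInto(out []int32) // decode exactly len(out) values from the current page
	advancePage()           // load the next page (i.e. SetData on the decoder)
}

// readBatch caps each decode at the values remaining in the current page, so
// no single decode crosses a page boundary, while the loop may still walk
// multiple pages to satisfy batchSize.
func readBatch(r pageReader, out []int32, batchSize int) int {
	total := 0
	for total < batchSize && r.hasNext() {
		n := min(batchSize-total, r.valuesLeftInPage())
		r.decodeInto(out[total : total+n])
		total += n
		if r.valuesLeftInPage() == 0 {
			r.advancePage()
		}
	}
	return total
}
```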
I think we should at least try @mapleFU's idea to follow the C++ impl and decode via strides incrementally if possible and then compare the benchmarks (possibly make it a config option?)
Ultimately the trade-off here is: the current impl uses extra memory to fully decode the entire page when SetData is called, which makes partial decodes faster; the alternative uses less memory, but partial decodes are more expensive since we'd be jumping around to decode values per stride.
I'd be curious what the effect would be on performance in two scenarios:
- multiple reads of a smaller number to read a whole page
- reading a whole page + part of the next page
Is it worthwhile implementing it? Or should we look at it as a follow-up?
Thanks for the feedback @mapleFU and @zeroshade. I'm playing with batched decoding right now and will update with my findings regarding the performance impact.
@mapleFU @zeroshade I pushed up some changes to the decoders which align them more closely with the current C++ implementation. I also added a new benchmark for batched decoding. All benchmarks are updated in the PR description.
Overall, the batched approach improves performance slightly across the board for decoding. This is most likely because an intermediary buffer is no longer needed with this approach, and batches can be directly decoded into the output buffer. The new benchmark demonstrates that there's not much of a difference in performance between one-batch-per-page and many-batches-per-page decoding. There may be bigger differences for extremely small batch sizes but I did my best to pick a realistic number. Of course memory usage is less with the batched approach. We write directly into the output buffer and don't have to allocate pageSize bytes per column reader for decoding all at once.
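For reference, decoding a batch straight into the caller's buffer amounts to a strided gather across the byte streams; a rough sketch for a 4-byte width (again, not the PR's exact code):

```go
// stride is the number of values in the page (the length of each byte
// stream); offset is how many of the page's values earlier batches already
// consumed. Bytes land directly in out with no intermediate buffer.
func decodeByteStreamSplitWidth4(out, data []byte, numValues, stride, offset int) {
	for i := 0; i < numValues; i++ {
		pos := offset + i
		out[i*4] = data[pos]            // byte 0 from stream 0
		out[i*4+1] = data[stride+pos]   // byte 1 from stream 1
		out[i*4+2] = data[2*stride+pos] // byte 2 from stream 2
		out[i*4+3] = data[3*stride+pos] // byte 3 from stream 3
	}
}
```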
ByteStreamSplit Part LGTM
Overall, the batched approach improves performance slightly across the board for decoding. This is most likely because an intermediary buffer is no longer needed with this approach, and batches can be directly decoded into the output buffer. The new benchmark demonstrates that there's not much of a difference in performance between one-batch-per-page and many-batches-per-page decoding.
Nice to hear that
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 89fd5664b942f0cec1c51a4a17610aac3015d080.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.