ARROW-17932: [C++] Implement streaming RecordBatchReader for JSON
This adds a json::StreamingReader class modeled after csv::StreamingReader. Some parts of the existing json::TableReader implementation have been refactored to use the new facilities.
See: https://issues.apache.org/jira/browse/ARROW-17932
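For context, a rough usage sketch of the new reader. Only the batch-at-a-time RecordBatchReader interface is implied by this PR; the factory name and its argument order below are assumptions modeled on csv::StreamingReader and may not match the final API.

```cpp
// Illustrative only: the Make factory signature is an assumption modeled on
// csv::StreamingReader; the ReadNext loop is the RecordBatchReader part
// implied by this PR.
#include <memory>
#include <string>

#include <arrow/io/file.h>
#include <arrow/json/api.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>

arrow::Status ReadJsonInBatches(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));

  auto read_options = arrow::json::ReadOptions::Defaults();
  auto parse_options = arrow::json::ParseOptions::Defaults();

  // Hypothetical factory call; the merged API may take different arguments.
  ARROW_ASSIGN_OR_RAISE(
      auto reader, arrow::json::StreamingReader::Make(
                       arrow::io::default_io_context(), input, read_options,
                       parse_options));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // ... process batch ...
  }
  return arrow::Status::OK();
}
```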
A couple notes on this:
- I decided to hold off on documenting threading/reentrancy characteristics until I could get a second opinion, as this was my first exposure to the project's async facilities. As written, the stated limitations of the CSV equivalent should still apply (no async-reentrancy, no multi-threading). However, it seems possible here to launch the parsing/decoding tasks in parallel once a chunked block has been serially generated (see the conceptual sketch after this list). There may be complexities I'm not considering, though.
- Test coverage handles the common cases but likely isn't comprehensive enough, although some of it should already be covered by the existing TableReader tests, since the two readers share common code.
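To make the parallelism point concrete, here is a conceptual sketch only: plain std::async stands in for Arrow's async generator machinery, and Chunk, ChunkSerially, and ParseAndDecode are hypothetical stand-ins. Chunking stays serial because each block boundary depends on the previous one, but each completed chunk could be parsed and decoded independently.

```cpp
// Conceptual sketch only. Chunk, ChunkSerially, and ParseAndDecode are
// hypothetical stand-ins; std::async is used in place of Arrow's async
// utilities to keep the idea visible.
#include <cstdint>
#include <future>
#include <string>
#include <utility>
#include <vector>

struct Chunk { std::string json_block; };
struct DecodedBatch { int64_t num_rows = 0; };

// Must stay sequential: each chunk boundary depends on where the previous one ended.
std::vector<Chunk> ChunkSerially(const std::string& input) {
  std::vector<Chunk> chunks;
  size_t start = 0;
  while (start < input.size()) {
    size_t end = input.find('\n', start);
    if (end == std::string::npos) end = input.size();
    chunks.push_back(Chunk{input.substr(start, end - start)});
    start = end + 1;
  }
  return chunks;
}

// CPU-bound and independent per chunk, so it can run in parallel.
DecodedBatch ParseAndDecode(Chunk chunk) {
  return DecodedBatch{chunk.json_block.empty() ? 0 : 1};
}

std::vector<DecodedBatch> ReadAll(const std::string& input) {
  std::vector<std::future<DecodedBatch>> pending;
  for (auto& chunk : ChunkSerially(input)) {
    // Launch decoding as soon as a chunk's boundaries are known, without
    // waiting for earlier chunks to finish decoding.
    pending.push_back(std::async(std::launch::async, ParseAndDecode, std::move(chunk)));
  }
  std::vector<DecodedBatch> out;
  for (auto& f : pending) out.push_back(f.get());
  return out;
}
```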
Some answers to the points above:
- It seems like this may be async-reentrant (and theoretically could be, as you point out)
- Tests should indeed be complemented with: 1) tests for errors, 2) tests with multiple input blocks, 3) stress tests, especially async-reentrant calls to ReadNextAsync (MakeReadaheadGenerator may help with that; a rough sketch follows this list)
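A rough sketch of such a reentrancy stress test. ReadNextAsync comes from the discussion above; the Future element type and the surrounding setup are assumptions, and MakeReadaheadGenerator could be used instead by wrapping the reader as an async generator.

```cpp
// Rough sketch of a reentrancy stress test. Assumes ReadNextAsync() returns
// Future<std::shared_ptr<RecordBatch>>; reader construction is omitted and
// would mirror the usage sketch above. Intended to be called from a gtest body.
#include <memory>
#include <vector>

#include <arrow/json/api.h>
#include <arrow/record_batch.h>
#include <arrow/testing/gtest_util.h>
#include <arrow/util/future.h>

void StressReentrantReads(arrow::json::StreamingReader* reader, int num_calls) {
  std::vector<arrow::Future<std::shared_ptr<arrow::RecordBatch>>> futures;
  futures.reserve(num_calls);
  for (int i = 0; i < num_calls; ++i) {
    // Reentrant: issue the next call before the previous one has completed.
    futures.push_back(reader->ReadNextAsync());
  }
  for (auto& fut : futures) {
    ASSERT_OK(fut.status());  // blocks until the future completes
  }
}
```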
By the way, can you also:
- rebase/merge from master
- add documentation for the new class in https://github.com/apache/arrow/blob/master/docs/source/cpp/json.rst and https://github.com/apache/arrow/blob/master/docs/source/cpp/api/formats.rst#line-separated-json ?
@pitrou I still have to do the docs, but feel free to bring up any further points on the code in the meantime.
I just opened a small PR that addresses the chunker issue: https://github.com/apache/arrow/pull/14843. If that gets merged, I have a test I can add for the newlines_in_values = true case (which would crash otherwise). Then, this will be ready for re-review.
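For reference, a hedged sketch of what that test might look like. The factory call mirrors the assumed usage sketch above, and the key ingredient is a pretty-printed (multi-line) object that the chunker must not split when newlines_in_values is enabled.

```cpp
// Hypothetical test sketch; the StreamingReader factory arguments are assumed.
#include <memory>
#include <string>

#include <gtest/gtest.h>

#include <arrow/buffer.h>
#include <arrow/io/memory.h>
#include <arrow/json/api.h>
#include <arrow/record_batch.h>
#include <arrow/testing/gtest_util.h>

TEST(JsonStreamingReader, NewlinesInValues) {
  std::string json =
      "{\"s\": \"a\"}\n"
      "{\n  \"s\": \"b\"\n}\n";  // second object spans multiple lines
  auto input =
      std::make_shared<arrow::io::BufferReader>(arrow::Buffer::FromString(json));

  auto parse_options = arrow::json::ParseOptions::Defaults();
  parse_options.newlines_in_values = true;

  // Assumed factory signature, mirroring the usage sketch above.
  ASSERT_OK_AND_ASSIGN(auto reader,
                       arrow::json::StreamingReader::Make(
                           arrow::io::default_io_context(), input,
                           arrow::json::ReadOptions::Defaults(), parse_options));

  std::shared_ptr<arrow::RecordBatch> batch;
  ASSERT_OK(reader->ReadNext(&batch));
  ASSERT_NE(batch, nullptr);
  ASSERT_EQ(batch->num_rows(), 2);
}
```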
@pitrou Just pushed a fix.
Also, I'm not sure the rationale I gave yesterday for the std::remove table validation change is sound - but if not, I can change that too.
Benchmark runs are scheduled for baseline = 375372b7ca64a8e9ce8a0fabb5e3c5637ba895b3 and contender = fe5c9fe3cb61c2c3473d0eef3119eb53b4cad30f. fe5c9fe3cb61c2c3473d0eef3119eb53b4cad30f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2
[Finished :arrow_down:0.13% :arrow_up:0.23%] test-mac-arm
[Finished :arrow_down:1.36% :arrow_up:4.62%] ursa-i9-9960x
[Finished :arrow_down:14.03% :arrow_up:9.62%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] fe5c9fe3 ec2-t3-xlarge-us-east-2
[Finished] fe5c9fe3 test-mac-arm
[Finished] fe5c9fe3 ursa-i9-9960x
[Finished] fe5c9fe3 ursa-thinkcentre-m75q
[Finished] 375372b7 ec2-t3-xlarge-us-east-2
[Finished] 375372b7 test-mac-arm
[Finished] 375372b7 ursa-i9-9960x
[Finished] 375372b7 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
['Python', 'R'] benchmarks have a high level of regressions on ursa-i9-9960x.