arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-17932: [C++] Implement streaming RecordBatchReader for JSON

Open benibus opened this issue 3 years ago • 3 comments

See: ARROW-17932.

This adds a json::StreamingReader class modeled after csv::StreamingReader. Some parts of the existing json::TableReader implementation have been refactored to utilize the new facilities.

benibus avatar Oct 09 '22 20:10 benibus

https://issues.apache.org/jira/browse/ARROW-17932

github-actions[bot] avatar Oct 09 '22 20:10 github-actions[bot]

A couple notes on this:

  • I decided to hold off on documenting threading/reentrancy characteristics until I could get a second opinion, as this was my first exposure to the project's async facilities. As it's written, the stated limitations of the CSV equivalent should still apply (no async-reentrancy, no multi-threading). However, here, it seems possible to launch the parsing/decoding tasks in parallel once a chunked block has been serially generated. There may be complexities I'm not considering though.
  • Test coverage handles the common cases, but likely isn't comprehensive enough - although some of it should be handled by the existing TableReader tests since it uses common code.

benibus avatar Oct 10 '22 18:10 benibus

Some answers to the points above:

  • It seems like this may be async-reentrant (and theoretically could be, as you point out)

  • Tests should indeed be complemented with: 1) tests for errors 2) tests with multiple input blocks 3) stress tests, especially async-reentrant calls to ReadNextAsync (MakeReadaheadGenerator may help with that)

pitrou avatar Oct 12 '22 16:10 pitrou

By the way, can you also:

  1. rebase/merge from master
  2. add documentation for the new class in https://github.com/apache/arrow/blob/master/docs/source/cpp/json.rst and https://github.com/apache/arrow/blob/master/docs/source/cpp/api/formats.rst#line-separated-json ?

pitrou avatar Nov 30 '22 18:11 pitrou

@pitrou I still have to do the docs, but feel free to bring up any further points on the code in the meantime.

benibus avatar Dec 01 '22 18:12 benibus

I just opened a small PR that addresses the chunker issue: https://github.com/apache/arrow/pull/14843. If that gets merged, I have a test I can add for the newlines_in_values = true case (which would crash otherwise). Then, this will be ready for re-review.

benibus avatar Dec 05 '22 15:12 benibus

@pitrou Just pushed a fix.

Also, not sure if my rationale on the std::remove table validation stuff from yesterday is sound - but if not, I can change that too.

benibus avatar Dec 07 '22 16:12 benibus

Benchmark runs are scheduled for baseline = 375372b7ca64a8e9ce8a0fabb5e3c5637ba895b3 and contender = fe5c9fe3cb61c2c3473d0eef3119eb53b4cad30f. fe5c9fe3cb61c2c3473d0eef3119eb53b4cad30f is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2 [Finished :arrow_down:0.13% :arrow_up:0.23%] test-mac-arm [Finished :arrow_down:1.36% :arrow_up:4.62%] ursa-i9-9960x [Finished :arrow_down:14.03% :arrow_up:9.62%] ursa-thinkcentre-m75q Buildkite builds: [Finished] fe5c9fe3 ec2-t3-xlarge-us-east-2 [Finished] fe5c9fe3 test-mac-arm [Finished] fe5c9fe3 ursa-i9-9960x [Finished] fe5c9fe3 ursa-thinkcentre-m75q [Finished] 375372b7 ec2-t3-xlarge-us-east-2 [Finished] 375372b7 test-mac-arm [Finished] 375372b7 ursa-i9-9960x [Finished] 375372b7 ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Dec 14 '22 15:12 ursabot

['Python', 'R'] benchmarks have high level of regressions. ursa-i9-9960x

ursabot avatar Dec 14 '22 15:12 ursabot