ARROW-11776: [C++][Java] Support parquet write from scanner to file

Open JkSelf opened this issue 2 years ago • 8 comments

This PR aims to support writing Parquet files from a scanner.

JkSelf avatar Sep 16 '22 03:09 JkSelf

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

github-actions[bot] avatar Sep 16 '22 03:09 github-actions[bot]
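
The title format requested above can be checked mechanically. The following is a minimal sketch, not the check the bot actually runs: the class name `PrTitleCheck` and the regular expression are hypothetical, and the pattern assumes that one or more bracketed components are allowed, since this PR's own title uses `[C++][Java]`.

```java
import java.util.regex.Pattern;

class PrTitleCheck {
    // Hypothetical pattern derived from the two templates in the bot message:
    //   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
    //   MINOR: [${COMPONENT}] ${SUMMARY}
    // Assumption: several "[...]" components may appear back to back.
    private static final Pattern TITLE = Pattern.compile(
        "^(ARROW-\\d+|MINOR): (\\[[^\\]]+\\])+ \\S.*$");

    // Returns true when the title follows one of the two templates.
    static boolean isValid(String title) {
        return TITLE.matcher(title).matches();
    }

    public static void main(String[] args) {
        // Title of this PR: matches the ARROW-${JIRA_ID} form.
        System.out.println(isValid(
            "ARROW-11776: [C++][Java] Support parquet write from scanner to file"));
        // Illustrative MINOR title (made up for this example).
        System.out.println(isValid("MINOR: [Java] Fix typo"));
        // No prefix and no component: rejected.
        System.out.println(isValid("Support parquet write from scanner to file"));
    }
}
```

Under these assumptions the first two titles are accepted and the last one is rejected.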

@JkSelf Can we just use the existing ticket ARROW-11776?

zhztheplayer avatar Sep 16 '22 07:09 zhztheplayer

Why isn't this just using the C Data Interface instead of doing things like serializing schemas and manually adapting iterators?

lidavidm avatar Sep 16 '22 11:09 lidavidm

In particular you can export a stream now and that will arrive in C++ as a RecordBatchReader

lidavidm avatar Sep 16 '22 11:09 lidavidm

@zhztheplayer @lidavidm Sorry for the delayed response. I have addressed the comments. Please review again. Thanks.

JkSelf avatar Sep 23 '22 06:09 JkSelf

CC @lwhite1 @davisusanibar

lidavidm avatar Sep 30 '22 15:09 lidavidm

https://issues.apache.org/jira/browse/ARROW-11776

github-actions[bot] avatar Sep 30 '22 15:09 github-actions[bot]

:warning: Ticket has no components in JIRA, make sure you assign one.

github-actions[bot] avatar Sep 30 '22 15:09 github-actions[bot]

There are lint errors https://github.com/apache/arrow/actions/runs/3290191393/jobs/5426029066#step:6:7990


[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:28: Wrong order for 'java.io.IOException' import. [ImportOrder]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:31: Missing a Javadoc comment. [JavadocType]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:39:3: Missing a Javadoc comment. [JavadocMethod]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:61: 'if' construct must use '{}'s. [NeedBraces]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:86:5: WhitespaceAround: 'try' is not followed by whitespace. Empty blocks may only be represented as {} when not part of a multi-block statement (4.1.3) [WhitespaceAround]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/file/JniWrapper.java:60:50: Parameter name 'stream_address' must match pattern '^[a-z][a-zA-Z0-9]*$'. [ParameterName]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:20:8: Unused import: java.io.ByteArrayOutputStream. [UnusedImports]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:22:8: Unused import: java.io.IOException. [UnusedImports]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:23:8: Unused import: java.nio.channels.Channels. [UnusedImports]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:38:8: Unused import: org.apache.arrow.flatbuf.RecordBatch. [UnusedImports]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:44:8: Unused import: org.apache.arrow.vector.ipc.WriteChannel. [UnusedImports]
Error:  /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:46:8: Unused import: org.apache.arrow.vector.ipc.message.MessageSerializer. [UnusedImports]

lidavidm avatar Oct 24 '22 12:10 lidavidm
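
The `ParameterName` warning in the lint output above is reported because `stream_address` contains an underscore, which the pattern `^[a-z][a-zA-Z0-9]*$` does not allow. A minimal sketch of that check follows; the class name `ParameterNameCheck` and the camelCase name `streamAddress` are illustrative assumptions, not taken from the PR.

```java
import java.util.regex.Pattern;

class ParameterNameCheck {
    // Pattern copied verbatim from the Checkstyle message in the lint output.
    private static final Pattern PARAMETER_NAME =
        Pattern.compile("^[a-z][a-zA-Z0-9]*$");

    // Returns true when the name satisfies the ParameterName pattern.
    static boolean isAllowed(String name) {
        return PARAMETER_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        // Name flagged in JniWrapper.java: the underscore is not permitted.
        System.out.println(isAllowed("stream_address"));
        // Hypothetical camelCase rename that satisfies the pattern.
        System.out.println(isAllowed("streamAddress"));
    }
}
```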

@lidavidm Do you have any further comments?

JkSelf avatar Oct 26 '22 05:10 JkSelf

Can you rebase to see if the CI issue is fixed?

lidavidm avatar Oct 26 '22 22:10 lidavidm

> Can you rebase to see if the CI issue is fixed?

Rebased.

JkSelf avatar Oct 27 '22 07:10 JkSelf

There's still a lint error here

Warning:  src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:[43,3] (javadoc) JavadocMethod: Missing a Javadoc comment.

lidavidm avatar Oct 27 '22 16:10 lidavidm

Benchmark runs are scheduled for baseline = a2881a124339d7d50088c5b9778c725316a7003e and contender = dddf38f594a1c0b79cb1ef78eddaafd0288c6bd0. dddf38f594a1c0b79cb1ef78eddaafd0288c6bd0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
[Finished :arrow_down:25.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2
[Failed :arrow_down:0.56% :arrow_up:0.0%] test-mac-arm
[Finished :arrow_down:0.0% :arrow_up:0.0%] ursa-i9-9960x
[Finished :arrow_down:0.11% :arrow_up:0.0%] ursa-thinkcentre-m75q

Buildkite builds:
[Finished] dddf38f5 ec2-t3-xlarge-us-east-2
[Failed] dddf38f5 test-mac-arm
[Finished] dddf38f5 ursa-i9-9960x
[Finished] dddf38f5 ursa-thinkcentre-m75q
[Finished] a2881a12 ec2-t3-xlarge-us-east-2
[Failed] a2881a12 test-mac-arm
[Finished] a2881a12 ursa-i9-9960x
[Finished] a2881a12 ursa-thinkcentre-m75q

Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Oct 29 '22 20:10 ursabot

['Python', 'R'] benchmarks have high level of regressions. test-mac-arm

ursabot avatar Oct 29 '22 20:10 ursabot