ARROW-11776: [C++][Java] Support parquet write from scanner to file
This PR aims to support writing Parquet files from a scanner.
Thanks for opening a pull request!
If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.
Then could you also rename the pull request title in the following format?
ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
or
MINOR: [${COMPONENT}] ${SUMMARY}
@JkSelf Can we just use the existing ticket ARROW-11776?
Why isn't this just using the C Data Interface instead of doing things like serializing schemas and manually adapting iterators?
In particular you can export a stream now and that will arrive in C++ as a RecordBatchReader
@zhztheplayer @lidavidm Sorry for the delayed response. I have addressed the comments. Please take another look. Thanks.
CC @lwhite1 @davisusanibar
https://issues.apache.org/jira/browse/ARROW-11776
:warning: Ticket has no components in JIRA, make sure you assign one.
There are lint errors https://github.com/apache/arrow/actions/runs/3290191393/jobs/5426029066#step:6:7990
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:28: Wrong order for 'java.io.IOException' import. [ImportOrder]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:31: Missing a Javadoc comment. [JavadocType]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:39:3: Missing a Javadoc comment. [JavadocMethod]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:61: 'if' construct must use '{}'s. [NeedBraces]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:86:5: WhitespaceAround: 'try' is not followed by whitespace. Empty blocks may only be represented as {} when not part of a multi-block statement (4.1.3) [WhitespaceAround]
[WARN] /arrow/java/dataset/src/main/java/org/apache/arrow/dataset/file/JniWrapper.java:60:50: Parameter name 'stream_address' must match pattern '^[a-z][a-zA-Z0-9]*$'. [ParameterName]
Error: /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:20:8: Unused import: java.io.ByteArrayOutputStream. [UnusedImports]
Error: /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:22:8: Unused import: java.io.IOException. [UnusedImports]
Error: /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:23:8: Unused import: java.nio.channels.Channels. [UnusedImports]
Error: /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:38:8: Unused import: org.apache.arrow.flatbuf.RecordBatch. [UnusedImports]
Error: /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:44:8: Unused import: org.apache.arrow.vector.ipc.WriteChannel. [UnusedImports]
Error: /arrow/java/dataset/src/test/java/org/apache/arrow/dataset/file/TestDatasetFileWriter.java:46:8: Unused import: org.apache.arrow.vector.ipc.message.MessageSerializer. [UnusedImports]
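For reference, the messages above come down to a handful of Checkstyle rules: ImportOrder, UnusedImports, JavadocType/JavadocMethod, NeedBraces, WhitespaceAround, and ParameterName. The sketch below is a hypothetical stand-alone class (`LintFriendlyReader` and `firstLine` are made-up names, not code from this PR) showing roughly what code that satisfies those rules looks like. It assumes alphabetical import ordering within the `java.*` group, which may not match the project's exact ImportOrder configuration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

/**
 * Hypothetical helper (not part of Arrow) with a type-level Javadoc comment,
 * which is what the JavadocType check asks for.
 */
final class LintFriendlyReader {

  private LintFriendlyReader() {
  }

  /**
   * Returns the first line of the given text, or an empty string if there is none.
   * A method-level comment like this one is what JavadocMethod asks for.
   *
   * @param sourceText camelCase parameter name, unlike the flagged {@code stream_address}
   * @return the first line, never null
   */
  static String firstLine(String sourceText) {
    // NeedBraces: even a single-statement 'if' body is wrapped in braces.
    if (sourceText == null) {
      return "";
    }
    // WhitespaceAround: a space separates 'try' from the opening parenthesis.
    try (BufferedReader reader = new BufferedReader(new StringReader(sourceText))) {
      String line = reader.readLine();
      return line == null ? "" : line;
    } catch (IOException e) {
      // Every import above is used, so UnusedImports has nothing to report.
      throw new UncheckedIOException(e);
    }
  }
}
```

Calling `LintFriendlyReader.firstLine("a\nb")` returns `"a"`. The same pattern (ordered imports, Javadoc on the type and on non-private methods, braces, camelCase parameters) is what `ArrowScannerReader.java` and `JniWrapper.java` would need, assuming the project's Checkstyle configuration matches the rule names printed in the log.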
@lidavidm Do you have any further comment?
Can you rebase to see if the CI issue is fixed?
Rebased.
There's still a lint error here
Warning: src/main/java/org/apache/arrow/dataset/scanner/ArrowScannerReader.java:[43,3] (javadoc) JavadocMethod: Missing a Javadoc comment.
Benchmark runs are scheduled for baseline = a2881a124339d7d50088c5b9778c725316a7003e and contender = dddf38f594a1c0b79cb1ef78eddaafd0288c6bd0. dddf38f594a1c0b79cb1ef78eddaafd0288c6bd0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:25.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2
[Failed :arrow_down:0.56% :arrow_up:0.0%] test-mac-arm
[Finished :arrow_down:0.0% :arrow_up:0.0%] ursa-i9-9960x
[Finished :arrow_down:0.11% :arrow_up:0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] dddf38f5 ec2-t3-xlarge-us-east-2
[Failed] dddf38f5 test-mac-arm
[Finished] dddf38f5 ursa-i9-9960x
[Finished] dddf38f5 ursa-thinkcentre-m75q
[Finished] a2881a12 ec2-t3-xlarge-us-east-2
[Failed] a2881a12 test-mac-arm
[Finished] a2881a12 ursa-i9-9960x
[Finished] a2881a12 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
['Python', 'R'] benchmarks have a high level of regressions on test-mac-arm.