arrow
arrow copied to clipboard
ARROW-17980: [C++] As-of-Join Substrait extension
Replacing https://github.com/apache/arrow/pull/14385
https://issues.apache.org/jira/browse/ARROW-17980
I'm a little torn on this PR. It seems that it includes #14415 which hasn't merged yet, although that is primarily because it blocked waiting for substrait-io/substrait#342 . At this point I think @bkietz remaining comments were fairly minor and I had addressed them. Also, since things are a bit squashed, it is unclear to me how much of #14415 this includes. Do you know if you were able to squash in all the commits from #14415 ?
Unfortunately, I can't say whether this PR includes all of #14415. To avoid confusion, I'm fine with marking this PR a draft. I'll hold to see how you and @bkietz suggest we should proceed.
I think before merging we should address if this is behaviour we want to keep.
@anjakefala It is not. I would like to hide protobuf from python (I suspect an options_internal.h is doable somehow).
Now that #14415 has merged we can move forwards on this. However, the code here is out of date (it was based on an older revision of #14415) and I worry it might not merge so cleanly. Let me know if you have trouble and I can take a stab at it.
I will try to create a replacement PR.
I ended up merging. @westonpace, does it look like what you expected?
How to deal with the following CI job's error?
/arrow/cpp/src/arrow/engine/substrait/options.h:37: error: documented symbol 'enum ARROW_ENGINE_EXPORT arrow::engine::arrow::engine::ConversionStrictness' was not declared or defined. (warning treated as error, aborting now)
I'm not a big fan of the fact that we need the using statements for the substrait namespace but if we really want that package name then it seems protobuf gives us little choice.
@westonpace, note that this PR is currently using a substrait_ext namespace
How to deal with the following CI job's error?
/arrow/cpp/src/arrow/engine/substrait/options.h:37: error: documented symbol 'enum ARROW_ENGINE_EXPORT arrow::engine::arrow::engine::ConversionStrictness' was not declared or defined. (warning treated as error, aborting now)
Fixed in https://issues.apache.org/jira/browse/ARROW-18424
@rtpsw Looks like the merge didn't go through cleanly. Let me know if you want me to clean this up. Have you been able to address the concern here: https://github.com/apache/arrow/pull/14485#pullrequestreview-1167652534 ?
@rtpsw Looks like the merge didn't go through cleanly. Let me know if you want me to clean this up.
Note that this is a result of a GitHub automated rebase. @westonpace, if you can easily clean this up, that would be helpful.
Have you been able to address the concern here: #14485 (review) ?
I have removed the dependence on protobuf in the header, but I haven't added the proposed cmake snippet. It sounded like the place for this snippet is under discussion. I'm also not sure whether it is required in this PR or can be deferred.
@rtpsw
I believe I have captured all the changes that you made. Though it is difficult to cherry pick the changes you might have made during merging. Can you please compare this now with your local branch and, if satisfied, let me know. Otherwise, if this seems off. You can always force push back to your remote.
Most CI failures seem unrelated, in particular this Substrait-test failure:
D:/a/arrow/arrow/cpp/src/arrow/status.cc:137: When trying to delete temporary directory: IOError: Cannot delete file 'C:/Users/RUNNER~1/AppData/Local/Temp/substrait-tempdir-q1gzn3f7/serde_test.arrow': . Detail: [Windows error 32] The process cannot access the file because it is being used by another process.
...
[ FAILED ] 1 test, listed below:
[ FAILED ] Substrait.ReadRelWithGlobFiles
but I'm not sure about this one:
[ RUN ] TestExpressionRegistry.VerifySupportedFunctions
/Users/runner/work/arrow/arrow/cpp/src/gandiva/expression_registry_test.cc:48: Failure
Expected: (element) != (functions.end()), actual: 8-byte object <40-DB 8C-8D 9B-7F 00-00> vs 8-byte object <40-DB 8C-8D 9B-7F 00-00>
function signature date64[ms] date_diff(date64[ms], int64) missing in supported functions.
/Users/runner/work/arrow/arrow/cpp/src/gandiva/expression_registry_test.cc:48: Failure
Expected: (element) != (functions.end()), actual: 8-byte object <40-DB 8C-8D 9B-7F 00-00> vs 8-byte object <40-DB 8C-8D 9B-7F 00-00>
function signature timestamp[ms] date_diff(timestamp[ms], int64) missing in supported functions.
[ FAILED ] TestExpressionRegistry.VerifySupportedFunctions (30 ms)
@rtpsw I had to rebase this again. I've looked through it and I think this is in good shape now. If you're ok with it I will merge once the tests pass.
I compared the codes, mostly manually, and couldn't find anything missing.
The tests were failing after a rebase due to a recent change to CheckRoundTrip (doesn't need a schema). I pushed a small fix.
Unfortunately, the test is now failing for me because the process thread marks the plan complete and then tries to join itself. I copied over the change we made in https://github.com/apache/arrow/commit/b9cda43bedb70a1e64a60ad712c58f471a7bc13d to address this but that change relies on an executor always being present which won't be true until ARROW-15732 merges.
@westonpace, I see you approved - is there anything holding up this PR?
@rtpsw I believe it's the same failure you noticed before, which is that there is a segfault because schedule task is synchronous. In the other PR you fixed it by requiring an executor but that introduces ordering problems if I remember correctly. Although it may not matter for a test and will allow us to move forward on this. I will try this real quick.
@rtpsw I believe it's the same failure you noticed before, which is that there is a segfault because schedule task is synchronous. In the other PR you fixed it by requiring an executor but that introduces ordering problems if I remember correctly. Although it may not matter for a test and will allow us to move forward on this. I will try this real quick.
If indeed it introduces these ordering problems, there are alternatives fixes for the segfault, e.g., use a standalone thread for that task or use a new ExecContext::async_executor() method that is guaranteed to return an asynchronous executor.
If indeed it introduces these ordering problems, there are alternatives fixes for the segfault, e.g., use a standalone thread for that task or use a new ExecContext::async_executor() method that is guaranteed to return an asynchronous executor.
Agreed. I think we can address this later though. I've rebased (had to address the query context change) and would like to merge this when it is green.
Agreed. I think we can address this later though. I've rebased (had to address the query context change) and would like to merge this when it is green.
The problem was still happening. I've fixed it properly in #15104 using the same fix that was in the demo branch (ensuring that we always have an executor and the async scheduler is never synchronous).
I've rebased this (once more) and the problem should no longer occur.
CI failure seems surprising but unrelated. I filed #15137 to follow up.
Failure appears unrelated. I will merge.
@rtpsw thanks for your persistence on this one. Sorry for the rebase troubles at the end.
Benchmark runs are scheduled for baseline = 985789165eb30cc575d67aa52469f802e5132e64 and contender = 85db6f7b08b0b6883496bc178e096e18027d7688. 85db6f7b08b0b6883496bc178e096e18027d7688 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2
[Failed :arrow_down:7.22% :arrow_up:6.67%] test-mac-arm
[Finished :arrow_down:2.04% :arrow_up:0.26%] ursa-i9-9960x
[Failed :arrow_down:0.0% :arrow_up:0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 85db6f7b ec2-t3-xlarge-us-east-2
[Failed] 85db6f7b test-mac-arm
[Finished] 85db6f7b ursa-i9-9960x
[Failed] 85db6f7b ursa-thinkcentre-m75q
[Finished] 98578916 ec2-t3-xlarge-us-east-2
[Failed] 98578916 test-mac-arm
[Finished] 98578916 ursa-i9-9960x
[Failed] 98578916 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
['Python', 'R'] benchmarks have high level of regressions. test-mac-arm ursa-i9-9960x
@westonpace, thanks for wrapping this up.