arrow ARROW-17980: [C++] As-of-Join Substrait extension

Replacing https://github.com/apache/arrow/pull/14385

Oct 24 '22 10:10 rtpsw

https://issues.apache.org/jira/browse/ARROW-17980

Oct 24 '22 10:10 github-actions[bot]

I'm a little torn on this PR. It seems that it includes #14415 which hasn't merged yet, although that is primarily because it blocked waiting for substrait-io/substrait#342 . At this point I think @bkietz remaining comments were fairly minor and I had addressed them. Also, since things are a bit squashed, it is unclear to me how much of #14415 this includes. Do you know if you were able to squash in all the commits from #14415 ?

Unfortunately, I can't say whether this PR includes all of #14415. To avoid confusion, I'm fine with marking this PR a draft. I'll hold to see how you and @bkietz suggest we should proceed.

Oct 24 '22 18:10 rtpsw

I think before merging we should address if this is behaviour we want to keep.

@anjakefala It is not. I would like to hide protobuf from python (I suspect an options_internal.h is doable somehow).

Nov 03 '22 21:11 westonpace

Now that #14415 has merged we can move forwards on this. However, the code here is out of date (it was based on an older revision of #14415) and I worry it might not merge so cleanly. Let me know if you have trouble and I can take a stab at it.

I will try to create a replacement PR.

Nov 25 '22 11:11 rtpsw

I ended up merging. @westonpace, does it look like what you expected?

Nov 25 '22 16:11 rtpsw

How to deal with the following CI job's error?

/arrow/cpp/src/arrow/engine/substrait/options.h:37: error: documented symbol 'enum ARROW_ENGINE_EXPORT arrow::engine::arrow::engine::ConversionStrictness' was not declared or defined. (warning treated as error, aborting now)

Nov 27 '22 10:11 rtpsw

I'm not a big fan of the fact that we need the using statements for the substrait namespace but if we really want that package name then it seems protobuf gives us little choice.

@westonpace, note that this PR is currently using a substrait_ext namespace

Dec 02 '22 08:12 rtpsw

How to deal with the following CI job's error?

/arrow/cpp/src/arrow/engine/substrait/options.h:37: error: documented symbol 'enum ARROW_ENGINE_EXPORT arrow::engine::arrow::engine::ConversionStrictness' was not declared or defined. (warning treated as error, aborting now)

Fixed in https://issues.apache.org/jira/browse/ARROW-18424

Dec 05 '22 15:12 rtpsw

@rtpsw Looks like the merge didn't go through cleanly. Let me know if you want me to clean this up. Have you been able to address the concern here: https://github.com/apache/arrow/pull/14485#pullrequestreview-1167652534 ?

Dec 06 '22 07:12 westonpace

@rtpsw Looks like the merge didn't go through cleanly. Let me know if you want me to clean this up.

Note that this is a result of a GitHub automated rebase. @westonpace, if you can easily clean this up, that would be helpful.

Have you been able to address the concern here: #14485 (review) ?

I have removed the dependence on protobuf in the header, but I haven't added the proposed cmake snippet. It sounded like the place for this snippet is under discussion. I'm also not sure whether it is required in this PR or can be deferred.

Dec 06 '22 09:12 rtpsw

@rtpsw

I believe I have captured all the changes that you made. Though it is difficult to cherry pick the changes you might have made during merging. Can you please compare this now with your local branch and, if satisfied, let me know. Otherwise, if this seems off. You can always force push back to your remote.

Dec 07 '22 01:12 westonpace

Most CI failures seem unrelated, in particular this Substrait-test failure:

D:/a/arrow/arrow/cpp/src/arrow/status.cc:137: When trying to delete temporary directory: IOError: Cannot delete file 'C:/Users/RUNNER~1/AppData/Local/Temp/substrait-tempdir-q1gzn3f7/serde_test.arrow': . Detail: [Windows error 32] The process cannot access the file because it is being used by another process.
...
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Substrait.ReadRelWithGlobFiles

but I'm not sure about this one:

[ RUN      ] TestExpressionRegistry.VerifySupportedFunctions
/Users/runner/work/arrow/arrow/cpp/src/gandiva/expression_registry_test.cc:48: Failure
Expected: (element) != (functions.end()), actual: 8-byte object <40-DB 8C-8D 9B-7F 00-00> vs 8-byte object <40-DB 8C-8D 9B-7F 00-00>
function signature date64[ms] date_diff(date64[ms], int64) missing in supported functions.

/Users/runner/work/arrow/arrow/cpp/src/gandiva/expression_registry_test.cc:48: Failure
Expected: (element) != (functions.end()), actual: 8-byte object <40-DB 8C-8D 9B-7F 00-00> vs 8-byte object <40-DB 8C-8D 9B-7F 00-00>
function signature timestamp[ms] date_diff(timestamp[ms], int64) missing in supported functions.

[  FAILED  ] TestExpressionRegistry.VerifySupportedFunctions (30 ms)

Dec 10 '22 09:12 rtpsw

@rtpsw I had to rebase this again. I've looked through it and I think this is in good shape now. If you're ok with it I will merge once the tests pass.

I compared the codes, mostly manually, and couldn't find anything missing.

Dec 10 '22 09:12 rtpsw

The tests were failing after a rebase due to a recent change to CheckRoundTrip (doesn't need a schema). I pushed a small fix.

Unfortunately, the test is now failing for me because the process thread marks the plan complete and then tries to join itself. I copied over the change we made in https://github.com/apache/arrow/commit/b9cda43bedb70a1e64a60ad712c58f471a7bc13d to address this but that change relies on an executor always being present which won't be true until ARROW-15732 merges.

Dec 15 '22 23:12 westonpace

@westonpace, I see you approved - is there anything holding up this PR?

Dec 22 '22 14:12 rtpsw

@rtpsw I believe it's the same failure you noticed before, which is that there is a segfault because schedule task is synchronous. In the other PR you fixed it by requiring an executor but that introduces ordering problems if I remember correctly. Although it may not matter for a test and will allow us to move forward on this. I will try this real quick.

Dec 22 '22 17:12 westonpace

@rtpsw I believe it's the same failure you noticed before, which is that there is a segfault because schedule task is synchronous. In the other PR you fixed it by requiring an executor but that introduces ordering problems if I remember correctly. Although it may not matter for a test and will allow us to move forward on this. I will try this real quick.

If indeed it introduces these ordering problems, there are alternatives fixes for the segfault, e.g., use a standalone thread for that task or use a new ExecContext::async_executor() method that is guaranteed to return an asynchronous executor.

Dec 23 '22 09:12 rtpsw

If indeed it introduces these ordering problems, there are alternatives fixes for the segfault, e.g., use a standalone thread for that task or use a new ExecContext::async_executor() method that is guaranteed to return an asynchronous executor.

Agreed. I think we can address this later though. I've rebased (had to address the query context change) and would like to merge this when it is green.

Dec 26 '22 15:12 westonpace

Agreed. I think we can address this later though. I've rebased (had to address the query context change) and would like to merge this when it is green.

The problem was still happening. I've fixed it properly in #15104 using the same fix that was in the demo branch (ensuring that we always have an executor and the async scheduler is never synchronous).

I've rebased this (once more) and the problem should no longer occur.

Dec 30 '22 19:12 westonpace

CI failure seems surprising but unrelated. I filed #15137 to follow up.

Dec 30 '22 23:12 westonpace

Failure appears unrelated. I will merge.

Dec 31 '22 20:12 westonpace

@rtpsw thanks for your persistence on this one. Sorry for the rebase troubles at the end.

Dec 31 '22 20:12 westonpace

Benchmark runs are scheduled for baseline = 985789165eb30cc575d67aa52469f802e5132e64 and contender = 85db6f7b08b0b6883496bc178e096e18027d7688. 85db6f7b08b0b6883496bc178e096e18027d7688 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2 [Failed :arrow_down:7.22% :arrow_up:6.67%] test-mac-arm [Finished :arrow_down:2.04% :arrow_up:0.26%] ursa-i9-9960x [Failed :arrow_down:0.0% :arrow_up:0.0%] ursa-thinkcentre-m75q Buildkite builds: [Finished] 85db6f7b ec2-t3-xlarge-us-east-2 [Failed] 85db6f7b test-mac-arm [Finished] 85db6f7b ursa-i9-9960x [Failed] 85db6f7b ursa-thinkcentre-m75q [Finished] 98578916 ec2-t3-xlarge-us-east-2 [Failed] 98578916 test-mac-arm [Finished] 98578916 ursa-i9-9960x [Failed] 98578916 ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Dec 31 '22 23:12 ursabot

['Python', 'R'] benchmarks have high level of regressions. test-mac-arm ursa-i9-9960x

Dec 31 '22 23:12 ursabot

@westonpace, thanks for wrapping this up.

Jan 01 '23 21:01 rtpsw

arrow arrow copied to clipboard

ARROW-17980: [C++] As-of-Join Substrait extension

arrow
arrow copied to clipboard