GH-41706: [C++][Acero] Modify asof_join_node test to provoke issue
Rationale for this change
This is initial PR. I found that with specyfics parameters test fails.
What changes are included in this PR?
In this PR I provoke fail of asof_join_node_test.
- GitHub Issue: #41706
Thanks for opening a pull request!
If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.
Then could you also rename the pull request title in the following format?
GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
or
MINOR: [${COMPONENT}] ${SUMMARY}
In the case of PARQUET issues on JIRA the title also supports:
PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
See also:
:warning: GitHub issue #41706 has been automatically assigned in GitHub to PR creator.
:warning: GitHub issue #41706 has been automatically assigned in GitHub to PR creator.
This CI failure seems relevant:
FAILED: src/arrow/acero/CMakeFiles/arrow_acero_objlib.dir/asof_join_node.cc.o
/usr/bin/ccache /usr/lib/ccache/clang++-14 -DADDRESS_SANITIZER -DARROW_ACERO_EXPORTING -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 -DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 -DARROW_UBSAN -DJSON_DIAGNOSTICS=0 -DJSON_USE_IMPLICIT_CONVERSIONS=1 -I/build/cpp/src -I/arrow/cpp/src -I/arrow/cpp/src/generated -isystem /build/cpp/opentelemetry_ep-install/include -Qunused-arguments -fcolor-diagnostics -Wall -Wextra -Wdocumentation -DARROW_WARN_DOCUMENTATION -Wshorten-64-to-32 -Wno-missing-braces -Wno-unused-parameter -Wno-constant-logical-operand -Wno-return-stack-address -Wdate-time -Wno-unknown-warning-option -Wno-pass-failed -msse4.2 -fsanitize=address -DADDRESS_SANITIZER -fsanitize=undefined -fno-sanitize=alignment,vptr,function,float-divide-by-zero -fno-sanitize-recover=all -fsanitize-coverage=pc-table,inline-8bit-counters,edge,no-prune,trace-cmp,trace-div,trace-gep -fsanitize-blacklist=/arrow/cpp/build-support/sanitizer-disallowed-entries.txt -g -Werror -O0 -ggdb -g1 -fPIC -fsanitize-coverage=pc-table,inline-8bit-counters,edge,no-prune,trace-cmp,trace-div,trace-gep -std=c++17 -MD -MT src/arrow/acero/CMakeFiles/arrow_acero_objlib.dir/asof_join_node.cc.o -MF src/arrow/acero/CMakeFiles/arrow_acero_objlib.dir/asof_join_node.cc.o.d -o src/arrow/acero/CMakeFiles/arrow_acero_objlib.dir/asof_join_node.cc.o -c /arrow/cpp/src/arrow/acero/asof_join_node.cc
In file included from /arrow/cpp/src/arrow/acero/asof_join_node.cc:18:
In file included from /arrow/cpp/src/arrow/acero/asof_join_node.h:20:
In file included from /arrow/cpp/src/arrow/acero/options.h:21:
In file included from /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/memory:76:
/usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85:2: error: delete called on non-final 'arrow::acero::InputState' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
delete __ptr;
^
/usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:361:4: note: in instantiation of member function 'std::default_delete<arrow::acero::InputState>::operator()' requested here
get_deleter()(std::move(__ptr));
^
/arrow/cpp/src/arrow/acero/asof_join_node.cc:518:12: note: in instantiation of member function 'std::unique_ptr<arrow::acero::InputState>::~unique_ptr' requested here
return std::make_unique<InputState>(index, tolerance, must_hash, may_rehash,
^
1 error generated.
Let me pitch in. Disclaimer I am working @mroz45 on the same project using arrow.
-bool require_sequenced_output = false)
+bool require_sequenced_output = true)
Changing this default would be a breaking change and I'm not certain it's warranted.
Without it python tests are failing.
Should this only be
Ordering::Implicitifrequire_sequenced_outputis set?
It requires deeper and breaking changes which I think are necessary.
require_sequenced_output - means the source should give implicit ordering to produces batches ant therefore require_sequenced_output should be moved from ScanNodeOptions to ScanOptions to allow pass this option to ScannerBuilder which is used by python. But in fact I think there should be unified way to assert implicit ordering in all source nodes. Or maybe the need for implicit ordering should propagate from nodes that need ordering (asof_join, fetch etc.) down the line to source nodes (and maybe fail if the source node cannot provide it). There are few related issues:
no standardized sorting information
add ordering information to exec batches
Add AsofJoin Ordering Assertion
This issue gave me the idea that implicit ordering should be asserted by default. And additional source node/additional option to assert no ordering - to enable some performance optimization for "don't care" ordering cases. This would fix those issues: asof_join node not working propertly order is unstable Preserve order when writing dataset ordering is weird dataset not preserving ordering scan node not asserting ordering
We are willing to contribute to fix ordering issue within acero but we have next to none experience in python/Cython. Also the size of the issue seems to grow with every little change. I think the ordering in Acero is a little bigger topic to discuss.
The addition of ordering to python was more straightforward than I initially anticipated. Current version of pull request should fix at least some of above-mentioned python issues. Please review python changes carefully since it was our first contact with cython.
PS. Can we also Export SequencingQueue and SerialSequencingQueue This is useful in implementing custom nodes that require ordering. It seems reasonable to export it for custom nodes developers. Or maybe I should create separate pull request?
It looks like all checks passed. There are probably few issues (like this, this maybe ) that should be verified and closed by this pull request. I think some of this work also overlaps with GH-26818
:warning: GitHub issue #41706 has been automatically assigned in GitHub to PR creator.
@gitmodimo thanks for the help getting this sorted out!
@gitmodimo thanks for the help getting this sorted out!
My pleasure. Me and @mroz45 have few more enhancements for upstreaming. We will post them soon.
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 2a0f06c4a40ec64e1f4f046a7a8aa165329c66d6.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details.
@gitmodimo I have reworked some of this PR's changes in #44616 and added some of my code from #44470.
Please have a look.