GH-46375: [C++] Add adapters/orc directory to Meson
Rationale for this change
Apache Orc now has native support for Meson, so we can rather easily add support now in Arrow
What changes are included in this PR?
This integrates the adapters/orc directory into the Meson configuration
Are these changes tested?
Yes
Are there any user-facing changes?
No
- GitHub Issue: #46375
I don't believe the AppVeyor failure is related
I'm not familiar with Meson internals but it looks reasonable to me. cc @kou
Hmm looks like the build is failing due to some UB:
[ RUN ] TestORCWriterSingleArray.WriteStructOfStruct
../src/arrow/adapters/orc/util.cc:216:27: runtime error: signed integer overflow: -4702111234474983746 * 1000000000 cannot be represented in type 'long int'
I don't remember this from testing against a development version of orc, so its possibly a regression in the 2.2.0 release? Have to investigate more
@wgtmac may help us.
Let me try to reproduce it on my side. Is it a stable failure on your side? @WillAyd
It is reproducible. If you take this branch and do:
cd cpp
meson setup builddir -Dtests=enabled -Dorc=enabled -Db_sanitize=address,undefined
Then you can either run the test suite with
meson test arrow-orc-adapter-test --print-errorlogs
or run the individual test in builddir/src/arrow/adapters/orc/arrow-orc-adapter-test
However, I erred in my comment before that this did not appear previously. It looks like even back to orc commit 15d86dfcad6fc80ee24cf785419a820a55db103b where this PR started that UB can be detected; we may have just gotten lucky with CI not failing
It seems that I cannot execute -Db_sanitize=address,undefined on my MacOS M1.
Are you getting any error?
ERROR: Linker c++ does not support sanitizer arguments ['-fsanitize=address,undefined']
I got the above error. If I removed them, I see some compiling error from boost like below:
FAILED: [code=1] src/arrow/libarrow_testing.a.p/testing_process.cc.o
c++ -Isrc/arrow/libarrow_testing.a.p -Isrc/arrow -I../src/arrow -Isrc -I../src -I/opt/homebrew/include -I/opt/homebrew/Cellar/googletest/1.14.0/include -I//opt/homebrew/opt/llvm@18/include -fdiagnostics-color=always -D_LIBCPP_HARDENING_MODE=_LIBCPP_HARDENING_MODE_FAST -Wall -Winvalid-pch -Wextra -std=c++17 -O0 -g -Wno-unused-parameter -Wno-array-bounds -Wno-nonnull -DGTEST_HAS_PTHREAD=1 -DBOOST_FILESYSTEM_DYN_LINK=1 -DBOOST_ALL_NO_LIB -MD -MQ src/arrow/libarrow_testing.a.p/testing_process.cc.o -MF src/arrow/libarrow_testing.a.p/testing_process.cc.o.d -o src/arrow/libarrow_testing.a.p/testing_process.cc.o -c ../src/arrow/testing/process.cc
../src/arrow/testing/process.cc:268:12: error: no type named 'environment' in namespace 'boost::process'
268 | process::environment env_;
| ~~~~~~~~~^
../src/arrow/testing/process.cc:269:28: error: no member named 'child' in namespace 'boost::process'
269 | std::unique_ptr<process::child> process_;
| ~~~~~~~~~^
../src/arrow/testing/process.cc:270:28: error: no member named 'group' in namespace 'boost::process'
270 | std::unique_ptr<process::group> process_group_;
| ~~~~~~~~~^
../src/arrow/testing/process.cc:130:21: error: unexpected namespace name 'environment': expected expression
130 | env_ = process::environment(boost::this_process::environment());
| ^
Let me investigate it.
Ah that's interesting. Maybe just doing -b_sanitize=undefined will work? I suppose ASAN isn't required to debug this.
If not, will look further into what the macOS linker is doing
Maybe the sanitizers aren't supported on arm macs? This homebrew conversation is a few years old but I don't see a resolution:
https://github.com/orgs/Homebrew/discussions/3384
I have successfully reproduced it without meson. Simply run cmake .. -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON -DARROW_ORC=ON -DARROW_USE_UBSAN=ON
[----------] 3 tests from TestORCWriterNoConversion
[ RUN ] TestORCWriterNoConversion.writeNoNulls
[ OK ] TestORCWriterNoConversion.writeNoNulls (87 ms)
[ RUN ] TestORCWriterNoConversion.writeMixed
/Users/gangwu/Projects/arrow/cpp/src/arrow/adapters/orc/util.cc:216:27: runtime error: signed integer overflow: 6949478456235920402 * 1000000000 cannot be represented in type 'int64_t' (aka 'long long')
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /Users/gangwu/Projects/arrow/cpp/src/arrow/adapters/orc/util.cc:216:27
diff --git a/cpp/src/arrow/adapters/orc/util.cc b/cpp/src/arrow/adapters/orc/util.cc
index 6974faae59..68d062f125 100644
--- a/cpp/src/arrow/adapters/orc/util.cc
+++ b/cpp/src/arrow/adapters/orc/util.cc
@@ -212,7 +212,10 @@ Status AppendTimestampBatch(liborc::ColumnVectorBatch* column_vector_batch,
const int64_t* seconds = batch->data.data() + offset;
const int64_t* nanos = batch->nanoseconds.data() + offset;
- auto transform_timestamp = [seconds, nanos](int64_t index) {
+ auto transform_timestamp = [seconds, nanos, valid_bytes](int64_t index) -> int64_t {
+ if (valid_bytes && !valid_bytes[index]) {
+ return 0;
+ }
return seconds[index] * kOneSecondNanos + nanos[index];
};
Could you please apply this patch?
Nice! That definitely fixes it - appreciate the help!
Looks like we are down to an ASAN violation on one of the tests now - @wgtmac any ideas on this one?
[ RUN ] TestAdapterRead.ReadIntAndStringFileMultipleStripes
stderr:
AddressSanitizer:DEADLYSIGNAL
=================================================================
==4590==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000010 (pc 0x7f091a443598 bp 0x50f0000015d0 sp 0x7ffc06d50860 T0)
==4590==The signal is caused by a READ memory access.
==4590==Hint: address points to the zero page.
#0 0x7f091a443598 in orc::TypeImpl::getTypeByColumnId(unsigned long) const (/opt/conda/envs/arrow/lib/liborc.so+0x12d598)
#1 0x7f0929778245 in arrow::adapters::orc::GetFieldMetadata(orc::Type const*) ../../arrow/cpp/src/arrow/adapters/orc/util.cc:1216
#2 0x7f09297ae206 in arrow::adapters::orc::GetArrowField(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, orc::Type const*, bool) ../../arrow/cpp/src/arrow/adapters/orc/util.cc:1224
#3 0x7f09296ef185 in arrow::adapters::orc::ORCFileReader::Impl::GetArrowSchema(orc::Type const&) ../../arrow/cpp/src/arrow/adapters/orc/adapter.cc:348
#4 0x7f09296f1bce in arrow::adapters::orc::ORCFileReader::Impl::ReadSchema(orc::RowReaderOptions const&) ../../arrow/cpp/src/arrow/adapters/orc/adapter.cc:325
#5 0x7f092970991e in arrow::adapters::orc::ORCFileReader::Impl::NextStripeReader(long, std::vector<int, std::allocator<int> > const&) ../../arrow/cpp/src/arrow/adapters/orc/adapter.cc:516
#6 0x7f092970c0ff in arrow::adapters::orc::ORCFileReader::Impl::NextStripeReader(long) ../../arrow/cpp/src/arrow/adapters/orc/adapter.cc:548
#7 0x7f092970c0ff in arrow::adapters::orc::ORCFileReader::NextStripeReader(long) ../../arrow/cpp/src/arrow/adapters/orc/adapter.cc:620
#8 0x562689775563 in arrow::TestAdapterRead_ReadIntAndStringFileMultipleStripes_Test::TestBody() ../../arrow/cpp/src/arrow/adapters/orc/adapter_test.cc:397
#9 0x7f091a2ed1ad in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/opt/conda/envs/arrow/lib/libgtest.so.1.17.0+0x4f1ad)
#10 0x7f091a2ed480 in testing::Test::Run() (/opt/conda/envs/arrow/lib/libgtest.so.1.17.0+0x4f480)
#11 0x7f091a2ed826 in testing::TestInfo::Run() (/opt/conda/envs/arrow/lib/libgtest.so.1.17.0+0x4f826)
#12 0x7f091a2edc65 in testing::TestSuite::Run() (/opt/conda/envs/arrow/lib/libgtest.so.1.17.0+0x4fc65)
#13 0x7f091a2fc292 in testing::internal::UnitTestImpl::RunAllTests() (/opt/conda/envs/arrow/lib/libgtest.so.1.17.0+0x5e292)
#14 0x7f091a2ede2c in testing::UnitTest::Run() (/opt/conda/envs/arrow/lib/libgtest.so.1.17.0+0x4fe2c)
#15 0x7f091a29807e in main (/opt/conda/envs/arrow/lib/libgtest_main.so.1.17.0+0x107e)
#16 0x7f09197a0d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
#17 0x7f09197a0e3f in __libc_start_main_impl ../csu/libc-start.c:392
#18 0x562689621164 in _start (/build/cpp/src/arrow/adapters/orc/arrow-orc-adapter-test+0x662164)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/opt/conda/envs/arrow/lib/liborc.so+0x12d598) in orc::TypeImpl::getTypeByColumnId(unsigned long) const
https://github.com/apache/arrow/actions/runs/17327120602/job/49193954155?pr=46906#step:6:1855
That's weird. The GetFieldMetadata function does not call orc::TypeImpl::getTypeByColumnId internally. Even the ORC library does not call it internally.
Result<std::shared_ptr<const KeyValueMetadata>> GetFieldMetadata(
const liborc::Type* type) {
if (type == nullptr) {
return nullptr;
}
const auto keys = type->getAttributeKeys();
if (keys.empty()) {
return nullptr;
}
auto metadata = std::make_shared<KeyValueMetadata>();
for (const auto& key : keys) {
metadata->Append(key, type->getAttributeValue(key));
}
return std::const_pointer_cast<const KeyValueMetadata>(metadata);
}
I tried to reproduce it on my side with ASAN enabled. It didn't fail but got the suspicious output:
[ RUN ] TestAdapterRead.ReadIntAndStringFileMultipleStripes
mimalloc: warning: unable to allocate aligned OS memory directly, fall back to over-allocation (size: 0x40000000 bytes, address: 0x107000020000, alignment: 0x2000000, commit: 1)
Hmm well that is...strange. I also had a hard time reproducing locally, although I was using the wrap file and not a conda-installed orc. Will give that a go and report back anything I find
What's interesting to me is that in the logs Meson is not detecting the orc library and chooses to build it as a subproject. However, the ASAN traceback is locating an Orc installation at /opt/conda/envs/arrow/lib/liborc.so
My guess is the function name is incorrect and that something is getting mangled with the different library versions (the conda-installed one is 1.9.0)
Perhaps a fix is required to use the same orc version, either built or installed.
I believe the core of the issue is that orc does not distribute pkgconfig information like most (possibly all) of the other conda-installed libraries, which place their respective files in /opt/conda/envs/arrow/lib/pkgconfig. If that could be detected, then there would be no need to even build a local copy
Unfortunately I'm not familiar with pkgconfig so it may still be the case in the near future. Is there any workaround?
Meson can use use not only pkg-config package (orc.pc) but also CMake package (orc/orcConfig.cmake). ORC provides its CMake package. Why is pkg-config package required to use system ORC?
The CMake config could be used as well but it is also not part of the conda installation
Can we improve conda package to include the CMake config?
I found lib/cmake/orc/orcConfig.cmake in linux-64/orc-2.2.0-h1bc01a4_0.conda listed at https://anaconda.org/conda-forge/orc/files .
Ah nice find! Unfortunately, the conda installation seems to be pinning orc to the 1.9.0 installation (I noticed this also in https://github.com/apache/arrow/pull/47455#issuecomment-3249566332).
The 1.9.0 conda package doesn't seem to distribute that same file, so we might have to dissect what is causing that version pin
OK so the issue is definitely the fact that there are multiple versions of Orc installed, and the conda-forge installed version of 1.9.0 is too old to distribute any cmake/pkgconfig information that makes it detectable.
Looking at ci/conda_env_cpp.txt, it looks like the grpc_cpp package is what is responsible for indirectly pinning orc at the old 1.9.0 version. Unfortunately it doesn't look like grpc_cpp has had an updated conda package in 2.5 years, although the upstream library does distribute newer versions.
There are a lot of potential options to fix this, but I think the cleanest (and easiest?) would be to get an updated conda package for grpc_cpp somehow
It seems that grpc_cpp was split to https://anaconda.org/conda-forge/libgrpc/ , https://anaconda.org/conda-forge/grpcio and https://anaconda.org/conda-forge/grpcio-tools .
See also: https://github.com/conda-forge/grpc-cpp-feedstock/blob/main/recipe/meta.yaml