arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-46375: [C++] Add adapters/orc directory to Meson

Open WillAyd opened this issue 6 months ago • 1 comments

Rationale for this change

Apache Orc now has native support for Meson, so we can rather easily add support now in Arrow

What changes are included in this PR?

This integrates the adapters/orc directory into the Meson configuration

Are these changes tested?

Yes

Are there any user-facing changes?

No

  • GitHub Issue: #46375

WillAyd avatar Jun 25 '25 13:06 WillAyd

I don't believe the AppVeyor failure is related

WillAyd avatar Jun 25 '25 21:06 WillAyd

I'm not familiar with Meson internals but it looks reasonable to me. cc @kou

wgtmac avatar Jun 26 '25 15:06 wgtmac

Hmm looks like the build is failing due to some UB:

[ RUN      ] TestORCWriterSingleArray.WriteStructOfStruct
../src/arrow/adapters/orc/util.cc:216:27: runtime error: signed integer overflow: -4702111234474983746 * 1000000000 cannot be represented in type 'long int'

I don't remember this from testing against a development version of orc, so its possibly a regression in the 2.2.0 release? Have to investigate more

WillAyd avatar Aug 25 '25 22:08 WillAyd

@wgtmac may help us.

kou avatar Aug 26 '25 00:08 kou

Let me try to reproduce it on my side. Is it a stable failure on your side? @WillAyd

wgtmac avatar Aug 26 '25 15:08 wgtmac

It is reproducible. If you take this branch and do:

cd cpp
meson setup builddir -Dtests=enabled -Dorc=enabled -Db_sanitize=address,undefined

Then you can either run the test suite with

meson test arrow-orc-adapter-test --print-errorlogs

or run the individual test in builddir/src/arrow/adapters/orc/arrow-orc-adapter-test

However, I erred in my comment before that this did not appear previously. It looks like even back to orc commit 15d86dfcad6fc80ee24cf785419a820a55db103b where this PR started that UB can be detected; we may have just gotten lucky with CI not failing

WillAyd avatar Aug 26 '25 16:08 WillAyd

It seems that I cannot execute -Db_sanitize=address,undefined on my MacOS M1.

wgtmac avatar Aug 27 '25 06:08 wgtmac

Are you getting any error?

WillAyd avatar Aug 27 '25 12:08 WillAyd

ERROR: Linker c++ does not support sanitizer arguments ['-fsanitize=address,undefined']

I got the above error. If I removed them, I see some compiling error from boost like below:

FAILED: [code=1] src/arrow/libarrow_testing.a.p/testing_process.cc.o
c++ -Isrc/arrow/libarrow_testing.a.p -Isrc/arrow -I../src/arrow -Isrc -I../src -I/opt/homebrew/include -I/opt/homebrew/Cellar/googletest/1.14.0/include -I//opt/homebrew/opt/llvm@18/include -fdiagnostics-color=always -D_LIBCPP_HARDENING_MODE=_LIBCPP_HARDENING_MODE_FAST -Wall -Winvalid-pch -Wextra -std=c++17 -O0 -g -Wno-unused-parameter -Wno-array-bounds -Wno-nonnull -DGTEST_HAS_PTHREAD=1 -DBOOST_FILESYSTEM_DYN_LINK=1 -DBOOST_ALL_NO_LIB -MD -MQ src/arrow/libarrow_testing.a.p/testing_process.cc.o -MF src/arrow/libarrow_testing.a.p/testing_process.cc.o.d -o src/arrow/libarrow_testing.a.p/testing_process.cc.o -c ../src/arrow/testing/process.cc
../src/arrow/testing/process.cc:268:12: error: no type named 'environment' in namespace 'boost::process'
  268 |   process::environment env_;
      |   ~~~~~~~~~^
../src/arrow/testing/process.cc:269:28: error: no member named 'child' in namespace 'boost::process'
  269 |   std::unique_ptr<process::child> process_;
      |                   ~~~~~~~~~^
../src/arrow/testing/process.cc:270:28: error: no member named 'group' in namespace 'boost::process'
  270 |   std::unique_ptr<process::group> process_group_;
      |                   ~~~~~~~~~^
../src/arrow/testing/process.cc:130:21: error: unexpected namespace name 'environment': expected expression
  130 |     env_ = process::environment(boost::this_process::environment());
      |                     ^

Let me investigate it.

wgtmac avatar Aug 27 '25 13:08 wgtmac

Ah that's interesting. Maybe just doing -b_sanitize=undefined will work? I suppose ASAN isn't required to debug this.

If not, will look further into what the macOS linker is doing

WillAyd avatar Aug 27 '25 13:08 WillAyd

Maybe the sanitizers aren't supported on arm macs? This homebrew conversation is a few years old but I don't see a resolution:

https://github.com/orgs/Homebrew/discussions/3384

WillAyd avatar Aug 27 '25 13:08 WillAyd

I have successfully reproduced it without meson. Simply run cmake .. -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON -DARROW_ORC=ON -DARROW_USE_UBSAN=ON

[----------] 3 tests from TestORCWriterNoConversion
[ RUN      ] TestORCWriterNoConversion.writeNoNulls
[       OK ] TestORCWriterNoConversion.writeNoNulls (87 ms)
[ RUN      ] TestORCWriterNoConversion.writeMixed
/Users/gangwu/Projects/arrow/cpp/src/arrow/adapters/orc/util.cc:216:27: runtime error: signed integer overflow: 6949478456235920402 * 1000000000 cannot be represented in type 'int64_t' (aka 'long long')
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /Users/gangwu/Projects/arrow/cpp/src/arrow/adapters/orc/util.cc:216:27

wgtmac avatar Aug 27 '25 13:08 wgtmac

diff --git a/cpp/src/arrow/adapters/orc/util.cc b/cpp/src/arrow/adapters/orc/util.cc
index 6974faae59..68d062f125 100644
--- a/cpp/src/arrow/adapters/orc/util.cc
+++ b/cpp/src/arrow/adapters/orc/util.cc
@@ -212,7 +212,10 @@ Status AppendTimestampBatch(liborc::ColumnVectorBatch* column_vector_batch,
   const int64_t* seconds = batch->data.data() + offset;
   const int64_t* nanos = batch->nanoseconds.data() + offset;

-  auto transform_timestamp = [seconds, nanos](int64_t index) {
+  auto transform_timestamp = [seconds, nanos, valid_bytes](int64_t index) -> int64_t {
+    if (valid_bytes && !valid_bytes[index]) {
+      return 0;
+    }
     return seconds[index] * kOneSecondNanos + nanos[index];
   };

Could you please apply this patch?

wgtmac avatar Aug 27 '25 14:08 wgtmac

Nice! That definitely fixes it - appreciate the help!

WillAyd avatar Aug 27 '25 15:08 WillAyd

Looks like we are down to an ASAN violation on one of the tests now - @wgtmac any ideas on this one?

[ RUN      ] TestAdapterRead.ReadIntAndStringFileMultipleStripes
stderr:
AddressSanitizer:DEADLYSIGNAL
=================================================================
==4590==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000010 (pc 0x7f091a443598 bp 0x50f0000015d0 sp 0x7ffc06d50860 T0)
==4590==The signal is caused by a READ memory access.
==4590==Hint: address points to the zero page.
    #0 0x7f091a443598 in orc::TypeImpl::getTypeByColumnId(unsigned long) const (/opt/conda/envs/arrow/lib/liborc.so+0x12d598)
    #1 0x7f0929778245 in arrow::adapters::orc::GetFieldMetadata(orc::Type const*) ../../arrow/cpp/src/arrow/adapters/orc/util.cc:1216
    #2 0x7f09297ae206 in arrow::adapters::orc::GetArrowField(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, orc::Type const*, bool) ../../arrow/cpp/src/arrow/adapters/orc/util.cc:1224
    #3 0x7f09296ef185 in arrow::adapters::orc::ORCFileReader::Impl::GetArrowSchema(orc::Type const&) ../../arrow/cpp/src/arrow/adapters/orc/adapter.cc:348
    #4 0x7f09296f1bce in arrow::adapters::orc::ORCFileReader::Impl::ReadSchema(orc::RowReaderOptions const&) ../../arrow/cpp/src/arrow/adapters/orc/adapter.cc:325
    #5 0x7f092970991e in arrow::adapters::orc::ORCFileReader::Impl::NextStripeReader(long, std::vector<int, std::allocator<int> > const&) ../../arrow/cpp/src/arrow/adapters/orc/adapter.cc:516
    #6 0x7f092970c0ff in arrow::adapters::orc::ORCFileReader::Impl::NextStripeReader(long) ../../arrow/cpp/src/arrow/adapters/orc/adapter.cc:548
    #7 0x7f092970c0ff in arrow::adapters::orc::ORCFileReader::NextStripeReader(long) ../../arrow/cpp/src/arrow/adapters/orc/adapter.cc:620
    #8 0x562689775563 in arrow::TestAdapterRead_ReadIntAndStringFileMultipleStripes_Test::TestBody() ../../arrow/cpp/src/arrow/adapters/orc/adapter_test.cc:397
    #9 0x7f091a2ed1ad in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/opt/conda/envs/arrow/lib/libgtest.so.1.17.0+0x4f1ad)
    #10 0x7f091a2ed480 in testing::Test::Run() (/opt/conda/envs/arrow/lib/libgtest.so.1.17.0+0x4f480)
    #11 0x7f091a2ed826 in testing::TestInfo::Run() (/opt/conda/envs/arrow/lib/libgtest.so.1.17.0+0x4f826)
    #12 0x7f091a2edc65 in testing::TestSuite::Run() (/opt/conda/envs/arrow/lib/libgtest.so.1.17.0+0x4fc65)
    #13 0x7f091a2fc292 in testing::internal::UnitTestImpl::RunAllTests() (/opt/conda/envs/arrow/lib/libgtest.so.1.17.0+0x5e292)
    #14 0x7f091a2ede2c in testing::UnitTest::Run() (/opt/conda/envs/arrow/lib/libgtest.so.1.17.0+0x4fe2c)
    #15 0x7f091a29807e in main (/opt/conda/envs/arrow/lib/libgtest_main.so.1.17.0+0x107e)
    #16 0x7f09197a0d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #17 0x7f09197a0e3f in __libc_start_main_impl ../csu/libc-start.c:392
    #18 0x562689621164 in _start (/build/cpp/src/arrow/adapters/orc/arrow-orc-adapter-test+0x662164)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/opt/conda/envs/arrow/lib/liborc.so+0x12d598) in orc::TypeImpl::getTypeByColumnId(unsigned long) const

https://github.com/apache/arrow/actions/runs/17327120602/job/49193954155?pr=46906#step:6:1855

WillAyd avatar Aug 29 '25 17:08 WillAyd

That's weird. The GetFieldMetadata function does not call orc::TypeImpl::getTypeByColumnId internally. Even the ORC library does not call it internally.

Result<std::shared_ptr<const KeyValueMetadata>> GetFieldMetadata(
    const liborc::Type* type) {
  if (type == nullptr) {
    return nullptr;
  }
  const auto keys = type->getAttributeKeys();
  if (keys.empty()) {
    return nullptr;
  }
  auto metadata = std::make_shared<KeyValueMetadata>();
  for (const auto& key : keys) {
    metadata->Append(key, type->getAttributeValue(key));
  }
  return std::const_pointer_cast<const KeyValueMetadata>(metadata);
}

wgtmac avatar Aug 30 '25 07:08 wgtmac

I tried to reproduce it on my side with ASAN enabled. It didn't fail but got the suspicious output:

[ RUN      ] TestAdapterRead.ReadIntAndStringFileMultipleStripes
mimalloc: warning: unable to allocate aligned OS memory directly, fall back to over-allocation (size: 0x40000000 bytes, address: 0x107000020000, alignment: 0x2000000, commit: 1)

wgtmac avatar Aug 30 '25 07:08 wgtmac

Hmm well that is...strange. I also had a hard time reproducing locally, although I was using the wrap file and not a conda-installed orc. Will give that a go and report back anything I find

WillAyd avatar Sep 02 '25 14:09 WillAyd

What's interesting to me is that in the logs Meson is not detecting the orc library and chooses to build it as a subproject. However, the ASAN traceback is locating an Orc installation at /opt/conda/envs/arrow/lib/liborc.so

My guess is the function name is incorrect and that something is getting mangled with the different library versions (the conda-installed one is 1.9.0)

WillAyd avatar Sep 03 '25 15:09 WillAyd

Perhaps a fix is required to use the same orc version, either built or installed.

wgtmac avatar Sep 03 '25 15:09 wgtmac

I believe the core of the issue is that orc does not distribute pkgconfig information like most (possibly all) of the other conda-installed libraries, which place their respective files in /opt/conda/envs/arrow/lib/pkgconfig. If that could be detected, then there would be no need to even build a local copy

WillAyd avatar Sep 03 '25 15:09 WillAyd

Unfortunately I'm not familiar with pkgconfig so it may still be the case in the near future. Is there any workaround?

wgtmac avatar Sep 04 '25 01:09 wgtmac

Meson can use use not only pkg-config package (orc.pc) but also CMake package (orc/orcConfig.cmake). ORC provides its CMake package. Why is pkg-config package required to use system ORC?

kou avatar Sep 04 '25 01:09 kou

The CMake config could be used as well but it is also not part of the conda installation

WillAyd avatar Sep 04 '25 02:09 WillAyd

Can we improve conda package to include the CMake config?

kou avatar Sep 04 '25 02:09 kou

I found lib/cmake/orc/orcConfig.cmake in linux-64/orc-2.2.0-h1bc01a4_0.conda listed at https://anaconda.org/conda-forge/orc/files .

kou avatar Sep 04 '25 02:09 kou

Ah nice find! Unfortunately, the conda installation seems to be pinning orc to the 1.9.0 installation (I noticed this also in https://github.com/apache/arrow/pull/47455#issuecomment-3249566332).

The 1.9.0 conda package doesn't seem to distribute that same file, so we might have to dissect what is causing that version pin

WillAyd avatar Sep 04 '25 03:09 WillAyd

OK so the issue is definitely the fact that there are multiple versions of Orc installed, and the conda-forge installed version of 1.9.0 is too old to distribute any cmake/pkgconfig information that makes it detectable.

Looking at ci/conda_env_cpp.txt, it looks like the grpc_cpp package is what is responsible for indirectly pinning orc at the old 1.9.0 version. Unfortunately it doesn't look like grpc_cpp has had an updated conda package in 2.5 years, although the upstream library does distribute newer versions.

There are a lot of potential options to fix this, but I think the cleanest (and easiest?) would be to get an updated conda package for grpc_cpp somehow

WillAyd avatar Sep 12 '25 20:09 WillAyd

It seems that grpc_cpp was split to https://anaconda.org/conda-forge/libgrpc/ , https://anaconda.org/conda-forge/grpcio and https://anaconda.org/conda-forge/grpcio-tools .

kou avatar Sep 12 '25 21:09 kou

See also: https://github.com/conda-forge/grpc-cpp-feedstock/blob/main/recipe/meta.yaml

kou avatar Sep 12 '25 21:09 kou