pybind11_abseil icon indicating copy to clipboard operation
pybind11_abseil copied to clipboard

ImportError: generic_type: cannot initialize type "StatusCode": an object with that name is already defined

Open ethanluoyc opened this issue 1 year ago • 25 comments

Hi,

I am trying to figure out some issues with using the pybind11_abseil.status usage. TensorFlow as well as some other Google projects are using the pybind11_abseil modules. I found that as of TensorFlow 2.14.0 (2.13.1 is fine), it looks like that using TensorFlow in conjunction with packages that use pybind11_abseil will fail with the error

ImportError: generic_type: cannot initialize type "StatusCode": an object with that name is already defined

This happens for example in https://github.com/google-deepmind/launchpad/issues/44 and https://github.com/tensorflow/tflite-support/issues/954. The error happens when a pybind11 extension that uses py::google::ImportStatusModule() is called.

For example

import tensorflow
import launchpad # Fails when importing an extension that uses ImportStatusModule.

Reversing the order we will see the same issues this time coming from TensorFlow trying to import a module that uses ImportStatusModule.

I noticed that there are some changes to the pybind11_abseil version used in TensorFlow 2.14.0, which incoporates some changes around the import_status_module, but I don't have a really good clue about what that change entails, it would be quite useful if you can provide some help.

Thanks!

ethanluoyc avatar Oct 20 '23 01:10 ethanluoyc

This error

ImportError: generic_type: cannot initialize type "StatusCode": an object with that name is already defined

is raised by pybind11 when enum_<absl::StatusCode>() is called a second time. I looked around for 10 minutes but still have no clue how that could happen.

We have this in pybind11_abseil (RegisterStatusBindings()):

https://github.com/pybind/pybind11_abseil/blob/f6ff92551862273d368a758b5b75703dde1f0b9d/pybind11_abseil/register_status_bindings.cc#L161

That should only get called the pybind11_abseil.status module is imported, and the standard Python import-only-once mechanisms are expected to ensure RegisterStatusBindings() is only called once.

I'd use a debugger to find out from where this line of code is reached:

  • https://github.com/pybind/pybind11/blob/74439a64a22850f16323f1e1e09ace2a909b0b61/include/pybind11/pybind11.h#L1350

Swap the import order to get a backtrace for both locations.

If using a debugger is impractical, I often simply insert long *BAD = nullptr; *BAD = 101; to trigger a segfault and then just look at the backtrace (assuming the toolchain generates a backtrace by default).

rwgk avatar Oct 20 '23 04:10 rwgk

I manage to get some stacktrace for the two places. I modifed the launchpad pybind dependencies so that it aligns with [email protected]

pybind11_abseil 2c4932ed6f6204f1656e245838f4f5eae69d2e29 this version is before the bypass_regular_import is deprecated and defaults to true, I tried to set this to false and go through the import path but not luck.

pybind11 2.10.4

Then I set a breakpoint on initialize.

Importing TensorFlow first gets us

import tensorflow
from courier.python import client  # pytype: disable=import-error
#0  pybind11::pybind11_fail (reason=<error: Cannot access memory at address 0x626f206e61203a22>)
    at external/pybind11/include/pybind11/detail/../detail/common.h:1008
#1  0x00007fff5961a892 in pybind11::detail::generic_type::initialize (this=0x7fffffffb2d0, rec=...)
    at external/pybind11/include/pybind11/pybind11.h:1301
#2  0x00007fff59eaace9 in pybind11::class_<absl::lts_20230125::StatusCode>::class_<>(pybind11::handle, char const*) (
    this=0x7fffffffb2d0, scope=..., name=0x7fff59fccff0 "StatusCode")
    at external/pybind11/include/pybind11/pybind11.h:1546
#3  0x00007fff59ea44fe in pybind11::enum_<absl::lts_20230125::StatusCode>::enum_<>(pybind11::handle const&, char const*) (this=0x7fffffffb2d0, scope=..., name=0x7fff59fccff0 "StatusCode")
    at external/pybind11/include/pybind11/pybind11.h:2178
#4  0x00007fff59e9a741 in pybind11::google::internal::RegisterStatusBindings (m=...)
    at external/pybind11_abseil/pybind11_abseil/register_status_bindings.cc:102
#5  0x00007fff59e99a75 in pybind11::google::ImportStatusModule (bypass_regular_import=true)
    at external/pybind11_abseil/pybind11_abseil/import_status_module.cc:20
#6  0x00007fff59609fe8 in courier::(anonymous namespace)::pybind11_init_py_client (m=...)
    at courier/python/py_client.cc:113
#7  0x00007fff59609eb5 in courier::(anonymous namespace)::PyInit_py_client () at courier/python/py_client.cc:111
#8  0x00007ffff7def7c7 in _PyImport_LoadDynamicModuleWithSpec (spec=spec@entry=0x7fff5a40a5c0, fp=fp@entry=0x0)
    at ./Python/importdl.c:167
#9  0x00007ffff7debf5d in _imp_create_dynamic_impl (module=<optimised out>, file=<optimised out>,

Import Launchpad first

from courier.python import client  # pytype: disable=import-error
import tensorflow
0x00007fff620d0734 in pybind11::pybind11_fail(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
(gdb) backtrace
#0  0x00007fff620d0734 in pybind11::pybind11_fail(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
#1  0x00007fff620de4a7 in pybind11::detail::generic_type::initialize(pybind11::detail::type_record const&) ()
   from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
#2  0x00007fff620f06ac in pybind11::class_<absl::lts_20230125::StatusCode>::class_<>(pybind11::handle, char const*)
    () from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
#3  0x00007fff620eed38 in pybind11::enum_<absl::lts_20230125::StatusCode>::enum_<>(pybind11::handle const&, char const*) () from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
#4  0x00007fff620edcca in pybind11::google::internal::RegisterStatusBindings(pybind11::module_) ()
   from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
#5  0x00007fff620edba9 in pybind11::google::ImportStatusModule(bool) ()
   from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
#6  0x00007fff620ccba9 in tensorflow::pybind11_init__tf_stack(pybind11::module_&) ()
   from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
#7  0x00007fff620cc34f in PyInit__tf_stack ()
   from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
#8  0x00007ffff7def7c7 in _PyImport_LoadDynamicModuleWithSpec (spec=spec@entry=0x7fff624f8790, fp=fp@entry=0x0)
    at ./Python/importdl.c:167
#9  0x00007ffff7debf5d in _imp_create_dynamic_impl (module=<optimised out>, file=<optimised out>,
    spec=0x7fff624f8790) at Python/import.c:2049

Some other observations

  1. It does look like that with TensorFlow imported first, On the launchpad side I got internal::IsStatusModuleImported() == false so it's not picking up that the status module is imported and went ahead to register the binding -> error.

ethanluoyc avatar Oct 20 '23 10:10 ethanluoyc

On the launchpad side I got internal::IsStatusModuleImported() == false

Could it be that std::type_index(typeid(absl::Status)) is different?

Could you try to print that out?

std::cout << std::type_index(typeid(absl::Status)).hash_code() << std::endl;

Reason for asking the question:

https://github.com/pybind/pybind11/blob/7969049de4c11f1d4955ababd97f340d82d2bf5a/include/pybind11/detail/type_caster_base.h#L216-L232

You could also add prints there.

rwgk avatar Oct 20 '23 19:10 rwgk

inline bool IsStatusModuleImported() {
  std::cout << "Type index" << std::type_index(typeid(absl::Status)).hash_code() << std::endl;
  std::cout << "Local type info" << detail::get_local_type_info(typeid(absl::Status)) << std::endl;
  std::cout << "Global type info "<< detail::get_global_type_info(typeid(absl::Status)) << std::endl;
  return detail::get_type_info(typeid(absl::Status));
}

Type index 11412274694001596587
Local type info0
Global type info 0

I won't be able to compile TF from source unfortunately.. I made a repro of the issue above here https://github.com/ethanluoyc/pybind11_abseil_gh12_repro.

If possible, would you mind taking a look? I tried to extract the issue from the original repo as much as possible.

ethanluoyc avatar Oct 20 '23 22:10 ethanluoyc

I won't be able to compile TF from source unfortunately.

pip install -r requirements.txt

Oh boy ... binary incompatibility I'd guess then, but how?

I'm really swamped, and digging into a situation that brings in prebuilt binaries could go anywhere.

Could you please try more at your end?

What puzzles me: if there is indeed a binary incompatibility, I'd expect the TF modules and launchpad to be isolated from each other more or less completely. A key piece of the mechanism behind that is here:

https://github.com/pybind/pybind11/blob/7969049de4c11f1d4955ababd97f340d82d2bf5a/include/pybind11/detail/internals.h#L314-L322

Could you please try to build launchpad with -DPYBIND11_INTERNALS_KIND=SOMETHINGRANDOM1697842496?

Then that will live in its own universe for sure.

It might break interop you need, but try to ignore that. We only want to know: do the two imports work in any order?

Also, I'd definitely insert prints in

https://github.com/pybind/pybind11/blob/7969049de4c11f1d4955ababd97f340d82d2bf5a/include/pybind11/detail/type_caster_base.h#L216-L232

(same link as before) to get a better idea what's happening. I'd try to print out the typeid name somehow.

rwgk avatar Oct 20 '23 23:10 rwgk

I will give it a try. Thanks!

Maybe another data point on this. I just tried using latest version of pybind11 + absl combo found from this repo and that actually makes the import error go away (at least if I import tensorflow first). I couldn't get the reverse to work out as it complains about not finding pybind11_abseil module, but I suppose that is because I didn't put the status.so in my data dependency.

ethanluoyc avatar Oct 20 '23 23:10 ethanluoyc

Sorry my C++ and Bazel is bad, but is it not calling with

bazel run -c dbg --copt '-DPYBIND11_INTERNALS_KIND=RAAANDOM' //courier:debug_client_test
external/pybind11/include/pybind11/detail/../detail/type_caster_base.h: In member function 'bool pybind11::detail::type_caster_generic::try_load_foreign_module_local(pybind11::handle)':
<command-line>: error: expected ',' or ';' before 'RAAANDOM'
external/pybind11/include/pybind11/pybind11.h: In member function 'void pybind11::detail::generic_type::initialize(const pybind11::detail::type_record&)':
<command-line>: error: expected ')' before 'RAAANDOM'
In file included from external/pybind11_abseil/pybind11_abseil/import_status_module.h:4,
                 from external/pybind11_abseil/pybind11_abseil/import_status_module.cc:1:
external/pybind11/include/pybind11/pybind11.h:1353:20: note: to match this '('
 1353 |             setattr(m_ptr, PYBIND11_MODULE_LOCAL_ID, capsule(tinfo));
      |                    ^

Update

bazel run -c dbg --cxxopt '-DPYBIND11_INTERNALS_KIND="SOMETHINGRANDOM1697842496"' //courier:debug_client_test
INFO: Analyzed target //courier:debug_client_test (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //courier:debug_client_test up-to-date:
  bazel-bin/courier/debug_client_test
INFO: Elapsed time: 0.052s, Critical Path: 0.00s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-bin/courier/debug_client_test
2023-10-21 00:26:16.435708: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/home/yicheng/.cache/bazel/_bazel_yicheng/a8240993b3009bd370b3605a74432aad/execroot/debug-pybind11/bazel-out/k8-dbg/bin/courier/debug_client_test.runfiles/debug-pybind11/courier/debug_client_test.py", line 2, in <module>
    from courier import debug_client
ImportError: generic_type: cannot initialize type "StatusCode": an object with that name is already defined

ethanluoyc avatar Oct 20 '23 23:10 ethanluoyc

Got the typeid name, it's N4absl12lts_202301256StatusE

ethanluoyc avatar Oct 20 '23 23:10 ethanluoyc

Ouch.

This is beginning to look very tricky.

Maybe it's something simple, but I don't see it.

I'm beginning to wonder: could there be two pybind11_abseil.Status modules? With incompatible PYBIND11_INTERNALS_ID?

That's just a wild guess, can you look? I think look for Status.so, how many do you find?

Alternatively, import only tensorflow, then look in sys.modules, import only launchpad, again look in sys.modules. Something like sys.modules["pybind11_abseil.Status"].__file__.

rwgk avatar Oct 20 '23 23:10 rwgk

Got the typeid name, it's N4absl12lts_202301256StatusE

Oh, that's another potential issue: maybe the two packages use different abseil-cpp versions?

rwgk avatar Oct 20 '23 23:10 rwgk

I got the absl-cpp version for TF 2.14.0 from https://github.com/tensorflow/tensorflow/blob/v2.14.0/third_party/absl/workspace.bzl

I don't see any pybind11_abseil.status module in either TF or launchpad. Looking into my sitepackages I don't see those either. I think both launchpad and TF goes through the old bypass_regular_import=true path, but I don't know exactly what that means.

ethanluoyc avatar Oct 20 '23 23:10 ethanluoyc

I think both launchpad and TF goes through the old bypass_regular_import=true path, but I don't know exactly what that means.

https://github.com/pybind/pybind11_abseil/blob/4b883e48ae749ff984c220484d54fdeb0cb4302c/pybind11_abseil/import_status_module.cc#L13-L16

I disabled that a more than year ago, in September 2022:

https://github.com/pybind/pybind11_abseil/commit/c18f6323c22edd4f95bef62623c77973f8cfd3a1

What pybind11_abseil sources where used when TF was built?

What pybind11_abseil sources does launchpad use?

I never really understood the bypass mechanism, therefore cannot really tell what could go wrong.

I was glad to get rid of it. I thought.

rwgk avatar Oct 21 '23 00:10 rwgk

What pybind11_abseil sources where used when TF was built?

2c4932ed6f6204f1656e245838f4f5eae69d2e29 (ref https://github.com/tensorflow/tensorflow/blob/v2.14.0/third_party/pybind11_abseil/workspace.bzl#L10-L13)

What pybind11_abseil sources does launchpad use?

1bb411eb1b13440d5af61660e70e8c5b5b2998a1 (ref https://github.com/google-deepmind/launchpad/blob/3b28eaed02c4294197b9ca2b8988cf68d8b5d868/WORKSPACE#L31-L34)

I think the inconsistency in the pybind11_abseil does not matter as I can reproduce the issue with the TF version in my repro repo. The commit they use also predates the deprecation of the value (and a switch of the default).

Looking closer at the diff of https://github.com/tensorflow/tensorflow/blob/v2.14.0/tensorflow/python/util/tf_stack.cc This is the import that gives ImportError if we import TF after

Between v2.14.0 and v2.13.1, the difference is that 2.14.0 starts using pybind11_abseil. There is usage of that in v2.13.1, but those usage never seems to be part of imports in the Python package, I think we can say that this issue has been there for a while :)

Looking at your test cases in this repo, I see that to test depends on pybind11_abseil:status.so. I understand this is required because we need to go through the regular python imports so the status module need to be in sys.path. However, neither TensorFlow nor launchpad distributes status.so and I think both is depending on the branch which is the default in 1bb411eb1b13440d5af61660e70e8c5b5b2998a1

  if (bypass_regular_import) {
    auto m = reinterpret_borrow<module_>(PyImport_AddModule(
        PYBIND11_TOSTRING(PYBIND11_ABSEIL_STATUS_MODULE_PATH)));
    if (!internal::IsStatusModuleImported()) {
      internal::RegisterStatusBindings(m);
    }
    // else no-op because bindings are already loaded.
    return m;
  }

ethanluoyc avatar Oct 21 '23 00:10 ethanluoyc

Away from keyboard

Qq

Can you build TF also from sources?

I believe: even if that takes you a few hours, it'll still be faster than debugging a frankenbuild with outdated sources.

On Fri, Oct 20, 2023 at 17:17 Yicheng Luo @.***> wrote:

What pybind11_abseil sources where used when TF was built?

2c4932e https://github.com/pybind/pybind11_abseil/commit/2c4932ed6f6204f1656e245838f4f5eae69d2e29 (ref https://github.com/tensorflow/tensorflow/blob/v2.14.0/third_party/pybind11_abseil/workspace.bzl#L10-L13 )

What pybind11_abseil sources does launchpad use?

1bb411e https://github.com/pybind/pybind11_abseil/commit/1bb411eb1b13440d5af61660e70e8c5b5b2998a1 (ref https://github.com/google-deepmind/launchpad/blob/3b28eaed02c4294197b9ca2b8988cf68d8b5d868/WORKSPACE#L31-L34 )

I think the inconsistency in the pybind11_abseil does not matter as I can reproduce the issue with the TF version in my repro repo. The commit they use also predates the deprecation of the value (and a switch of the default).

Looking closer at the diff of https://github.com/tensorflow/tensorflow/blob/v2.14.0/tensorflow/python/util/tf_stack.cc This is the import that gives ImportError if we import TF after

Between v2.14.0 and v2.13.1, the difference is that 2.14.0 starts using pybind11_abseil.

Looking at your test cases in this repo, I see that to test depends on pybind11_abseil:status.so. I understand this is required because we need to go through the regular python imports so the status module need to be in sys.path. However, neither TensorFlow nor launchpad distributes status.so and I think both is depending on the branch which is the default in 1bb411e https://github.com/pybind/pybind11_abseil/commit/1bb411eb1b13440d5af61660e70e8c5b5b2998a1

if (bypass_regular_import) { auto m = reinterpret_borrow<module_>(PyImport_AddModule( PYBIND11_TOSTRING(PYBIND11_ABSEIL_STATUS_MODULE_PATH))); if (!internal::IsStatusModuleImported()) { internal::RegisterStatusBindings(m); } // else no-op because bindings are already loaded. return m; }

— Reply to this email directly, view it on GitHub https://github.com/pybind/pybind11_abseil/issues/12#issuecomment-1773548290, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFUZAGIN5A7VZ52GTSAMYLYAMIBDAVCNFSM6AAAAAA6ICKLBOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONZTGU2DQMRZGA . You are receiving this because you commented.Message ID: @.***>

rwgk avatar Oct 21 '23 00:10 rwgk

I will dig around :) Enjoy your weekend

ethanluoyc avatar Oct 21 '23 00:10 ethanluoyc

This is a rabbit hole..

Built from source TF v2.14.0, I now get imports to work (so we do not throw ImportError in my repro repo).

However, I tried integrating my build into the launchpad repo. If I import launchpad before TensorFlow then things would work ok. But if I import TF first and then Launchpad I can't pass the unit test and it's saying:

Traceback (most recent call last):
  File "/home/yicheng/.cache/bazel/_bazel_yicheng/83e93ea5b191337ae544f4d037b4b862/execroot/launchpad/bazel-out/k8-dbg/bin/courier/python/client_test.runfiles/launchpad/courier/python/client_test.py", line 102, in testCallRebind
    result = self._client.rebind()
  File "/home/yicheng/.cache/bazel/_bazel_yicheng/83e93ea5b191337ae544f4d037b4b862/execroot/launchpad/bazel-out/k8-dbg/bin/courier/python/client_test.runfiles/launchpad/courier/python/client.py", line 56, in inner_function
    raise translate_status(e.status) from e
  File "/home/yicheng/.cache/bazel/_bazel_yicheng/83e93ea5b191337ae544f4d037b4b862/execroot/launchpad/bazel-out/k8-dbg/bin/courier/python/client_test.runfiles/launchpad/courier/python/client.py", line 43, in translate_status
    exc = StatusNotOk(s.message())
  File "<string>", line 6, in __init__
AttributeError: 'str' object has no attribute 'ok'.

The type_name does match though in either case. Building from source I got N4absl12lts_202301256StatusE.

ethanluoyc avatar Oct 21 '23 03:10 ethanluoyc

    exc = StatusNotOk(s.message())
  File "<string>", line 6, in __init__
AttributeError: 'str' object has no attribute 'ok'.

That looks like something I missed in a super-sized series of changes about a year ago. Could you please try this change:

-    exc = StatusNotOk(s.message())
+    exc = StatusNotOk(s)

If I import launchpad before TensorFlow then things would work ok.

Hm... that sounds like you still have two versions of StatusNotOk. The pybind11_abseil commits I see when following the two WORKSPACE links under https://github.com/pybind/pybind11_abseil/issues/12#issuecomment-1773548290 point to two different pybind11_abseil versions:

  • TF: https://github.com/pybind/pybind11_abseil/commits/2c4932ed6f6204f1656e245838f4f5eae69d2e29 (Sep 2022)

  • Launchpad: https://github.com/pybind/pybind11_abseil/commits/1bb411eb1b13440d5af61660e70e8c5b5b2998a1 (Feb 2022)

  • TF StatusNotOk: https://github.com/pybind/pybind11_abseil/blob/f735c948b4c2c78f75e0006cd776b41b6057b7d6/pybind11_abseil/register_status_bindings.cc#L175

  • Launchpad StatusNotOk: https://github.com/pybind/pybind11_abseil/blob/1bb411eb1b13440d5af61660e70e8c5b5b2998a1/pybind11_abseil/register_status_bindings.cc#L166

These are completely different.

Did you update one or the other to use the same version?

Really, pybind11_abseil should only be built once. I'm not familiar at all with how TF or Launchpad use bazel, and how to make both depend on the same sources + build artifacts.

rwgk avatar Oct 21 '23 04:10 rwgk

I pointed the owners of courier/python/client.py here. I'm hoping they'll update to a more modern version of pybind11_abseil. I'm also hoping they'll find a way to ensure TF and Launchpad are using the exact same version.

rwgk avatar Oct 21 '23 04:10 rwgk

Thanks!. Yeah updating the call and fix version fixes the error above.

I guess the confusing part now is why things work if I built TF 2.14.0 from the source but not with their pre-built wheels. For testing, I actually built TF with the exact checkout from their tag and got it working (since I use the same version for launchpad testing anyway).

ethanluoyc avatar Oct 21 '23 09:10 ethanluoyc

I suspect this is a compiler thing. I just tried setting CC=clang and it seems that the error goes away.

ethanluoyc avatar Oct 21 '23 10:10 ethanluoyc

I'm glad you found a path that works.

I suspect this is a compiler thing. I just tried setting CC=clang and it seems that the error goes away.

That makes me suspect that the bypass import mechanism somehow subverts the ABI isolation mechanisms I pointed out before (pybind11/detail/internals.h).

I wouldn't want to root-cause that. IMO the only good use of our time is to work towards

  1. Using a modern version of pybind11_abseil, or at least using bypass_regular_import=false
  2. Work on the bazel BUILD files so that there is no way that more than one pybind11_abseil version is used when building TF and Launchpad from sources.

rwgk avatar Oct 21 '23 18:10 rwgk

Thanks.

I think I will probably want to fix this upstream in launchpad or tensorflow. Building and maintaining bazel project that depends on TensorFlow and is manylinux is quite some work so I hope this gets fixed upstream :)

  1. If fixing in TF, I think upgrading to a more recent version of pybind11_abseil would be useful. Then I can build downstream with the existing toolchain and the problem may go away.
  2. Fixing in launchpad to use a consistent toolchain as TF. The launchpad build is still using a gcc toolchain and I think maybe upgrading to clang as used in TF would probably work as well.

I worked a bit on trying to get the later to work but apparently, it's quite difficult to set up a CC toolchain as used by TF. Maybe I will seek some help from the Launchpad authors or some other repos that use pre-built tensorflow as a dependency to create a clang toolchain to work with this.

ethanluoyc avatar Oct 21 '23 21:10 ethanluoyc

I ran this by @tkoeppe, he pointed out:

sounds a bit like an ODR violation?

It sounds like you have two "fat DSO"s for the TF and LP Python modules and both link in a copy of pybind11_abseil? That seems like a well-known failure mode (of Bazel)?

I think that's very consistent with our observations here.

rwgk avatar Oct 23 '23 23:10 rwgk

Yeah, unfortunately keeping in sync with TF's build dependency is just a lot for many downstream projects.

I don't know how to best fix this either.

Lucky for me, I hijacked a copy of TF's manylinux build toolchain configuration with clang and that seems to allow me to build manylinux wheels successfully. I am using that to work around issues with LaunchPad. There is a sibling project Reverb https://github.com/google-deepmind/reverb/tree/master/reverb which will likely have the same issue and I have created a PR with the fix.

https://github.com/google-deepmind/reverb/pull/136

ethanluoyc avatar Oct 25 '23 00:10 ethanluoyc

A proper "fix" would require (re)architecting a build system that provides a correct implementation of Python native extensions. This is not an easy problem.

A slightly cleaner "hack" for the meantime might be to just control symbol visibility for Python module DSOs better. If the (duplicately linked) Abseil dependencies in each module are not exported, there's less chance of a mismatch at runtime.

tkoeppe avatar Oct 25 '23 12:10 tkoeppe