[LLVM] dynamic_cast issue
Describe the issue
The C++ dynamic_cast only works when it is called from the same shared library that contains the definitions of the Base and Derived classes (that is, when all the code is located in the same shared library). If dynamic_cast is instead called from another shared library (one that links against the library containing the actual class definitions), it fails and returns a null pointer.
Steps to reproduce the issue
- git clone https://github.com/changlun-adyen/graalvm-dynamic-cast-issue.git
- change the GraalVM toolchain path in build.sh
- execute build.sh
- execute run.sh (make sure the env JAVA_HOME is set to the GraalVM JDK)
Describe GraalVM and your environment:
- GraalVM version: 22.1.0 EE
- JDK major version: JDK11
- OS: macOS Monterey
- Architecture: AMD64
The output of java -Xinternalversion: Java HotSpot(TM) 64-Bit Server VM (11.0.15+8-LTS-jvmci-22.1-b05) for bsd-amd64 JRE (11.0.15+8-LTS-jvmci-22.1-b05), built on Apr 4 2022 04:05:20 by "graal1" with gcc 4.2.1 Compatible Apple LLVM 11.0.0 (clang-1100.0.33.17)
More details
The code below fails at dynamic_cast, which should not happen no matter where the code is located.
#include <iostream>
#include "test.h"

extern "C" {
    void test2() {
        Derived *ptr_derived = new Derived();
        Base *ptr_base = ptr_derived;
        auto* cast_ptr = dynamic_cast<Derived*>(ptr_base);
        if (cast_ptr) {
            std::cout << "ok" << std::endl;
        } else {
            std::cout << "failed at dynamic_cast" << std::endl;
        }
    }
}
The label should be changed to llvm. Sorry that I put it in bug report.
So, I've been playing around a bit with this reproducer. There are several weird things going on here.
First of all: I can also reproduce this issue without GraalVM at all, simply by adding a main file that loads the two libraries with dlopen, compiling the whole thing with the native-mode clang, and running it. I think the reason has to do with weak symbols not interacting well with dynamic library loading on Linux. The funny thing is, when I compile the whole thing with gcc, it works fine. Not sure what to think of that. It might be a bug in LLVM.
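For readers who want to try the non-GraalVM variant, a driver along those lines could look roughly like the sketch below. This is not the code from the linked repository: the helper name run_reproducer, the return codes, and the assumption that test2 is exported by the second library are all made up for illustration.

```cpp
#include <dlfcn.h>
#include <cstdio>

// Hypothetical helper (not from the original repro): load both shared
// libraries with the given dlopen flags, then call the extern "C" entry
// point test2() looked up in the second library.
// Returns 0 if test2() was called, non-zero if loading or lookup failed.
int run_reproducer(const char* lib_path, const char* lib2_path, int flags) {
    void* lib = dlopen(lib_path, flags);
    if (!lib) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    void* lib2 = dlopen(lib2_path, flags);
    if (!lib2) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        dlclose(lib);
        return 2;
    }

    // Assumption: test2 lives in the second library and takes no arguments.
    using test_fn = void (*)();
    auto test2 = reinterpret_cast<test_fn>(dlsym(lib2, "test2"));

    int rc = 0;
    if (test2) {
        test2();  // prints its own result, as in the snippet from the report
    } else {
        std::fprintf(stderr, "dlsym failed: %s\n", dlerror());
        rc = 3;
    }

    dlclose(lib2);
    dlclose(lib);
    return rc;
}
```

A call such as run_reproducer("./libtest.so", "./libtest2.so", RTLD_NOW | RTLD_LOCAL) from main would exercise it (the paths are placeholders). Passing the flags in as a parameter makes it easy to compare different binding modes. On older glibc versions the program may need to be linked with -ldl.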
See https://github.com/aalexand/sharedlib_typeinfo for a nice writeup about these problems in general.
See https://github.com/rschatz/graalvm-dynamic-cast-issue/tree/master for a clone of your repro case, where I'm trying to run the same test in multiple configurations, with and without GraalVM.
The funny thing is: Running the main C program does work just fine on GraalVM and GCC, while it fails with Clang and without GraalVM. The Java reproducer always fails. That is for sure a bug in GraalVM, I don't think it should make a difference whether you dlopen a library or polyglot.eval it. But I'm not sure in which direction the bug goes, in particular since GCC and Clang differ here in their behavior.
One interesting observation is, when looking at the exported symbols of the libraries, we see that the "typeinfo" for both types is in both libraries, although just as a "weak" symbol (V means weak):
$ nm --demangle managed/libtest.so
[...]
0000000000003708 V typeinfo for Base
0000000000003718 V typeinfo for Derived
[...]
$ nm --demangle managed/libtest2.so
[...]
0000000000003630 V typeinfo for Base
0000000000003640 V typeinfo for Derived
[...]
See https://github.com/rschatz/graalvm-dynamic-cast-issue/tree/fixed for a variant of the test where this is worked around by adding a virtual destructor to both classes:
$ nm --demangle managed/libtest.so
[...]
0000000000003a68 D typeinfo for Base
0000000000003a78 D typeinfo for Derived
[...]
$ nm --demangle managed/libtest2.so
[...]
                 U typeinfo for Base
                 U typeinfo for Derived
[...]
This makes the typeinfo symbols strong in libtest.so, and undefined in libtest2.so. And suddenly all tests work as expected in all variants.
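My reading of why this helps, sketched below: under the Itanium C++ ABI that both GCC and Clang follow on Linux, a class whose first non-inline, non-pure virtual function (the "key function") is defined in exactly one translation unit gets its vtable and typeinfo emitted only there. The class layout here is an assumption; the real test.h in the branch may look different, and the sketch uses a single file only so that it compiles on its own.

```cpp
// Hypothetical test.h: the destructors are declared virtual but NOT
// defined in the header, so they are not inline.
struct Base {
    virtual ~Base();
};

struct Derived : Base {
    ~Derived() override;
};

// Hypothetical test.cpp, which would be compiled into libtest.so only.
// These out-of-line definitions act as the key functions, so the
// typeinfo for Base and Derived is emitted with this translation unit.
Base::~Base() = default;
Derived::~Derived() = default;

// Code that would live in the other library: it only sees the header and
// refers to the typeinfo as an undefined symbol resolved at load time.
bool cast_works() {
    Base* ptr_base = new Derived();
    bool ok = dynamic_cast<Derived*>(ptr_base) != nullptr;
    delete ptr_base;  // safe because the destructor is virtual
    return ok;
}
```

In a single binary this cast succeeds either way. The point is only where the typeinfo gets emitted once the definitions are split across two libraries as in the test.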
Another possibility (suggested in https://github.com/aalexand/sharedlib_typeinfo) is to dlopen with RTLD_GLOBAL. Unfortunately we don't have an equivalent of the RTLD_GLOBAL flag for polyglot.eval yet, so I could test this only with the dlopen variant of the test. But it seems to fix the issue, too (see https://github.com/rschatz/graalvm-dynamic-cast-issue/tree/rtld_global).
To summarize: C++ runtime type info doesn't interact very well with dynamic loading of libraries. We definitely have bugs there in GraalVM, but I'm not 100% sure if fixing these bugs is "enough" to fix this issue. I definitely have to investigate more ;)
Thanks for the investigation and great explanation. We will check whether adding virtual destructors is sufficient as a workaround in our usage for now.