compute-runtime
ocloc missing symbols from libigc.so
howdy,
i'm trying to compile this fine project for alpine linux. along the way i ran into a bunch of roadblocks; in this (and following) issues i'll try to document and validate my fixes for them. first issue:
Error loading the Generic builtin resource
Build failed with error code: -11
Command was: /home/s/tasks/aports/ugly/compute-runtime/src/compute-runtime-20.08.15750/build/bin/ocloc -q -file scheduler.cl -device bdw -cl-intel-greater-than-4GB-buffer-required -64 -out_dir /home/s/tasks/aports/ugly/compute-runtime/src/compute-runtime-20.08.15750/build/bin/scheduler/x64/gen8 -cpp_file -options -I/usr/include/igc -I/usr/include/igc/cif -I/usr/include/igc/ocl_igc_shared/executable_format -I/usr/include/igc/ocl_igc_shared/device_enqueue -I ../gen8 -cl-kernel-arg-info -cl-std=CL2.0 -cl-intel-disable-a64WA
make[2]: *** [igdrcl_lib_release/scheduler/CMakeFiles/scheduler_Gen8core.dir/build.make:62: bin/scheduler/x64/gen8/scheduler_Gen8core.bin] Error 245
make[2]: Leaving directory '/home/s/tasks/aports/ugly/compute-runtime/src/compute-runtime-20.08.15750/build'
make[1]: *** [CMakeFiles/Makefile2:6914: igdrcl_lib_release/scheduler/CMakeFiles/scheduler_Gen8core.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
fixed by adding igc to the target_link_libraries of the offline_compiler (this patch also adds the libs for the symbols - like backtrace et al - that musl does not provide)
diff -Nurw compute-runtime-20.08.15750/offline_compiler/CMakeLists.txt src/compute-runtime-20.08.15750/offline_compiler/CMakeLists.txt
--- compute-runtime-20.08.15750/offline_compiler/CMakeLists.txt 2020-02-29 00:33:03.068525017 +0000
+++ src/compute-runtime-20.08.15750/offline_compiler/CMakeLists.txt 2020-02-29 00:41:59.361882810 +0000
@@ -140,7 +140,7 @@
endif()
if(UNIX)
- target_link_libraries(ocloc dl pthread)
+ target_link_libraries(ocloc dl unwind execinfo igc)
endif()
set_target_properties(ocloc PROPERTIES FOLDER "offline_compiler")
it seems that linking libigc.so fixes this problem, but i wonder why libigc is missing at all. is it dynamically loaded, and something goes wrong during that? or is it ok to just add libigc as a target_link_library?
thanks for any insights.
IGC libraries are being loaded by ocloc using dlopen, see
- line 332
- line 16
- line 37
- line 15
- [IGC line 2060](https://github.com/intel/intel-graphics-compiler/blob/master/IGC/CMakeLists.txt#L2060)
- line 373
Can you run ocloc manually with the LD_DEBUG=libs environment variable set? As you link ocloc with the igc library, I suppose there could be a problem with symbol conflicts. You can also run ocloc with LD_DEBUG=all set to see how symbols are resolved.
musl does not implement LD_DEBUG. i'll strace ocloc instead and see what libs it loads.
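Since musl has no LD_DEBUG, another way to see which objects end up mapped (besides strace) is to ask the dynamic loader from inside the process. A minimal sketch, assuming `dl_iterate_phdr` from `<link.h>`, which both glibc and musl provide; `collect_object` and `loaded_objects` are hypothetical names and not part of ocloc:

```cpp
#include <link.h>    // dl_iterate_phdr, struct dl_phdr_info
#include <cstddef>
#include <string>
#include <vector>

// Callback invoked once per loaded object; `data` points at our vector.
static int collect_object(struct dl_phdr_info *info, size_t, void *data) {
    auto *names = static_cast<std::vector<std::string> *>(data);
    // The main executable is usually reported with an empty name.
    names->emplace_back(info->dlpi_name ? info->dlpi_name : "");
    return 0;  // returning 0 tells the loader to keep iterating
}

// Hypothetical helper: names of all objects currently mapped in this process.
std::vector<std::string> loaded_objects() {
    std::vector<std::string> names;
    dl_iterate_phdr(collect_object, &names);
    return names;
}
```

This only reports what is mapped. It says nothing about which symbol scope a library was loaded into, so it cannot by itself explain a failing `dlsym`.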
what's loaded:
"/usr/lib/libunwind.so.8"
"/usr/lib/libexecinfo.so.1"
"/usr/lib/libstdc++.so.6"
"/usr/lib/libgcc_s.so.1"
"/usr/lib/libigdfcl.so.1"
"/usr/lib/libLLVM-9.so"
"/usr/lib/libopencl-clang.so.9"
"/usr/lib/libffi.so.6"
"/lib/libz.so.1"
"/usr/lib/libxml2.so.2"
"/usr/lib/liblzma.so.5"
"/usr/lib/libigc.so.1"
strangely libigc gets loaded, but i still get the same error as described at the top of this issue. however, if i add LD_PRELOAD=/usr/lib/libigc.so.1 to the ocloc invocation, ocloc succeeds and returns without output.
it looks like libigc.so is mapped correctly, as seen in strace:
open("/usr/lib/libigc.so.1", O_RDONLY|O_CLOEXEC) = 3
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
fstat(3, {st_mode=S_IFREG|0755, st_size=28370792, ...}) = 0
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\241\r\0\0\0\0\0"..., 960) = 960
mmap(NULL, 28577792, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f3031a96000
mmap(0x7f3031b6b000, 7450624, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0xd5000) = 0x7f3031b6b000
mmap(0x7f3032286000, 1400832, PROT_READ, MAP_PRIVATE|MAP_FIXED, 3, 0x7f0000) = 0x7f3032286000
mmap(0x7f30323dc000, 18853888, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x945000) = 0x7f30323dc000
mmap(0x7f30335a5000, 204800, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f30335a5000
close(3) = 0
"Error loading the Generic builtin resource" this message came from IGC, see line 951
when i run ocloc with LD_PRELOAD=/usr/lib/libigc.so
and strace it, then i get this:
open("/usr/lib/libigc.so.1", O_RDONLY|O_CLOEXEC) = 3
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
fstat(3, {st_mode=S_IFREG|0755, st_size=28370792, ...}) = 0
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\241\r\0\0\0\0\0"..., 960) = 960
mmap(NULL, 28577792, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7ff64758a000
mmap(0x7ff64765f000, 7450624, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0xd5000) = 0x7ff64765f000
mmap(0x7ff647d7a000, 1400832, PROT_READ, MAP_PRIVATE|MAP_FIXED, 3, 0x7f0000) = 0x7ff647d7a000
mmap(0x7ff647ed0000, 18853888, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x945000) = 0x7ff647ed0000
mmap(0x7ff649099000, 204800, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ff649099000
close(3) = 0
looks the same as in https://github.com/intel/compute-runtime/issues/265#issuecomment-594522554
only it is loaded first, not last
> "Error loading the Generic builtin resource" this message came from IGC, see line 951
yes, and it is fixed by linking/preloading libigc to ocloc, i traced that.
Can you verify IGC was built correctly, linked with builtins? I've observed successful IGC builds although there were problems with builtins.
> "Error loading the Generic builtin resource" this message came from IGC, see line 951
llvm::LoadBufferFromResource fails and returns NULL here when libigc is only dlopened rather than linked/preloaded: https://github.com/intel/intel-graphics-compiler/blob/67351f4e52f52eb2e4d68a9a40599a77733b5603/IGC/AdaptorOCL/OCL/LoadBuffer.cpp#L59
> Can you verify IGC was built correctly, linked with builtins? I've observed successful IGC builds although there were problems with builtins.
sure, how do i verify builtins?
On an Ubuntu 20.04 system I see these builtin symbols exported by igc 1.0.3390:
nm -DC /usr/lib/x86_64-linux-gnu/libigc.so | grep _igc_bif_
00000000018dc020 D _igc_bif_BC_120
00000000018dc000 D _igc_bif_BC_120_size
0000000001957680 D _igc_bif_BC_121
000000000195766c D _igc_bif_BC_121_size
00000000019d3be0 D _igc_bif_BC_122
00000000019d3bdc D _igc_bif_BC_122_size
$ dpkg -l libigc | grep libigc
ii libigc 1.0.3390-1~ppa1~focal1 amd64 Intel(R) Graphics Compiler
confirmed, i have this:
% nm -DC /usr/lib/libigc.so | grep _igc_bif_
0000000000aa7020 D _igc_bif_BC_120
0000000000aa7000 D _igc_bif_BC_120_size
0000000000b226e0 D _igc_bif_BC_121
0000000000b226d8 D _igc_bif_BC_121_size
0000000000b9ecc0 D _igc_bif_BC_122
0000000000b9eca8 D _igc_bif_BC_122_size
It's weird, maybe dlopen works differently on musl. ocloc uses dlopen with RTLD_LAZY | RTLD_DEEPBIND to load fcl and igc, then igc loads the builtin symbols using dlsym with RTLD_DEFAULT.
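For illustration, the lookup described above might look roughly like this. It is a sketch and not IGC's actual code: `ResourceSymbols` and `find_resource` are hypothetical names, and only the `<name>` / `<name>_size` symbol pairing is taken from the nm output earlier in this thread.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // RTLD_DEFAULT is an extension in glibc's <dlfcn.h>
#endif
#include <dlfcn.h>   // dlsym, RTLD_DEFAULT
#include <string>

// Hypothetical pair describing one embedded resource: the data symbol and
// its companion "<name>_size" symbol.
struct ResourceSymbols {
    void *data = nullptr;
    void *size = nullptr;
    bool found() const { return data != nullptr && size != nullptr; }
};

// Hypothetical helper: resolve both symbols through the default search scope
// instead of through a specific library handle.
ResourceSymbols find_resource(const std::string &base_name) {
    ResourceSymbols r;
    r.data = dlsym(RTLD_DEFAULT, base_name.c_str());
    r.size = dlsym(RTLD_DEFAULT, (base_name + "_size").c_str());
    return r;
}
```

Called as `find_resource("_igc_bif_BC_120")`, this can only succeed if the library defining the symbol is visible to the `RTLD_DEFAULT` search. The symbols being exported (as nm shows) is necessary but not sufficient.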
after consulting with the fine people of #musl the suggestion was to add RTLD_GLOBAL to this line:
https://github.com/intel/compute-runtime/blob/master/shared/source/os_interface/linux/os_library_linux.cpp#L33
and it seems to work.
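The suggested change amounts to something like the following sketch. `open_library` is a hypothetical wrapper, not the real function in os_library_linux.cpp; the `RTLD_LAZY | RTLD_DEEPBIND` base flags are the ones mentioned earlier in this thread, and `RTLD_DEEPBIND` is guarded because it is a glibc extension that may not be defined by other libcs.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes RTLD_DEEPBIND on glibc
#endif
#include <dlfcn.h>   // dlopen, RTLD_* flags

// Hypothetical wrapper: open a library and optionally export its symbols
// to the global scope so later dlsym(RTLD_DEFAULT, ...) calls can see them.
void *open_library(const char *path, bool export_symbols) {
    int flags = RTLD_LAZY;
#ifdef RTLD_DEEPBIND
    flags |= RTLD_DEEPBIND;  // prefer the library's own symbols (glibc only)
#endif
    if (export_symbols) {
        flags |= RTLD_GLOBAL;  // the flag suggested by the musl folks
    }
    return dlopen(path, flags);  // nullptr on failure; see dlerror()
}
```

One trade-off to keep in mind: with `RTLD_GLOBAL`, the library's symbols become visible to every library loaded afterwards, which is broader than what the lookup strictly needs.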
Yeah, I've prepared a simple reproducer on alpine, and it looks like the RTLD_GLOBAL flag is required there for dlsym with RTLD_DEFAULT to find a symbol in a library loaded by dlopen.
~ # cat foo.cpp
#include <stdio.h>
#include <dlfcn.h>

extern "C" {
int foo();
int boo();
}

// boo() returns int, so the function pointer type must match
typedef int (*fun_boo)();

int foo()
{
    void *m = RTLD_DEFAULT; // look boo up in the default (global) scope
    printf("foo\n");
    fun_boo libboo = (fun_boo) dlsym(m, "boo");
    if (!libboo) {
        printf("%s\n", dlerror());
        return 22;
    }
    libboo();
    return 0;
}

int boo()
{
    printf("boo\n");
    return 0;
}
~ # cat main.cpp
#include <dlfcn.h>
#include <stdio.h>

typedef int (*foo)();

int main()
{
    int ret;
    void *lib = dlopen("libfoo.so", RTLD_LAZY | RTLD_GLOBAL);
    if (!lib) {
        printf("%s\n", dlerror());
        return 2;
    }
    foo libfoo = (foo) dlsym(lib, "foo");
    if (!libfoo) {
        printf("%s\n", dlerror());
        return 22;
    }
    ret = libfoo();
    dlclose(lib);
    return ret;
}
~ # cat Makefile
all:
	clang++ -g -shared -o libfoo.so foo.cpp
	clang++ -g -o main main.cpp -ldl
~ # export LD_LIBRARY_PATH=`pwd`
~ # ./main
foo
boo
When I remove the RTLD_GLOBAL flag from dlopen, dlsym fails to find boo:
~ # ./main
foo
Symbol not found: boo
The same test works correctly on Fedora without RTLD_GLOBAL flag.
with a few more minor changes i managed to compile most of it now. however, at the "end" i get random errors, like:
Running igdrcl_tests 1x6x16 in /home/s/tasks/aports/ugly/compute-runtime/src/compute-runtime-20.08.15750/build/bin/tgllp
cd /home/s/tasks/aports/ugly/compute-runtime/src/compute-runtime-20.08.15750/build/bin && /usr/bin/cmake -E env GTEST_OUTPUT=xml:test_logs/test_details_tgllp_1_6_16.xml /home/s/tasks/aports/ugly/compute-runtime/src/compute-runtime-20.08.15750/build/bin/igdrcl_tests --product tgllp --slices 1 --subslices 6 --eu_per_ss 16 --gtest_catch_exceptions=1 --gtest_repeat=1 --gtest_shuffle --gtest_random_seed=0 --disable_default_listener
product family: tgllp (29)
set timeout to: 45
Iteration: 1. random_seed: 31532
unknown file: Failure
C++ exception with description "Abort was called at 121 line in file /home/s/tasks/aports/ugly/compute-runtime/src/compute-runtime-20.08.15750/core/memory_manager/gfx_partition.cpp" thrown in the test body.
[ FAILED ][ TGLLP ][ 31532 ] DeviceGenEngineTest.givenNonHwCsrModeWhenGetEngineThenDefaultEngineIsReturned
unknown file: Failure
C++ exception with description "Abort was called at 121 line in file /home/s/tasks/aports/ugly/compute-runtime/src/compute-runtime-20.08.15750/core/memory_manager/gfx_partition.cpp" thrown in the test body.
[ FAILED ][ TGLLP ][ 31532 ] DeviceGenEngineTest.whenCreateDeviceThenInternalEngineHasDefaultType
unknown file: Failure
C++ exception with description "Abort was called at 121 line in file /home/s/tasks/aports/ugly/compute-runtime/src/compute-runtime-20.08.15750/core/memory_manager/gfx_partition.cpp" thrown in the test body.
[ FAILED ][ TGLLP ][ 31532 ] DeviceGenEngineTest.givenHwCsrModeWhenGetEngineThenDedicatedForInternalUsageEngineIsReturned
SIGSEGV on: CommandEncodeSemaphore.whenAddingMiSemaphoreCommandThenExpectCompareFieldsAreSetCorrectly
Child aborted
make[2]: *** [unit_tests/CMakeFiles/run_tgllp_unit_tests.dir/build.make:62: run_tgllp_unit_tests] Error 1
the patches i applied to get this far can be seen in: https://github.com/aports-ugly/aports/commit/ee974430335a20f7d8645d909fa6733fdfc745eb
i'm happy to either elaborate on the patches from the previous comment here in this issue (and then we could rename the issue to something like 'porting to alpine linux'), or i can open separate issues for these patches, whatever is more convenient for you.