
[New Package] Add rocBLAS 4.2.0

Open jpsamaroo opened this issue 3 years ago • 19 comments

jpsamaroo avatar Jan 14 '22 17:01 jpsamaroo

I'm guessing the glibc dlopen failure might be due to usage of AVX2?

jpsamaroo avatar Jan 14 '22 17:01 jpsamaroo

amdci7 should support AVX2, and an ISA issue should probably throw a SIGILL error, not a segmentation fault

giordano avatar Jan 14 '22 17:01 giordano

Ok will try to track this down on amdci2

jpsamaroo avatar Jan 14 '22 17:01 jpsamaroo

Ah shoot, the issue was that the library could not be dlopen'ed at all.

vchuravy avatar Jan 14 '22 18:01 vchuravy

gdb points to a crash during library initialization, so I guess I should be on the lookout for "fancy" things that rocBLAS is trying to do during init.

jpsamaroo avatar Jan 17 '22 03:01 jpsamaroo

In case this is familiar to anyone:

#0  0xffffffffffffffff in ?? ()
#1  0x00007ffff7de38f3 in call_init (env=0x9fb0c0, argv=0x7fffffffde88, argc=4, l=<optimized out>) at dl-init.c:72
#2  _dl_init (main_map=main_map@entry=0xef7a60, argc=4, argv=0x7fffffffde88, env=0x9fb0c0) at dl-init.c:119
#3  0x00007ffff7de83bf in dl_open_worker (a=a@entry=0x7fffffff97a0) at dl-open.c:522
#4  0x00007ffff77261ef in __GI__dl_catch_exception (exception=0x7fffffff9780, operate=0x7ffff7de7f80 <dl_open_worker>, args=0x7fffffff97a0) at dl-error-skeleton.c:196
#5  0x00007ffff7de798a in _dl_open (file=0x7fffffff9ae0 "/home/jsamaroo/.julia/artifacts/00c0f592384c05d60db60eeba737077341004203/rocblas/lib/librocblas.so",
    mode=-2147483639, caller_dlopen=0x7ffff6a36cc9 <jl_load_dynamic_library+601>, nsid=<optimized out>, argc=4, argv=<optimized out>, env=0x9fb0c0) at dl-open.c:605
#6  0x00007ffff7bcff96 in dlopen_doit (a=a@entry=0x7fffffff99d0) at dlopen.c:66
#7  0x00007ffff77261ef in __GI__dl_catch_exception (exception=exception@entry=0x7fffffff9970, operate=0x7ffff7bcff40 <dlopen_doit>, args=0x7fffffff99d0)
    at dl-error-skeleton.c:196
#8  0x00007ffff772627f in __GI__dl_catch_error (objname=0x602270, errstring=0x602278, mallocedp=0x602268, operate=<optimized out>, args=<optimized out>)
    at dl-error-skeleton.c:215
#9  0x00007ffff7bd0745 in _dlerror_run (operate=operate@entry=0x7ffff7bcff40 <dlopen_doit>, args=args@entry=0x7fffffff99d0) at dlerror.c:162
#10 0x00007ffff7bd0051 in __dlopen (file=file@entry=0x7fffffff9ae0 "/home/jsamaroo/.julia/artifacts/00c0f592384c05d60db60eeba737077341004203/rocblas/lib/librocblas.so",
    mode=<optimized out>) at dlopen.c:87
#11 0x00007ffff6a36a69 in jl_dlopen (
    filename=filename@entry=0x7fffffff9ae0 "/home/jsamaroo/.julia/artifacts/00c0f592384c05d60db60eeba737077341004203/rocblas/lib/librocblas.so", flags=flags@entry=68)
    at /buildworker/worker/package_linux64/build/src/dlload.c:123
#12 0x00007ffff6a36cc9 in jl_load_dynamic_library (
    modname=0x7fffed006598 "/home/jsamaroo/.julia/artifacts/00c0f592384c05d60db60eeba737077341004203/rocblas/lib/librocblas.so", flags=<optimized out>, throw_err=1)
    at /buildworker/worker/package_linux64/build/src/dlload.c:267
#13 0x00007fffe2b2a69d in julia_#dlopen#3_21642 () at libdl.jl:117
#14 0x00007fffe2c0d3bf in dlopen () at libdl.jl:117

jpsamaroo avatar Jan 17 '22 04:01 jpsamaroo

The latest commit enables building with Tensile for two reasons:

  1. We'll need it for generating competitive BLAS kernels
  2. It might fix the segfault we're seeing (I would bet that AMD doesn't test rocBLAS builds without Tensile)

@haampie if you get the chance, I would appreciate it if you could give some insight into why this build is failing. The ASM being compiled looks valid to me, even though the compiler disagrees.

jpsamaroo avatar Jan 25 '22 19:01 jpsamaroo

I have never dlopen'ed rocblas.so, so I'm afraid I can't help out :( Isn't Tensile required to get BLAS 3 kernels at all?

haampie avatar Jan 25 '22 20:01 haampie

By the way, hipcc inlines everything by default, but that can be disabled (see https://github.com/ROCm-Developer-Tools/HIP/blob/37cb3a34938af39303b73aceb2d7803f5c7ca7ca/bin/hipcc#L522-L525). Maybe worth trying?

haampie avatar Jan 25 '22 20:01 haampie

Somehow, this PR has processes that are still running on the Yggdrasil workers. They all look like:

python3 /workspace/srcdir/rocBLAS-rocm-4.2.0/build/virtualenv/lib/python3.8/site-packages/Tensile/bin/TensileCreateLibrary --merge-files --no-short-file-names --no-library-print-debug --architecture=gfx900 --code-object-version=V3 --cxx-compiler=hipcc --library-format=msgpack /workspace/srcdir/rocBLAS-rocm-4.2.0/library/src/blas3/Tensile/Logic/asm_full /workspace/srcdir/rocBLAS-rocm-4.2.0/build/Tensile HIP

They aren't dying properly. I've restarted the agents, but be aware that this PR is causing problems on the workers.

staticfloat avatar Feb 27 '22 02:02 staticfloat

Running LD_DEBUG=all julia -e "Libc.Libdl.dlopen(\"./librocblas.so\")" gives a little bit more info: https://drive.google.com/file/d/1qqOaUzqtnPjNcAitHX9nU7ajJwtys-D7/view?usp=sharing

There are a couple of errors like this, although I'm not sure how important they are: 212944: /home/asmirnov/julia-1.7.3/bin/../lib/julia/libopenblas64_.so: error: symbol lookup error: undefined symbol: isamax_ (fatal)

But the whole process ends shortly after 212944: calling init: ./librocblas.so:

    212944:	calling init: ./librocblas.so
    212944:	
    212944:	symbol=__cxa_guard_acquire;  lookup in file=./librocblas.so [0]
    212944:	symbol=__cxa_guard_acquire;  lookup in file=/home/asmirnov/julia-1.7.3/bin/../lib/julia/libgcc_s.so.1 [0]
    212944:	symbol=__cxa_guard_acquire;  lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
    212944:	symbol=__cxa_guard_acquire;  lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
    212944:	symbol=__cxa_guard_acquire;  lookup in file=/lib/x86_64-linux-gnu/librt.so.1 [0]
    212944:	symbol=__cxa_guard_acquire;  lookup in file=/home/asmirnov/code/rocm-bb/build/x86_64-linux-gnu-cxx11/5LJl6hOi/x86_64-linux-gnu-libgfortran5-cxx11/destdir/rocblas/lib/./../../hip/lib/libamdhip64.so.4 [0]
    212944:	symbol=__cxa_guard_acquire;  lookup in file=/lib/x86_64-linux-gnu/libdl.so.2 [0]
    212944:	symbol=__cxa_guard_acquire;  lookup in file=/home/asmirnov/julia-1.7.3/bin/../lib/julia/libz.so.1 [0]
    212944:	symbol=__cxa_guard_acquire;  lookup in file=/home/asmirnov/julia-1.7.3/bin/../lib/julia/libstdc++.so.6 [0]
    212944:	binding file /home/asmirnov/code/rocm-bb/build/x86_64-linux-gnu-cxx11/5LJl6hOi/x86_64-linux-gnu-libgfortran5-cxx11/destdir/rocblas/lib/./../../hip/lib/libamdhip64.so.4 [0] to /home/asmirnov/julia-1.7.3/bin/../lib/julia/libstdc++.so.6 [0]: normal symbol `__cxa_guard_acquire' [CXXABI_1.3]
    212944:	symbol=getenv;  lookup in file=./librocblas.so [0]
    212944:	symbol=getenv;  lookup in file=/home/asmirnov/julia-1.7.3/bin/../lib/julia/libgcc_s.so.1 [0]
    212944:	symbol=getenv;  lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
    212944:	symbol=getenv;  lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
    212944:	symbol=getenv;  lookup in file=/lib/x86_64-linux-gnu/librt.so.1 [0]
    212944:	symbol=getenv;  lookup in file=/home/asmirnov/code/rocm-bb/build/x86_64-linux-gnu-cxx11/5LJl6hOi/x86_64-linux-gnu-libgfortran5-cxx11/destdir/rocblas/lib/./../../hip/lib/libamdhip64.so.4 [0]
    212944:	symbol=getenv;  lookup in file=/lib/x86_64-linux-gnu/libdl.so.2 [0]
    212944:	symbol=getenv;  lookup in file=/home/asmirnov/julia-1.7.3/bin/../lib/julia/libz.so.1 [0]
    212944:	symbol=getenv;  lookup in file=/home/asmirnov/julia-1.7.3/bin/../lib/julia/libstdc++.so.6 [0]
    212944:	symbol=getenv;  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
    212944:	binding file /home/asmirnov/code/rocm-bb/build/x86_64-linux-gnu-cxx11/5LJl6hOi/x86_64-linux-gnu-libgfortran5-cxx11/destdir/rocblas/lib/./../../hip/lib/libamdhip64.so.4 [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `getenv' [GLIBC_2.2.5]
    212944:	symbol=__cxa_guard_release;  lookup in file=./librocblas.so [0]
    212944:	symbol=__cxa_guard_release;  lookup in file=/home/asmirnov/julia-1.7.3/bin/../lib/julia/libgcc_s.so.1 [0]
    212944:	symbol=__cxa_guard_release;  lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
    212944:	symbol=__cxa_guard_release;  lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
    212944:	symbol=__cxa_guard_release;  lookup in file=/lib/x86_64-linux-gnu/librt.so.1 [0]
    212944:	symbol=__cxa_guard_release;  lookup in file=/home/asmirnov/code/rocm-bb/build/x86_64-linux-gnu-cxx11/5LJl6hOi/x86_64-linux-gnu-libgfortran5-cxx11/destdir/rocblas/lib/./../../hip/lib/libamdhip64.so.4 [0]
    212944:	symbol=__cxa_guard_release;  lookup in file=/lib/x86_64-linux-gnu/libdl.so.2 [0]
    212944:	symbol=__cxa_guard_release;  lookup in file=/home/asmirnov/julia-1.7.3/bin/../lib/julia/libz.so.1 [0]
    212944:	symbol=__cxa_guard_release;  lookup in file=/home/asmirnov/julia-1.7.3/bin/../lib/julia/libstdc++.so.6 [0]
    212944:	binding file /home/asmirnov/code/rocm-bb/build/x86_64-linux-gnu-cxx11/5LJl6hOi/x86_64-linux-gnu-libgfortran5-cxx11/destdir/rocblas/lib/./../../hip/lib/libamdhip64.so.4 [0] to /home/asmirnov/julia-1.7.3/bin/../lib/julia/libstdc++.so.6 [0]: normal symbol `__cxa_guard_release' [CXXABI_1.3]

signal (11): Segmentation fault
in expression starting at none:1
unknown function (ip: (nil))
Allocations: 2721 (Pool: 2711; Big: 10); GC: 0
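
For anyone repeating this on a long LD_DEBUG log, here is a minimal Python sketch for finding the last object whose initializers the loader started running before the crash. The function name `last_init_called` is made up for illustration, and the sample lines are copied from the log excerpt above; the regular expression only assumes the "PID: calling init: PATH" shape seen there.

```python
import re

# Matches lines shaped like "    212944:\tcalling init: ./librocblas.so"
CALLING_INIT = re.compile(r"^\s*(\d+):\s+calling init:\s+(\S+)\s*$")


def last_init_called(log_text):
    """Return (pid, path) from the last 'calling init' line, or None."""
    last = None
    for line in log_text.splitlines():
        match = CALLING_INIT.match(line)
        if match:
            last = (int(match.group(1)), match.group(2))
    return last


# Sample lines taken from the LD_DEBUG excerpt in this thread.
sample = "\n".join([
    "    212944:\tcalling init: ./librocblas.so",
    "    212944:\tsymbol=__cxa_guard_acquire;  lookup in file=./librocblas.so [0]",
    "    212944:\tsymbol=getenv;  lookup in file=./librocblas.so [0]",
])

print(last_init_called(sample))
```

If the last "calling init" entry is the library under test and the log stops soon afterwards, that points at its initializers, which matches what the gdb backtrace suggests.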

pxl-th avatar Jul 01 '22 12:07 pxl-th

Here's the readelf output as well.

$ readelf -d librocblas.so

Dynamic section at offset 0x1aae680 contains 37 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [librt.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libamdhip64.so.4]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libz.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-x86-64.so.2]
 0x000000000000000e (SONAME)             Library soname: [librocblas.so.0]
 0x000000000000001d (RUNPATH)            Library runpath: [$ORIGIN/../../lib:$ORIGIN/../../hip/lib]
 0x000000000000000c (INIT)               0x41000
 0x000000000000000d (FINI)               0x6a9c80
 0x0000000000000019 (INIT_ARRAY)         0x1aa3668
 0x000000000000001b (INIT_ARRAYSZ)       3176 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x1aa42d0
 0x000000000000001c (FINI_ARRAYSZ)       16 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x1ad5158
 0x0000000000000005 (STRTAB)             0x1ac3000
 0x0000000000000006 (SYMTAB)             0x1e70
 0x000000000000000a (STRSZ)              74070 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000003 (PLTGOT)             0x1aaf910
 0x0000000000000002 (PLTRELSZ)           15336 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x3cee0
 0x0000000000000007 (RELA)               0x1cd98
 0x0000000000000008 (RELASZ)             131400 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000000000001e (FLAGS)              BIND_NOW
 0x000000006ffffffb (FLAGS_1)            Flags: NOW
 0x000000006ffffffe (VERNEED)            0x1cb48
 0x000000006fffffff (VERNEEDNUM)         7
 0x000000006ffffff0 (VERSYM)             0x1c086
 0x000000006ffffff9 (RELACOUNT)          4599
 0x0000000000000000 (NULL)               0x0
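
As a quick sanity check on the numbers above, the following Python sketch works out how many pointer-sized slots the INIT_ARRAY holds and where it ends. The constants are copied from the readelf output; the 8-byte pointer size is an assumption based on this being an x86_64 library, and `init_array_layout` is a hypothetical helper name.

```python
INIT_ARRAY = 0x1AA3668   # DT_INIT_ARRAY from the readelf output
INIT_ARRAYSZ = 3176      # DT_INIT_ARRAYSZ, in bytes
FINI_ARRAY = 0x1AA42D0   # DT_FINI_ARRAY from the readelf output
POINTER_SIZE = 8         # assumption: 64-bit ELF (x86_64)


def init_array_layout(start, size_bytes, pointer_size=POINTER_SIZE):
    """Return (number of entries, end address) of the init array."""
    if size_bytes % pointer_size:
        raise ValueError("array size is not a multiple of the pointer size")
    return size_bytes // pointer_size, start + size_bytes


count, end = init_array_layout(INIT_ARRAY, INIT_ARRAYSZ)
print(count, hex(end), end == FINI_ARRAY)
```

So the array has 397 slots and ends exactly where FINI_ARRAY begins, which at least indicates the two dynamic entries are consistent with each other.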

pxl-th avatar Jul 01 '22 12:07 pxl-th

@pxl-th dumping the INIT_ARRAY contents may also be interesting, because the thing dlopen trips on is an invalid first entry in that array (determined via gdb).

jpsamaroo avatar Jul 01 '22 12:07 jpsamaroo

.init section for rocblas 4.2 (binarybuilder):

asmirnov@amdjl:~/code/rocm-bb/build/x86_64-linux-gnu-cxx11/5LJl6hOi/destdir/rocblas/lib$ readelf -x .init librocblas.so

Hex dump of section '.init':
  0x00041000 4883ec08 e89b6100 00e8ad61 00004883 H.....a....a..H.
  0x00041010 c408c3                              ...

.init_array section for rocblas 4.2 (binarybuilder):

~/code/rocm-bb/build/x86_64-linux-gnu-cxx11/5LJl6hOi/destdir/rocblas/lib$ readelf -x .init_array librocblas.so

Hex dump of section '.init_array':
  0x01aa3668 ffffffff ffffffff d0380400 00000000 .........8......

.init section for rocblas 5.0 (system-wide installation):

asmirnov@amdjl:/opt/rocm/rocblas/lib$ readelf -x .init librocblas.so

Hex dump of section '.init':
  0x114e059c 4883ec08 488b0549 23010048 85c07402 H...H..I#..H..t.
  0x114e05ac ffd04883 c408c3                     ..H....

.init_array section for rocblas 5.0 (system-wide installation):

asmirnov@amdjl:/opt/rocm/rocblas/lib$ readelf -x .init_array librocblas.so

Hex dump of section '.init_array':
  0x114e3bb0 00000000 00000000 00000000 00000000 ................
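
To make the .init_array dumps easier to read, here is a small Python sketch that turns one row of `readelf -x` hex words into 64-bit pointer values. It assumes little-endian 8-byte entries (x86_64), and `decode_pointer_row` is a hypothetical helper, not part of any tool used in this thread.

```python
import struct


def decode_pointer_row(hex_words):
    """Decode a readelf -x data row (hex words only) into 64-bit LE values."""
    raw = bytes.fromhex("".join(hex_words.split()))
    if len(raw) % 8:
        raise ValueError("row does not contain whole 8-byte entries")
    return [value for (value,) in struct.iter_unpack("<Q", raw)]


# Data columns copied from the two .init_array dumps above.
binarybuilder_row = "ffffffff ffffffff d0380400 00000000"
system_row = "00000000 00000000 00000000 00000000"

print([hex(v) for v in decode_pointer_row(binarybuilder_row)])
print([hex(v) for v in decode_pointer_row(system_row)])
```

For the binarybuilder library, the first visible entry decodes to 0xffffffffffffffff (that is, -1) and the second to 0x438d0, which lies between the INIT (0x41000) and FINI (0x6a9c80) addresses reported by readelf -d and so looks like a plausible function address.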

pxl-th avatar Jul 01 '22 13:07 pxl-th

For some reason, dumping the .init_array section via objdump gives empty results.

pxl-th avatar Jul 01 '22 13:07 pxl-th

Backtrace of gdb --args julia -e "Libc.Libdl.dlopen(\"./librocblas.so\")" from the binary builder:

(gdb) bt full
#0  0xffffffffffffffff in ?? ()
No symbol table info available.
#1  0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=3, argv=argv@entry=0x7fffffffdd48, env=env@entry=0x8468c0) at ./elf/dl-init.c:70
        j = 0
        jm = <optimized out>
        addrs = <optimized out>
        init_array = <optimized out>
        __PRETTY_FUNCTION__ = "call_init"
#2  0x00007ffff7fc9568 in call_init (env=0x8468c0, argv=0x7fffffffdd48, argc=3, l=<optimized out>) at ./elf/dl-init.c:33
        init_array = <optimized out>
        __PRETTY_FUNCTION__ = "call_init"
        j = <optimized out>
        jm = <optimized out>
        addrs = <optimized out>
#3  _dl_init (main_map=0xa53b20, argc=3, argv=0x7fffffffdd48, env=0x8468c0) at ./elf/dl-init.c:117
        preinit_array = <optimized out>
        preinit_array_size = <optimized out>
        i = <optimized out>
#4  0x00007ffff7eeac85 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:182
        old = <optimized out>
        errcode = 0
        c = {exception = 0x7fffffff9530, errcode = 0x7fffffff943c, env = {{__jmpbuf = {140737488328400, -548524226938750696, -16, 140737488327984, 3, 2147483657, 
                -548524226984888040, -548506505144125160}, __mask_was_saved = 0, __saved_mask = {__val = {0 <repeats 16 times>}}}}}
        old = <optimized out>
#5  0x00007ffff7fd0ff6 in dl_open_worker (a=0x7fffffff96d0) at ./elf/dl-open.c:808
        init_args = {new = 0xa53b20, argc = 3, argv = 0x7fffffffdd48, env = 0x8468c0}
        args = <optimized out>
        mode = -2147483639
        new = 0xa53b20
        args = <optimized out>
        mode = <optimized out>
        new = <optimized out>
        ex = <optimized out>
        err = <optimized out>
        init_args = <optimized out>
#6  dl_open_worker (a=a@entry=0x7fffffff96d0) at ./elf/dl-open.c:771
        args = 0x7fffffff96d0
        mode = <optimized out>
        new = <optimized out>
        init_args = <optimized out>
#7  0x00007ffff7eeac28 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:208
        errcode = 0
        c = {exception = 0x7fffffff96b0, errcode = 0x7fffffff95ac, env = {{__jmpbuf = {-2, -548524226938750696, -16, 140737354127944, 3, 2147483657, -548524227033122536, 
                -548506505144125160}, __mask_was_saved = 0, __saved_mask = {__val = {0 <repeats 16 times>}}}}}
        old = 0x7fffffff97b0
#8  0x00007ffff7fd134e in _dl_open (file=<optimized out>, mode=-2147483639, caller_dlopen=0x7ffff6ce5cc9 <jl_load_dynamic_library+601>, nsid=-2, argc=3, argv=<optimized out>, 
    env=0x8468c0) at ./elf/dl-open.c:883
        args = {file = 0x7fffffff9a50 "./librocblas.so", mode = -2147483639, caller_dlopen = 0x7ffff6ce5cc9 <jl_load_dynamic_library+601>, map = 0xa53b20, nsid = 0, 
          original_global_scope_pending_adds = 0, libc_already_loaded = true, worker_continue = true, argc = 3, argv = 0x7fffffffdd48, env = 0x8468c0}
        exception = {objname = 0x0, errstring = 0x7fffffff9a50 "./librocblas.so", message_buffer = 0x7fffffffaa4f ""}
        errcode = <optimized out>
        __PRETTY_FUNCTION__ = "_dl_open"
#9  0x00007ffff7e066bc in dlopen_doit (a=a@entry=0x7fffffff9940) at ./dlfcn/dlopen.c:56
        args = 0x7fffffff9940
#10 0x00007ffff7eeac28 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffff98a0, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:208
        errcode = 0
        c = {exception = 0x7fffffff98a0, errcode = 0x7fffffff97ac, env = {{__jmpbuf = {140737488328951, -548524226882127592, -16, 140737154187376, 68, 140737155548440, -548524226966013672, -548506505144125160}, __mask_was_saved = 0, __saved_mask = {__val = {140737351494352, 4287062190, 140737351489484, 10647808, 140737353934825, 1808, 140737351566928, 140737353858448, 140737488328936, 140737488328932, 12234695464575742208, 140737351566928, 140737488329296, 140737488337536, 1, 140737154187376}}}}}
        old = 0x0
#11 0x00007ffff7eeacf3 in __GI__dl_catch_error (objname=0x7fffffff98f8, errstring=0x7fffffff9900, mallocedp=0x7fffffff98f7, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:227
        exception = {objname = 0x3000000018 <error: Cannot access memory at address 0x3000000018>, errstring = 0x7fffffff9980 "", message_buffer = 0x7fffffff98c0 "\250\377\377\377\377\377\377\377"}
        errorcode = <optimized out>
#12 0x00007ffff7e061ae in _dlerror_run (operate=operate@entry=0x7ffff7e06660 <dlopen_doit>, args=args@entry=0x7fffffff9940) at ./dlfcn/dlerror.c:138
        result = <optimized out>
        objname = 0x7ffff6de2f28 <ptrhash_get+56> "H\215{\377\061\311L\215E\377H!\370H\215<"
        errstring = 0x7fffec150070 "\220\272\377\377\377\177"
        malloced = false
        errcode = <optimized out>
#13 0x00007ffff7e06748 in dlopen_implementation (dl_caller=<optimized out>, mode=<optimized out>, file=0x7fffffff9a50 "./librocblas.so") at ./dlfcn/dlopen.c:71
        args = {file = 0x7fffffff9a50 "./librocblas.so", mode = 9, new = 0x1, caller = 0x7ffff6ce5cc9 <jl_load_dynamic_library+601>}
#14 ___dlopen (file=file@entry=0x7fffffff9a50 "./librocblas.so", mode=<optimized out>) at ./dlfcn/dlopen.c:81
No locals.
#15 0x00007ffff6ce5a69 in jl_dlopen (filename=filename@entry=0x7fffffff9a50 "./librocblas.so", flags=flags@entry=68) at /buildworker/worker/package_linux64/build/src/dlload.c:123
No locals.
#16 0x00007ffff6ce5cc9 in jl_load_dynamic_library (modname=0x7fffec29c518 "./librocblas.so", flags=<optimized out>, throw_err=1) at /buildworker/worker/package_linux64/build/src/dlload.c:267
        ext = 0x7ffff6e3e159 ""
        path = "./librocblas.so\000\220\256\377\377\377\177\000\000H\335\377\377\377\177\000\000\300h\204\000\000\000\000\000\256\215\375\367\377\177", '\000' <repeats 19 times>, "Bi\000\000\000\000\000\000\277\227\242\377\177\000\000\316\302n\242\377\177\000\000@y\242", '\000' <repeats 45 times>, "\240\037\000\000\377\377\002", '\000' <repeats 105 times>...
        relocated = "\001\000\000\000\377\177\000\000\060$\245\000\000\000\000\000\060\252\377\377\377\177\000\000|l\374\367\377\177\000\000\001\000\000\000\377\177\000\000\320\036\245\000\000\000\000\000P\252\377\377\377\177\000\000|l\374\367\377\177\000\000\001\000\000\000\377\177\000\000\060\022\245\000\000\000\000\000p\252\377\377\377\177\000\000|l\374\367\377\177\000\000\001\000\000\000\000\000\000\000\020\031\245\000\000\000\000\000\220\252\377\377\377\177\000\000\003\000\000\000\000\000\000\000\000\373(\000\000\000\000\000\240\266\377\377\377\177\000\000\300\030\245\000\000\000\000\000\006\000\000\000\000\000\000\000@\000\240\233\377\177\000\000\260\252\377\377\377\177\000\000\000\000\000\000\000\000\000\000\300\253\377\377\377\177\000\000\220\251\377\377\377\177\000\000\000\000\000\000"...
        i = 0
        stbuf = {st_dev = 140737351537744, st_mode = 140737353858448, st_nlink = 0, st_uid = 0, st_gid = 0, st_rdev = 0, st_ino = 0, st_size = 0, st_blksize = 0, st_blocks = 19, st_flags = 10646368, st_gen = 140735921081752, st_atim = {tv_sec = 8677568, tv_nsec = 0}, st_mtim = {tv_sec = 0, tv_nsec = 140737353965425}, st_ctim = {tv_sec = 5, tv_nsec = 0}, st_birthtim = {tv_sec = 140737351537744, tv_nsec = 140737351756656}}
        handle = <optimized out>
        abspath = <optimized out>
        is_atpath = 0
        n_extensions = <optimized out>
#17 0x00007fffe22490dd in julia_#dlopen#3_30656 () at libdl.jl:117
No locals.
#18 0x00007fffe2248e7f in dlopen () at libdl.jl:117
No locals.
#19 julia_dlopen_30644 () at libdl.jl:117
No locals.
#20 0x00007fffe2248ef8 in jfptr_dlopen_30645.clone_1 () from /home/pxl-th/bin/julia-1.7.2/lib/julia/sys.so
No symbol table info available.
#21 0x00007ffff6cc4e0a in _jl_invoke (world=31320, mfunc=<optimized out>, nargs=1, args=0x7fffffffbc38, F=0x7fffe5778ed0 <jl_system_image_data+41475088>) at /buildworker/worker/package_linux64/build/src/gf.c:2247
        last_alloc = <optimized out>
        invoke = <optimized out>
        codeinst = <optimized out>
        last_errno = <optimized out>
        res = 0x7fff9a9cc668 <__CTOR_LIST__>
        codeinst = <optimized out>
        last_alloc = <optimized out>
        last_errno = <optimized out>
        invoke = <optimized out>
        res = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        invoke = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        res = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
#22 jl_apply_generic (F=<optimized out>, args=0x7fffffffbc38, nargs=<optimized out>) at /buildworker/worker/package_linux64/build/src/gf.c:2429
        world = 31320
        mfunc = <optimized out>
#23 0x00007ffff6ce3e96 in jl_apply (nargs=2, args=0x7fffffffbc30) at /buildworker/worker/package_linux64/build/src/julia.h:1788
No locals.
#24 do_call (args=args@entry=0x7fffec2361b8, nargs=nargs@entry=2, s=s@entry=0x7fffffffbec0) at /buildworker/worker/package_linux64/build/src/interpreter.c:126
        argv = 0x7fffffffbc30
        i = <optimized out>
        result = <optimized out>
#25 0x00007ffff6ce390e in eval_value (e=e@entry=0x7fffec29c750, s=s@entry=0x7fffffffbec0) at /buildworker/worker/package_linux64/build/src/interpreter.c:215
        src = <optimized out>
        ex = <optimized out>
        args = 0x7fffec2361b8
        nargs = 2
        head = <optimized out>
#26 0x00007ffff6ce46d2 in eval_stmt_value (s=0x7fffffffbec0, stmt=<optimized out>) at /buildworker/worker/package_linux64/build/src/interpreter.c:166
        res = <optimized out>
        res = <optimized out>
#27 eval_body (stmts=<optimized out>, s=s@entry=0x7fffffffbec0, ip=2, ip@entry=0, toplevel=toplevel@entry=1) at /buildworker/worker/package_linux64/build/src/interpreter.c:587
        head = 0x7ffff022a740
        stmt = <optimized out>
        next_ip = 3
        __eh = {eh_ctx = {{__jmpbuf = {140737154187376, 0, 140737002174304, 6933174, -548524225676265192, -548504045595870952, -548524227720904704, -548504046106003176}, __mask_was_saved = 0, __saved_mask = {__val = {140737338629216, 5, 4294967291, 0, 140737002174304, 6933174, 140737335086248, 140737155089008, 140737338629216, 140737488338496, 140737335086608, 4, 140733193388080, 140737488338576, 140737488338256, 140737488339328}}}}, gcstack = 0x0, prev = 0x0, gc_state = 1 '\001', locks_len = 6933174, defer_signal = 31320, timing_stack = 0x7ffff7135c60 <jl_ast_main_ctx>, world_age = 140737334009073}
        ns = <optimized out>
        ct = <optimized out>
#28 0x00007ffff6ce52f8 in jl_interpret_toplevel_thunk (m=m@entry=0x7fffe3057760 <jl_system_image_data+443552>, src=0x7fffec294190) at /buildworker/worker/package_linux64/build/src/interpreter.c:731
        s = 0x7fffffffbec0
        nroots = <optimized out>
        stmts = <optimized out>
        ct = 0x7fffec150010
        last_age = 31320
        r = <optimized out>
#29 0x00007ffff6d027a4 in jl_toplevel_eval_flex (m=m@entry=0x7fffe3057760 <jl_system_image_data+443552>, e=<optimized out>, fast=fast@entry=1, expanded=expanded@entry=0) at /buildworker/worker/package_linux64/build/src/toplevel.c:885
        ct = 0x7fffec150010
        ex = 0x7fffec29c610
        mfunc = 0x0
        thk = 0x7fffec294190
        __gc_stkf = {0xd, 0x7fffffffc0f0, 0x7fffffffbfb8, 0x7fffffffbfc0, 0x7fffffffbfb0}
        last_age = <optimized out>
        head = <optimized out>
        has_intrinsics = 0
        has_defs = 0
        has_loops = <optimized out>
        has_opaque = 0
        result = <optimized out>
#30 0x00007ffff6d029e5 in jl_toplevel_eval_flex (m=m@entry=0x7fffe3057760 <jl_system_image_data+443552>, e=e@entry=0x7fffec29c470, fast=fast@entry=1, expanded=expanded@entry=0) at /buildworker/worker/package_linux64/build/src/toplevel.c:830
        res = <optimized out>
        i = <optimized out>
        ct = 0x7fffec150010
        ex = 0x7fffec29c470
        mfunc = 0x0
        thk = 0x0
        __gc_stkf = {0xd, 0x7fffffffca80, 0x7fffffffc0a8, 0x7fffffffc0b0, 0x7fffffffc0a0}
        last_age = <optimized out>
        head = <optimized out>
        has_intrinsics = -15896
        has_defs = 0
        has_loops = <optimized out>
        has_opaque = -267013920
        result = <optimized out>
#31 0x00007ffff6d0450c in jl_toplevel_eval (m=m@entry=0x7fffe3057760 <jl_system_image_data+443552>, v=v@entry=0x7fffec29c470) at /buildworker/worker/package_linux64/build/src/toplevel.c:894
No locals.
#32 0x00007ffff6d0462a in jl_toplevel_eval_in (m=0x7fffe3057760 <jl_system_image_data+443552>, ex=0x7fffec29c470) at /buildworker/worker/package_linux64/build/src/toplevel.c:944
        ct = <optimized out>
        v = <optimized out>
        last_lineno = 0
        last_filename = 0x7ffff6e0e0aa "none"
        i__tr = 1
        i__ca = <optimized out>
        __eh = {eh_ctx = {{__jmpbuf = {140737488339328, -548524225458161384, 2, 140737221308424, 140737155127504, 2, -548524225552533224, -548504292363814632}, __mask_was_saved = 0, __saved_mask = {__val = {140737347929461, 140737488339600, 12234695464575742208, 140737155548272, 140737488339504, 140736997135696, 140737334107794, 12, 140736997135696, 2, 140737221308424, 140737155548272, 2, 140737488339568, 140737333441778, 140737488339536}}}}, gcstack = 0x7fffffffca80, prev = 0x7fffffffd740, gc_state = 0 '\000', locks_len = 0, defer_signal = 0, timing_stack = 0x7fffffffd640, world_age = 31320}
        __excstack_state = <optimized out>
#33 0x00007fffe2acf7e8 in eval () at boot.jl:373
No locals.
#34 julia_exec_options_33549 () at client.jl:268
No locals.
#35 0x00007fffe258a0f8 in julia__start_38731 () at client.jl:495
No locals.
#36 0x00007fffe258a269 in jfptr.start_38732.clone_1 () from /home/pxl-th/bin/julia-1.7.2/lib/julia/sys.so
No symbol table info available.
#37 0x00007ffff6cc4e0a in _jl_invoke (world=31320, mfunc=<optimized out>, nargs=0, args=0x7fffffffd990, F=0x7fffe3c527c0 <jl_system_image_data+13006080>) at /buildworker/worker/package_linux64/build/src/gf.c:2247
        last_alloc = <optimized out>
        invoke = <optimized out>
        codeinst = <optimized out>
        last_errno = <optimized out>
        res = 0x7fff9a9cc668 <__CTOR_LIST__>
        codeinst = <optimized out>
        last_alloc = <optimized out>
        last_errno = <optimized out>
        invoke = <optimized out>
        res = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        invoke = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        res = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
#38 jl_apply_generic (F=<optimized out>, args=0x7fffffffd990, nargs=<optimized out>) at /buildworker/worker/package_linux64/build/src/gf.c:2429
        world = 31320
        mfunc = <optimized out>
#39 0x00007ffff6d282d6 in jl_apply (nargs=1, args=0x7fffffffd988) at /buildworker/worker/package_linux64/build/src/julia.h:1788
No locals.
#40 true_main (argc=<optimized out>, argv=<optimized out>) at /buildworker/worker/package_linux64/build/src/jlapi.c:559
        ct = 0x7fffec150010
        last_age = 1
        i__tr = 1
        i__ca = 1
        __eh = {eh_ctx = {{__jmpbuf = {140737488345536, -548524224696895208, 0, 140737488346464, 0, 140737354125376, -548524224747226856, -548504273189816040}, __mask_was_saved = 0, __saved_mask = {__val = {17898239780003166488, 0, 140737333972490, 140737334387600, 0 <repeats 12 times>}}}}, gcstack = 0x0, prev = 0x0, gc_state = 0 '\000', locks_len = 0, defer_signal = 0, timing_stack = 0x0, world_age = 1}
        __excstack_state = <optimized out>
        start_client = 0x7fffe3c527c0 <jl_system_image_data+13006080>
#41 0x00007ffff6d28c7d in jl_repl_entrypoint (argc=<optimized out>, argv=<optimized out>) at /buildworker/worker/package_linux64/build/src/jlapi.c:701
        lisp_prompt = <optimized out>
        orig_argv = <optimized out>
        ret = <optimized out>
#42 0x00000000004007d9 in main (argc=<optimized out>, argv=<optimized out>) at /buildworker/worker/package_linux64/build/cli/loader_exe.c:42
        ret = <optimized out>

pxl-th avatar Jul 01 '22 21:07 pxl-th

@jpsamaroo does #0 0xffffffffffffffff in ?? () at the top mean what you thought, that it tries to execute -1 while it should ignore it?

pxl-th avatar Jul 01 '22 21:07 pxl-th

@pxl-th I believe that is the case: it tries to jump to the -1 address and segfaults.
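
A toy Python model of that failure mode, for illustration only: it assumes the loader walks the init array by entry count and calls every slot without skipping sentinels, which is what the j/jm/addrs locals in the call_init frame of the backtrace suggest. The names `run_init_array` and `Segfault`, and the `functions` map standing in for mapped code, are made up for this sketch.

```python
class Segfault(Exception):
    """Stand-in for the real SIGSEGV in this toy model."""


def run_init_array(entries, functions):
    """Call every entry in order; fail on the first address with no code."""
    results = []
    for index, address in enumerate(entries):
        target = functions.get(address)
        if target is None:
            raise Segfault(f"entry {index} jumps to unmapped address {address:#x}")
        results.append(target())
    return results


# Only the second value from the .init_array dump points at "code" here.
functions = {0x438D0: lambda: "ctor at 0x438d0"}

print(run_init_array([0x438D0], functions))

try:
    run_init_array([0xFFFFFFFFFFFFFFFF, 0x438D0], functions)
except Segfault as err:
    print("crash:", err)
```

In this model a leading -1 is fatal before any real constructor runs, matching frame #0 at 0xffffffffffffffff.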

jpsamaroo avatar Jul 02 '22 12:07 jpsamaroo

Ok, I think I have an idea of what the issue is. It appears that we're mixing up some conventions for how musl vs. glibc do constructors, where musl appears to use -1 as a sentinel for "end of ctors list", while glibc uses 0 for the same purpose. I have no idea why a -1 got inserted when there are ctors to run, but it must be related to link ordering, where somehow the -1 (which should be at the end to signal completion) ended up at the front. I would guess that we accidentally linked both the ctor implementation for musl and glibc (in that order probably). This is probably an issue with how I patched hipcc in HIP_jll.

What's odd is that I still see this behavior in the musl build, where I wouldn't expect to see the terminator be 0 (I would expect -1).

jpsamaroo avatar Jul 02 '22 13:07 jpsamaroo

Superseded by #5441

jpsamaroo avatar Sep 16 '22 18:09 jpsamaroo