
Use system's default rocgdb instead of AOMP's

Open saiislam opened this issue 1 year ago • 7 comments

rocgdb requires libpython.so, which is more likely to be found by the system's default rocgdb.

The one in AOMP/bin/rocgdb complains about missing libpython.so file.

saiislam avatar Mar 06 '24 19:03 saiislam

I don't think we want to test the system's rocgdb (by accident or on purpose).

jplehr avatar Mar 06 '24 22:03 jplehr

I also wonder why we wouldn't want to test the rocgdb we built and packaged?

ronlieb avatar Mar 07 '24 00:03 ronlieb

I am seeing a different complaint instead of a missing libpython:

[r6 ~]$ /COD/LATEST/aomp/bin/rocgdb
amd-dbgapi library version mismatch, got 0.70.1, need 0.71+

Seems to have started on:

[r6 ~]$ /COD/2023-12-20/aomp/bin/rocgdb
amd-dbgapi library version mismatch, got 0.70.1, need 0.71+

Works before that date:

[r6 ~]$ /COD/2023-12-19/aomp/bin/rocgdb
GNU gdb (AOMP_18.0-1) 13.2
...
For help, type "help".
Type "apropos word" to search for commands related to "word".
(gdb) q
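The date-by-date check above can be scripted. This is a minimal sketch, assuming the /COD/<YYYY-MM-DD>/aomp/bin/rocgdb layout shown in the transcripts and assuming a broken rocgdb exits non-zero even for `--version` (the thread does not confirm that). The function name `scan_builds` is made up for illustration.

```shell
# Hypothetical helper: for every dated build under a root directory,
# report whether its rocgdb starts at all.
# Assumes directories named YYYY-MM-DD, each containing aomp/bin/rocgdb.
scan_builds() {
    root="$1"
    # ISO dates sort lexicographically, so the glob walks builds oldest first.
    for gdb in "$root"/????-??-??/aomp/bin/rocgdb; do
        # Skip missing binaries (also covers a glob that matched nothing).
        [ -x "$gdb" ] || continue
        build=${gdb#"$root"/}   # strip the root prefix
        build=${build%%/*}      # keep only the date directory name
        if "$gdb" --version >/dev/null 2>&1; then
            echo "$build ok"
        else
            echo "$build FAIL"
        fi
    done
}
```

Calling `scan_builds /COD` prints one line per dated build, so the first `FAIL` line marks where the regression started.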

dpalermo avatar Mar 07 '24 00:03 dpalermo

We are actually staging python libs into COD to allow tools that are linked to specific versions of python shared objects to work. If you see a missing python lib error, paste in the exact error message and the system you saw it on.

dpalermo avatar Mar 07 '24 00:03 dpalermo

We are actually staging python libs into COD to allow tools that are linked to specific versions of python shared objects to work. If you see a missing python lib error, paste in the exact error message and the system you saw it on.

I am getting the same error regardless of whether I use the 2024-03-07 build or the 2023-12-04 build.

Note: results are on r11

/COD/2024-03-07/aomp/bin/clang++  -g -O0    -fopenmp --offload-arch=gfx90a  -D__OFFLOAD_ARCH_gfx90a__ clang-325070.cpp -o clang-325070
/COD/2024-03-07/aomp/bin/rocgdb -x doit.gdb --args ./clang-325070 0 2>&1 | tee run.log
/COD/2024-03-07/aomp/bin/rocgdb: error while loading shared libraries: libpython3.8.so.1.0: cannot open shared object file: No such file or directory
make: *** [../Makefile.rules:71: run] Error 127
/COW/2023-12-04/aomp/bin/clang++  -g -O0    -fopenmp --offload-arch=gfx90a  -D__OFFLOAD_ARCH_gfx90a__ clang-325070.cpp -o clang-325070
/COW/2023-12-04/aomp/bin/rocgdb -x doit.gdb --args ./clang-325070 0 2>&1 | tee run.log
/COW/2023-12-04/aomp/bin/rocgdb: error while loading shared libraries: libpython3.8.so.1.0: cannot open shared object file: No such file or directory
make: *** [../Makefile.rules:71: run] Error 127

saiislam avatar Mar 07 '24 18:03 saiislam

Looking back at the original thread in the 'CI OpenMP compiler daily triage group' Teams chat that motivated the staging fix, you will need to do the following on a 22.04 system:

[r11 ~]$ PYTHONHOME=/COD/LATEST/aomp/lib/python3.8 PYTHONPATH=/COD/LATEST/aomp/lib/python3.8  LD_LIBRARY_PATH=/COD/LATEST/aomp/lib /COD/LATEST/aomp/bin/rocgdb
GNU gdb (AOMP_19.0-0) 13.2
...
For help, type "help".
Type "apropos word" to search for commands related to "word".
(gdb)

Note that setting the above env vars also fixes the 'amd-dbgapi library version mismatch, got 0.70.1, need 0.71+' error now seen on 20.04 systems.

Not a "fix" so much as a workaround for running rocgdb built on an older OS.
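To avoid retyping the three variables, the workaround could be wrapped in a small shell function. This is only a sketch: the name `aomp_env` is hypothetical, the `lib/python3.8` directory is taken from the command above, and prepending to any existing LD_LIBRARY_PATH (rather than replacing it, as the command above does) is a choice made here.

```shell
# Hypothetical wrapper: run a command with the environment pointing at the
# Python and shared libraries staged inside an AOMP install.
#   $1      AOMP install root, e.g. /COD/LATEST/aomp
#   $2...   command to run, with its arguments
aomp_env() {
    aomp_root="$1"
    shift
    # The staged lib dir goes first so it wins over system copies;
    # an existing LD_LIBRARY_PATH is kept after it (assumption, see lead-in).
    env \
        PYTHONHOME="$aomp_root/lib/python3.8" \
        PYTHONPATH="$aomp_root/lib/python3.8" \
        LD_LIBRARY_PATH="$aomp_root/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
        "$@"
}
```

For example, `aomp_env /COD/LATEST/aomp /COD/LATEST/aomp/bin/rocgdb` would be the equivalent of the command line shown above when LD_LIBRARY_PATH is otherwise unset.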

Not that it helps us in this situation, but the moral of the story is don't link your product with the python shared objects. There is just no backward compatibility guaranteed (at least not building on 20.04 and running on 22.04).

dpalermo avatar Mar 07 '24 19:03 dpalermo

The 'amd-dbgapi library version mismatch, got 0.70.1, need 0.71+' error is even a problem on the same system where rocgdb was built. Without specifying LD_LIBRARY_PATH, it is picking up the library from the system /opt/rocm:

[r5 /COD/LATEST/aomp]$ ldd /COD/LATEST/aomp/bin/rocgdb | grep dbgapi
        librocm-dbgapi.so.0 => /opt/rocm-5.7.0/lib/librocm-dbgapi.so.0 (0x00007f8046edb000)

With the workaround, it picks up the staged librocm-dbgapi.so.0:

[r5 /COD/LATEST/aomp]$ PYTHONHOME=/COD/LATEST/aomp/lib/python3.8 PYTHONPATH=/COD/LATEST/aomp/lib/python3.8  LD_LIBRARY_PATH=/COD/LATEST/aomp/lib ldd /COD/LATEST/aomp/bin/rocgdb | grep dbgapi
        librocm-dbgapi.so.0 => /COD/LATEST/aomp/lib/librocm-dbgapi.so.0 (0x00007f25342d1000)

This issue feels like a cmake bug in rocgdb, as it should try to pick up shared libraries relative to its installed location first.
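The `ldd | grep dbgapi` check above could be turned into a guard that flags a rocgdb resolving its libraries outside its own install tree. A rough sketch, with the helper names `resolved_lib` and `lib_is_staged` invented for illustration; it only parses `ldd` output in the `name => path (address)` form shown in the transcripts.

```shell
# Hypothetical helper: read an ldd listing on stdin and print the path that
# the first-column library name starting with $1 resolves to.
# Lines such as "libfoo.so => not found" print nothing.
resolved_lib() {
    awk -v name="$1" \
        'index($1, name) == 1 && $2 == "=>" && $3 ~ /^\// { print $3 }'
}

# Hypothetical helper: succeed if path $1 lives under install prefix $2.
lib_is_staged() {
    case "$1" in
        "$2"/*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

Usage would look like `lib=$(ldd /COD/LATEST/aomp/bin/rocgdb | resolved_lib librocm-dbgapi)` followed by `lib_is_staged "$lib" /COD/LATEST/aomp || echo "picked up $lib from outside the install"`.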

dpalermo avatar Mar 07 '24 19:03 dpalermo