
inline static members in Kokkos 4.0 class not persistent with CUDA backend

Open kaschau opened this issue 2 years ago • 10 comments

Kokkos 4.0 changed many class members that are set by Kokkos::initialize() to inline static T members. With this change, there seems to be an issue with pybind11: these members do not persist when they are set from Python.

When using CUDA, the TileSizeProperties attribute maxThreads is set to zero, which causes an abort at the first MDRange execution.

When Kokkos::initialize() is called (from a Python-bound function), cudaProp.maxThreadsPerMultiProcessor (from here) reports 1024. However, by the time we get to the MDRange policy here, space.impl_internal_space_instance()->m_maxThreadsPerSM is 0. This causes an abort at this check here.

I only have this issue with CUDA; the OpenMP and Serial backends work fine. It has been consistent with every host/device compiler combination I have tried.

Primarily gcc 9.4.0 / intel 19.04 + CUDA 11.7

kaschau avatar Apr 20 '23 14:04 kaschau

To reproduce, I would expect any CUDA kernel to fail when Kokkos::initialize() is called from pykokkos-base and a Kokkos kernel is launched afterwards. I cannot reproduce this in C++-only Kokkos code.

kaschau avatar Apr 21 '23 14:04 kaschau

It seems like the inline static member behavior is different when Kokkos is compiled as a static versus a shared library. Because pybind11 requires PIC, generally one just compiles Kokkos as a shared library, so there are no problems when compiling pykokkos-base. However, this leads to the behavior described above (with 4.0).

However, when I compile Kokkos as static libraries with -fPIC, I am able to get Kokkos 4.0 to run on the CUDA backend.

This is well above my compiling / C++ object lifetime / instruction unit pay grade, so I am not sure what to make of it. But at least it works.

kaschau avatar May 13 '23 14:05 kaschau

hm interesting. @nliber do you have any idea what this could be? I think it is potentially the jitting of stuff where we would have inline static things inside header files? So if something gets recompiled and then relinked it might cause issues?

I wonder if this is fixable by having all inline-static variables actually be static variables inside functions which are compiled inside the Kokkos library itself. I.e. for every static int foo; make it actually static int& foo(); and have int& foo() { static int val; return val; } somewhere?

crtrott avatar May 13 '23 17:05 crtrott

@kaschau do you feel you could take this experiment on, i.e. make a branch of Kokkos Core, go through all these variables, and see if we can get this fixed that way?

crtrott avatar May 13 '23 17:05 crtrott

@crtrott I'm a C++ ignoramus but I think I can give it a shot. I think just being able to prove one variable (the tile size for example) survives this way should be doable for me, as a proof of concept.

kaschau avatar May 13 '23 19:05 kaschau

@kaschau A bit of a shot in the dark but try setting this variable to OFF and rebuild pykokkos-base:

https://github.com/kokkos/pykokkos-base/blob/94553b7e4be91b042baa9d903dc98e73722eeced/cmake/Modules/KokkosPythonOptions.cmake#L82

I suspect the reason you see this issue with shared libraries is that there is some symbol that exists in both the pykokkos-base library and the Kokkos library, and pykokkos-base is initializing its copy of the symbol instead of the one that exists in the Kokkos library. And when a static Kokkos library is used, these symbols get merged.

jrmadsen avatar May 16 '23 04:05 jrmadsen

> @kaschau do you feel you could take this experiment on, i.e. make a branch of Kokkos Core, go through all these variables, and see if we can get this fixed that way?

A potential starting place might be to use the nm command line tool and see which Kokkos variables are defined in the pykokkos-base library itself. man nm explains the codes for whether a symbol is undefined (i.e. defined in another library), defined in the text section, defined in a data section, etc. Filter out any pybind symbols and see if there are any symbols defined in both the Kokkos shared library and the pykokkos-base library that look suspicious.

jrmadsen avatar May 16 '23 04:05 jrmadsen

> @kaschau A bit of a shot in the dark but try setting this variable to OFF and rebuild pykokkos-base:
>
> https://github.com/kokkos/pykokkos-base/blob/94553b7e4be91b042baa9d903dc98e73722eeced/cmake/Modules/KokkosPythonOptions.cmake#L82
>
> I suspect the reason you see this issue with shared libraries is that there is some symbol that exists in both the pykokkos-base library and the Kokkos library, and pykokkos-base is initializing its copy of the symbol instead of the one that exists in the Kokkos library. And when a static Kokkos library is used, these symbols get merged.

@jrmadsen Tried this, still had the same issue. I will take a look at nm when I have some time. Thanks!

kaschau avatar May 16 '23 17:05 kaschau

Commit that broke pybind11: https://github.com/kokkos/kokkos/commit/1f048cfa5050149de9bb0662ebe11a6fdd86c080

And some info from valgrind (not very helpful):

==1994128== Invalid read of size 32
==1994128==    at 0x4FB9B89: __wcsncpy_avx2 (strncpy-avx2.S:306)
==1994128==    by 0x4B59439: UnknownInlinedFun (wchar2.h:146)
==1994128==    by 0x4B59439: _Py_wrealpath (fileutils.c:1996)
==1994128==    by 0x4B54A0C: _PyPathConfig_ComputeSysPath0.constprop.0 (pathconfig.c:495)
==1994128==    by 0x4B544F4: UnknownInlinedFun (main.c:575)
==1994128==    by 0x4B544F4: Py_RunMain (main.c:680)
==1994128==    by 0x4B1CF6A: Py_BytesMain (main.c:734)
==1994128==    by 0x4E7F84F: (below main) (libc_start_call_main.h:58)
==1994128==  Address 0x5ccb2a0 is 16 bytes after a block of size 176 in arena "client"

And the output from Python itself:

ExecSpace Error: MDRange tile dims exceed maximum number of threads per block - choose smaller tile dims
Backtrace:
                                                               Kokkos::Impl::save_stacktrace() [0x7efc8e28d915]
Kokkos::Impl::traceback_callstack(std::__1::basic_ostream<char, std::__1::char_traits<char>>&) [0x7efc8e280cf1]
                                                         Kokkos::Impl::host_abort(char const*) [0x7efc8e280d98]
                                                                                               [0x7efc8e4f3696]
                                                                                               [0x7efc8e4f671c]
                                                                                               [0x7efc8e4f3e43]
                                                                                               [0x7efc8e4f2679]
                                                                                               [0x7efc8e4f1ebe]
                                                                                               [0x7efc8e4f1dba]
                                                                                               [0x7efc8e4f1cde]
                                                                                               [0x7efc8e4d9030]
                                                                                               [0x7efcafa04a81]
                                                                          _PyObject_MakeTpCall [0x7efcaf9e53e4]
                                                                                               [0x7efcafa360fe]
                                                                                               [0x7efcafa1d100]
                                                                                               [0x7efcaf9e575a]
                                                                                               [0x7efc8e4d3cdb]
                                                                          _PyObject_MakeTpCall [0x7efcaf9e53e4]
                                                                      _PyEval_EvalFrameDefault [0x7efcaf9efbcb]
                                                                                               [0x7efcafaa9f6a]
                                                                               PyEval_EvalCode [0x7efcafaa997c]
                                                                                               [0x7efcafac86b3]
                                                                                               [0x7efcafac43ba]
                                                                                               [0x7efcafadadd3]
                                                                       _PyRun_SimpleFileObject [0x7efcafad9ef4]
                                                                          _PyRun_AnyFileObject [0x7efcafad8de8]
                                                                                    Py_RunMain [0x7efcafad3722]
                                                                                  Py_BytesMain [0x7efcafa9bf6b]
                                                                                               [0x7efcaf639850]
                                                                             __libc_start_main [0x7efcaf63990a]
                                                                                        _start [0x55e4512bb045]

Yaraslaut avatar May 18 '23 21:05 Yaraslaut

I was trying to figure out what is going on in my case, and something very odd is happening: if I look at the addresses of this variable here and here, they are different. The good news is that if I fetch Kokkos and pybind directly from pykokkos-base using CPM:

include(FetchContent)  # provides FetchContent_Declare / FetchContent_MakeAvailable

FetchContent_Declare(
  PyKokkosbase
  GIT_REPOSITORY https://github.com/kokkos/pykokkos-base.git
  GIT_TAG        94553b7e4be91b042baa9d903dc98e73722eeced
)
FetchContent_MakeAvailable(PyKokkosbase)
find_package(Python3 COMPONENTS Development)

# ...
pybind11_add_module(...)
target_link_libraries( ... Kokkos::kokkos)
# ...

then everything starts to work properly. By default, Kokkos 3.7 is used inside pykokkos-base; to check with Kokkos 4.0, you can update the submodule index.

Yaraslaut avatar May 20 '23 09:05 Yaraslaut