Scan kernel cleanup problems ...
Hi, I am facing an annoying bug when cleaning up the memory after the run (core-dump). The bug occurs only on PPC64le architecture (IBM power9) when running on the GPU (Tesla V100) via the PortableCL driver. Nothing indicates pyopencl is the culprit but maybe you can help me in debugging and solving the issue. This issue is the same as: https://github.com/silx-kit/silx/issues/2629 I will provide the minimalist debug code needed to reproduce, but the hardest is to get access to the computer.
if len(sys.argv)<2:
target="cpu"
else:
target = sys.argv[1]
print("target: ", target)
import logging
logging.basicConfig(level=logging.DEBUG)
import pyopencl
print(pyopencl.version.VERSION)
from silx.opencl.codec.byte_offset import ByteOffset
import numpy
numpy.random.seed(0)
size = 1024
b=ByteOffset(size, size, devicetype=target)
print(b)
print(dir(b))
print(b.ctx)
data = numpy.random.randint(0, 65000, size=size)
c = b.encode(data)
d = b.decode(c.get().tostring())
print(d, d.dtype, type(d), d.size)
print(len(c))
print(type(c), type(d))
print(abs(d.get()-data).max())
b=None
import gc
gc.collect()
c=None
gc.collect()
print("end of python")
Here is the gdb output from pyopencl (the version with cffi ... it has more interesting debug information):
INFO:silx.opencl.codec.byte_offset:increase data input buffer size to 1024
INFO:silx.opencl.codec.byte_offset:increase compressed buffer size to 7168
INFO:silx.opencl.codec.byte_offset:increase raw buffer size to 5120
[ 2732 43567 42613 ... 22893 12393 40642] int32 <class 'pyopencl.array.Array'> 1024
4144
<class 'pyopencl.array.Array'> <class 'pyopencl.array.Array'>
0
free kernels
free(): invalid next size (fast)
Thread 1 "python" received signal SIGABRT, Aborted.
0x00007ffff7d7e98c in __libc_signal_restore_set (set=0x7fffffffbfd8) at ../sysdeps/unix/sysv/linux/nptl-signals.h:80
80 ../sysdeps/unix/sysv/linux/nptl-signals.h: No such file or directory.
and the backtrace ...
#1 __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:48
#2 0x00007ffff7d80be0 in __GI_abort () at abort.c:79
#3 0x00007ffff7dd08fc in __libc_message (action=<optimized out>, fmt=<optimized out>) at ../sysdeps/posix/libc_fatal.c:181
#4 0x00007ffff7ddabb8 in malloc_printerr (str=<optimized out>) at malloc.c:5350
#5 0x00007ffff7de454c in _int_free (have_lock=0, p=0x1213de50, av=0x7ffff7f50da8 <main_arena>) at malloc.c:4213
#6 __GI___libc_free (mem=0x1213de60) at malloc.c:3124
#7 0x00007ffed7e4ba70 in ?? () from /usr/lib/powerpc64le-linux-gnu/libpocl.so.2.2.0
#8 0x00007ffed9ac6e14 in clReleaseKernel () from /usr/lib/powerpc64le-linux-gnu/libOpenCL.so.1
#9 0x00007ffed9b5b294 in _call_func<int (*)(_cl_kernel*), 0, _cl_kernel* const&> (args=..., func=<optimized out>) at src/c_wrapper/function.h:43
#10 call_tuple<int (*&)(_cl_kernel*), std::tuple<_cl_kernel* const&> > (args=..., func=<synthetic pointer>: <optimized out>) at src/c_wrapper/function.h:48
#11 ArgPack<CLArg, _cl_kernel* const>::call<__CLArgGetter, int (*)(_cl_kernel*)> (func=<optimized out>, this=<synthetic pointer>) at src/c_wrapper/function.h:110
#12 CLArgPack<_cl_kernel* const>::clcall<int (*)(_cl_kernel*)> (name=0x7ffed9b67908 "clReleaseKernel", func=<optimized out>, this=<synthetic pointer>)
at src/c_wrapper/error.h:211
#13 call_guarded_cleanup<_cl_kernel*, _cl_kernel* const&> (name=0x7ffed9b67908 "clReleaseKernel", func=<optimized out>) at src/c_wrapper/error.h:281
#14 kernel::~kernel (this=0x12173800, __in_chrg=<optimized out>) at src/c_wrapper/kernel.cpp:19
#15 0x00007ffed9b5b670 in kernel::~kernel (this=<optimized out>, __in_chrg=<optimized out>) at src/c_wrapper/kernel.cpp:17
#16 0x00007ffed9b16e54 in clobj__delete (obj=<optimized out>) at src/c_wrapper/wrap_cl.cpp:118
#17 0x00007ffed9b160d0 in _cffi_f_clobj__delete (self=<optimized out>, arg0=<_cffi_backend.CData at remote 0x7ffec03a3e90>)
at build/temp.linux-ppc64le-3.6/pyopencl._cffi.cpp:1417
#18 0x0000000010110ea4 in _PyCFunction_FastCallDict (kwargs=<optimized out>, nargs=<optimized out>, args=0x7ffed94a6380,
func_obj=<built-in method clobj__delete of CompiledLib object at remote 0x7ffed9c01ef8>) at ../Objects/methodobject.c:209
#19 _PyCFunction_FastCallKeywords (kwnames=<optimized out>, nargs=<optimized out>, stack=<optimized out>, func=<optimized out>) at ../Objects/methodobject.c:294
#20 call_function (pp_stack=0x7fffffffc818, oparg=<optimized out>, kwnames=<optimized out>) at ../Python/ceval.c:4837
#21 0x0000000010116d58 in _PyEval_EvalFrameDefault (
f=Frame 0x7ffed94a61f8, for file /home/test/venv-kieffer_system/lib/python3.6/site-packages/pyopencl/cffi_cl.py, line 243, in __del__ (self=<Kernel(ptr=<_cffi_backend.CData at remote 0x7ffec03a3e90>, _source='\n#define local_barrier() barrier(CLK_LOCAL_MEM_FENCE);\n\n#define WITHIN_KERNEL /* empty */\n#define KERNEL __kernel\n#define GLOBAL_MEM __global\n#define LOCAL_MEM __local\n#define LOCAL_MEM_ARG __local\n#define REQD_WG_SIZE(X,Y,psc_Z) __attribute__((reqd_work_group_size(X, Y, psc_Z)))\n\n#define psc_LID_0 get_local_id(0)\n#define psc_LID_1 get_local_id(1)\n#define psc_LID_2 get_local_id(2)\n\n#define psc_GID_0 get_group_id(0)\n#define psc_GID_1 get_group_id(1)\n#define psc_GID_2 get_group_id(2)\n\n#define psc_LDIM_0 get_local_size(0)\n#define psc_LDIM_1 get_local_size(1)\n#define psc_LDIM_2 get_local_size(2)\n\n#define psc_GDIM_0 get_num_groups(0)\n#define psc_GDIM_1 get_num_groups(1)\n#define psc_GDIM_2 get_num_groups(2)\n\n #if __OPENCL_C_VERSION__ < 120\n #pragma OPENCL EXTENSION cl_khr_fp64: e...(truncated), throwflag=<optimized out>) at ../Python/ceval.c:3335
#22 0x000000001010ebfc in PyEval_EvalFrameEx (throwflag=0,
f=Frame 0x7ffed94a61f8, for file /home/test/venv-kieffer_system/lib/python3.6/site-packages/pyopencl/cffi_cl.py, line 243, in __del__ (self=<Kernel(ptr=<_cffi_backend.CData at remote 0x7ffec03a3e90>, _source='\n#define local_barrier() barrier(CLK_LOCAL_MEM_FENCE);\n\n#define WITHIN_KERNEL /* empty */\n#define KERNEL __kernel\n#define GLOBAL_MEM __global\n#define LOCAL_MEM __local\n#define LOCAL_MEM_ARG __local\n#define REQD_WG_SIZE(X,Y,psc_Z) __attribute__((reqd_work_group_size(X, Y, psc_Z)))\n\n#define psc_LID_0 get_local_id(0)\n#define psc_LID_1 get_local_id(1)\n#define psc_LID_2 get_local_id(2)\n\n#define psc_GID_0 get_group_id(0)\n#define psc_GID_1 get_group_id(1)\n#define psc_GID_2 get_group_id(2)\n\n#define psc_LDIM_0 get_local_size(0)\n#define psc_LDIM_1 get_local_size(1)\n#define psc_LDIM_2 get_local_size(2)\n\n#define psc_GDIM_0 get_num_groups(0)\n#define psc_GDIM_1 get_num_groups(1)\n#define psc_GDIM_2 get_num_groups(2)\n\n #if __OPENCL_C_VERSION__ < 120\n #pragma OPENCL EXTENSION cl_khr_fp64: e...(truncated)) at ../Python/ceval.c:754
And the same with a more recent version of pyopencl (2018.2.5):
#0 0x00007ffff7d7e98c in __libc_signal_restore_set (set=0x7fffffffc498) at ../sysdeps/unix/sysv/linux/nptl-signals.h:80
#1 __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:48
#2 0x00007ffff7d80be0 in __GI_abort () at abort.c:79
#3 0x00007ffff7dd08fc in __libc_message (action=<optimized out>, fmt=<optimized out>) at ../sysdeps/posix/libc_fatal.c:181
#4 0x00007ffff7ddabb8 in malloc_printerr (str=<optimized out>) at malloc.c:5350
#5 0x00007ffff7de454c in _int_free (have_lock=0, p=0x121893c0, av=0x7ffff7f50da8 <main_arena>) at malloc.c:4213
#6 __GI___libc_free (mem=0x121893d0) at malloc.c:3124
#7 0x00007ffed7e7ba70 in ?? () from /usr/lib/powerpc64le-linux-gnu/libpocl.so.2.2.0
#8 0x00007ffff72b6e14 in clReleaseKernel () from /usr/lib/powerpc64le-linux-gnu/libOpenCL.so.1
#9 0x00007ffff737e3a8 in pyopencl::kernel::~kernel (this=0x12192760, __in_chrg=<optimized out>) at src/wrap_cl.hpp:4238
#10 std::default_delete<pyopencl::kernel>::operator() (this=<optimized out>, __ptr=0x12192760) at /usr/include/c++/7/bits/unique_ptr.h:78
#11 std::unique_ptr<pyopencl::kernel, std::default_delete<pyopencl::kernel> >::~unique_ptr (this=<optimized out>, __in_chrg=<optimized out>)
at /usr/include/c++/7/bits/unique_ptr.h:268
#12 pybind11::class_<pyopencl::kernel>::dealloc (v_h=...) at /home/test/venv-kieffer_system/include/site/python3.6/pybind11/pybind11.h:1319
#13 0x00007ffff7322930 in pybind11::detail::clear_instance (
self=<pyopencl._cl.Kernel(_source='\n#define local_barrier() barrier(CLK_LOCAL_MEM_FENCE);\n\n#define WITHIN_KERNEL /* empty */\n#define KERNEL __kernel\n#define GLOBAL_MEM __global\n#define LOCAL_MEM __local\n#define LOCAL_MEM_ARG __local\n#define REQD_WG_SIZE(X,Y,psc_Z) __attribute__((reqd_work_group_size(X, Y, psc_Z)))\n\n#define psc_LID_0 get_local_id(0)\n#define psc_LID_1 get_local_id(1)\n#define psc_LID_2 get_local_id(2)\n\n#define psc_GID_0 get_group_id(0)\n#define psc_GID_1 get_group_id(1)\n#define psc_GID_2 get_group_id(2)\n\n#define psc_LDIM_0 get_local_size(0)\n#define psc_LDIM_1 get_local_size(1)\n#define psc_LDIM_2 get_local_size(2)\n\n#define psc_GDIM_0 get_num_groups(0)\n#define psc_GDIM_1 get_num_groups(1)\n#define psc_GDIM_2 get_num_groups(2)\n\n #if __OPENCL_C_VERSION__ < 120\n #pragma OPENCL EXTENSION cl_khr_fp64: enable\n #endif\n//CL//\n#define psc_WG_SIZE 256\n\n#define psc_SCAN_EXPR(a, b, across_seg_boundary) a+b\n#define psc_INPUT_EXPR(i) (interval_sums[i])\n\n\n int compres...(truncated)) at /home/test/venv-kieffer_system/include/site/python3.6/pybind11/detail/class.h:331
#14 pybind11::detail::pybind11_object_dealloc (
self=<pyopencl._cl.Kernel(_source='\n#define local_barrier() barrier(CLK_LOCAL_MEM_FENCE);\n\n#define WITHIN_KERNEL /* empty */\n#define KERNEL __kernel\n#define GLOBAL_MEM __global\n#define LOCAL_MEM __local\n#define LOCAL_MEM_ARG __local\n#define REQD_WG_SIZE(X,Y,psc_Z) __attribute__((reqd_work_group_size(X, Y, psc_Z)))\n\n#define psc_LID_0 get_local_id(0)\n#define psc_LID_1 get_local_id(1)\n#define psc_LID_2 get_local_id(2)\n\n#define psc_GID_0 get_group_id(0)\n#define psc_GID_1 get_group_id(1)\n#define psc_GID_2 get_group_id(2)\n\n#define psc_LDIM_0 get_local_size(0)\n#define psc_LDIM_1 get_local_size(1)\n#define psc_LDIM_2 get_local_size(2)\n\n#define psc_GDIM_0 get_num_groups(0)\n#define psc_GDIM_1 get_num_groups(1)\n#define psc_GDIM_2 get_num_groups(2)\n\n #if __OPENCL_C_VERSION__ < 120\n #pragma OPENCL EXTENSION cl_khr_fp64: enable\n #endif\n//CL//\n#define psc_WG_SIZE 256\n\n#define psc_SCAN_EXPR(a, b, across_seg_boundary) a+b\n#define psc_INPUT_EXPR(i) (interval_sums[i])\n\n\n int compres...(truncated)) at /home/test/venv-kieffer_system/include/site/python3.6/pybind11/detail/class.h:351
#15 0x00000000101983e4 in clear_slots (self=<optimized out>, type=0x11e7f868) at ../Objects/typeobject.c:1039
#16 subtype_dealloc (self=<_BuiltScanKernelInfo at remote 0x7ffec031f288>) at ../Objects/typeobject.c:1196
#17 0x00000000101cc9ec in free_keys_object (keys=0x1209f5a0) at ../Objects/dictobject.c:2027
#18 dict_dealloc (mp=0x7ffec03123f0) at ../Objects/dictobject.c:2025
#19 0x0000000010197e50 in subtype_dealloc (
self=<GenericScanKernel(context=<pyopencl._cl.Context at remote 0x7ffec06307d8>, dtype=<numpy.dtype at remote 0x7ffff6fc3220>, index_dtype=<numpy.dtype at remote 0x7ffff6fc3220>, devices=[<unknown at remote 0x7ffec03e1cb0>], options=[], parsed_args=[35182224605216, <VectorArg at remote 0x7ffec031d358>, <VectorArg at remote 0x7ffec031d3c8>], first_array_idx=0, input_expr='compressed_size((i == 0) ? data[0] : (data[i] - data[i - 1]))', is_segment_start_expr=None, is_segmented=False, output_statement='\n if (prev_item == 0) { // 1st thread store compressed data size\n size[0] = last_item;\n }\n write(prev_item, (i == 0) ? data[0] : (data[i] - data[i - 1]), compressed);\n ', input_fetch_exprs=[], name_prefix='scan', code_variables={'np': <module at remote 0x7ffff7443db8>, 'dtype_to_ctype': <method at remote 0x7ffed95b2688>, 'preamble': '\n int compressed_size(int diff) {\n int abs_diff = abs(diff);\n\n if (abs_diff < 128) {\n return 1;\...(truncated)) at ../Objects/typeobject.c:1207
The kernel used is defined in: https://github.com/kif/silx/blob/master/silx/opencl/codec/byte_offset.py#L256 but the bug occurs only when the 2 scan kernels are used.
I tried reproducing this, but I discovered it appears I need silx installed. Is pip install silx enough?
Le 18/06/2019 09:17, Andreas Klöckner a écrit :
I tried reproducing this, but I discovered it appears I need silx installed. Is pip install silx enough?
silx is indeed directly installable with pip install silx
But we don't provide (yet?) wheel packages on power9. Compilation is expected to be straight forwards. Numpy and cython are needed but this should be handled by recent pip.
We have added another test-case in silx which also uses scan (array sparsification/densification) and this one does not crash.