Devito SIGSEGVs with TTI code
Hello,
I am using OpenMPI 4.1.2a1 and GCC 10.2.0 on a 120-core Zen2 system. The run receives SIGSEGV in what appears to be the C code generated by Devito (see below). Can you suggest ways to address this?
Any help would be much appreciated! Thank you
$ python -V
Python 3.8.11
$ env | grep DEVIT
DEVITO_LOGGING=INFO
DEVITO_PROFILING=advanced
DEVITO_MPI=1
DEVITO_DEVELOP=1
$ mpirun -np 1 python3 ./examples/FWI/example_tti_miket.py 20 1201 1201 601 1
...
[ccnpusc20000i8:85439:0:85439] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b176ad8bf74)
/tmp/devito-jitcache-uid244579/aa8ae30b7609109f512fe5ca97ce79dd8e8f79fd.c: [ OpExampleTti() ]
...
248 #pragma omp simd aligned(b,eps,eta,f,m0,p0:32)
249 for (int z = z_m - 4; z <= z_M + 3; z += 1)
250 {
==> 251 float r64 = -fL0(x + 8, y + 8, z + 8);
252 float r63 = -m0L0(t0, x + 8, y + 8, z + 8);
253 float r62 = -p0L0(t0, x + 8, y + 8, z + 8);
254 float r61 = etaL0(x + 8, y + 8, z + 8)*etaL0(x + 8, y + 8, z + 8);
==== backtrace (tid: 85439) ====
0 0x0000000000007b1e OpExampleTti() /tmp/devito-jitcache-uid244579/aa8ae30b7609109f512fe5ca97ce79dd8e8f79fd.c:251
1 0x00000000000069dd ffi_call_unix64() :0
2 0x0000000000006067 ffi_call_int() ffi64.c:0
3 0x0000000000012d39 _call_function_pointer() /usr/local/src/conda/python-3.8.11/Modules/_ctypes/callproc.c:921
4 0x0000000000012d39 _ctypes_callproc() /usr/local/src/conda/python-3.8.11/Modules/_ctypes/callproc.c:1264
5 0x0000000000013708 PyCFuncPtr_call() /usr/local/src/conda/python-3.8.11/Modules/_ctypes/_ctypes.c:4201
6 0x0000000000137c5d PyObject_Call() /tmp/build/80754af9/python-split_1628000493704/work/Objects/call.c:246
7 0x00000000001d7abe do_call_core() /tmp/build/80754af9/python-split_1628000493704/work/Python/ceval.c:5010
8 0x00000000001d7abe _PyEval_EvalFrameDefault() /tmp/build/80754af9/python-split_1628000493704/work/Python/ceval.c:3559
9 0x00000000001ccf72 PyEval_EvalFrameEx() /tmp/build/80754af9/python-split_1628000493704/work/Python/ceval.c:741
10 0x00000000001cda44 _PyFunction_Vectorcall() /tmp/build/80754af9/python-split_1628000493704/work/Objects/call.c:436
11 0x000000000016690e _PyObject_Vectorcall() /tmp/build/80754af9/python-split_1628000493704/work/Include/cpython/abstract.h:127
12 0x000000000016690e _Py_CheckFunctionResult() /tmp/build/80754af9/python-split_1628000493704/work/Objects/call.c:25
13 0x000000000016690e _PyObject_Vectorcall() /tmp/build/80754af9/python-split_1628000493704/work/Include/cpython/abstract.h:128
14 0x000000000016690e method_vectorcall() /tmp/build/80754af9/python-split_1628000493704/work/Objects/classobject.c:60
15 0x00000000001d7159 _PyObject_Vectorcall() /tmp/build/80754af9/python-split_1628000493704/work/Include/cpython/abstract.h:127
16 0x00000000001d7159 _Py_CheckFunctionResult() /tmp/build/80754af9/python-split_1628000493704/work/Objects/call.c:25
17 0x00000000001d7159 _PyObject_Vectorcall() /tmp/build/80754af9/python-split_1628000493704/work/Include/cpython/abstract.h:128
18 0x00000000001d7159 call_function() /tmp/build/80754af9/python-split_1628000493704/work/Python/ceval.c:4963
19 0x00000000001d7159 _PyEval_EvalFrameDefault() /tmp/build/80754af9/python-split_1628000493704/work/Python/ceval.c:3515
20 0x00000000001cc480 PyEval_EvalFrameEx() /tmp/build/80754af9/python-split_1628000493704/work/Python/ceval.c:741
21 0x00000000001cdd33 PyEval_EvalCodeEx() /tmp/build/80754af9/python-split_1628000493704/work/Python/ceval.c:4327
22 0x00000000002414a2 run_eval_code_obj() /tmp/build/80754af9/python-split_1628000493704/work/Python/pythonrun.c:1166
23 0x0000000000252292 run_mod() /tmp/build/80754af9/python-split_1628000493704/work/Python/pythonrun.c:1188
24 0x0000000000252292 run_mod() /tmp/build/80754af9/python-split_1628000493704/work/Python/pythonrun.c:1189
25 0x000000000025542b pyrun_file() /tmp/build/80754af9/python-split_1628000493704/work/Python/pythonrun.c:1085
26 0x000000000025560f pyrun_simple_file() /tmp/build/80754af9/python-split_1628000493704/work/Python/pythonrun.c:439
27 0x000000000025560f PyRun_SimpleFileExFlags() /tmp/build/80754af9/python-split_1628000493704/work/Python/pythonrun.c:472
28 0x0000000000255ae9 pymain_run_file() /tmp/build/80754af9/python-split_1628000493704/work/Modules/main.c:391
29 0x0000000000255ae9 _Py_XDECREF() /tmp/build/80754af9/python-split_1628000493704/work/Include/object.h:541
30 0x0000000000255ae9 pymain_run_file() /tmp/build/80754af9/python-split_1628000493704/work/Modules/main.c:392
31 0x0000000000255ae9 pymain_run_python() /tmp/build/80754af9/python-split_1628000493704/work/Modules/main.c:616
32 0x0000000000255ae9 Py_RunMain() /tmp/build/80754af9/python-split_1628000493704/work/Modules/main.c:695
33 0x0000000000255ce9 Py_BytesMain() /tmp/build/80754af9/python-split_1628000493704/work/Modules/main.c:1127
34 0x0000000000022555 __libc_start_main() ???:0
35 0x00000000001f7847 _start() ???:0
=================================
[ccnpusc20000i8:85439] *** Process received signal ***
[ccnpusc20000i8:85439] Signal: Segmentation fault (11)
[ccnpusc20000i8:85439] Signal code: (-6)
[ccnpusc20000i8:85439] Failing at address: 0x3bb6300014dbf
[ccnpusc20000i8:85439] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2b120d7f5630]
[ccnpusc20000i8:85439] [ 1] /tmp/devito-jitcache-uid244579/aa8ae30b7609109f512fe5ca97ce79dd8e8f79fd.so(OpExampleTti+0x69be)[0x2b1921588b1e]
[ccnpusc20000i8:85439] [ 2] /home/U/software/Anaconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd)[0x2b120cef39dd]
[ccnpusc20000i8:85439] [ 3] /home/U/software/Anaconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067)[0x2b120cef3067]
[ccnpusc20000i8:85439] [ 4] /home/U/software/Anaconda/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319)[0x2b120d0b9d39]
[ccnpusc20000i8:85439] [ 5] /home/U/software/Anaconda/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13708)[0x2b120d0ba708]
[ccnpusc20000i8:85439] [ 6] python3(PyObject_Call+0x45d)[0x563e5d015c5d]
[ccnpusc20000i8:85439] [ 7] python3(_PyEval_EvalFrameDefault+0x1f0e)[0x563e5d0b5abe]
[ccnpusc20000i8:85439] [ 8] python3(_PyEval_EvalCodeWithName+0xd52)[0x563e5d0aaf72]
[ccnpusc20000i8:85439] [ 9] python3(_PyFunction_Vectorcall+0x594)[0x563e5d0aba44]
[ccnpusc20000i8:85439] [10] python3(+0x16690e)[0x563e5d04490e]
[ccnpusc20000i8:85439] [11] python3(_PyEval_EvalFrameDefault+0x15a9)[0x563e5d0b5159]
[ccnpusc20000i8:85439] [12] python3(_PyEval_EvalCodeWithName+0x260)[0x563e5d0aa480]
[ccnpusc20000i8:85439] [13] python3(PyEval_EvalCode+0x23)[0x563e5d0abd33]
[ccnpusc20000i8:85439] [14] python3(+0x2414a2)[0x563e5d11f4a2]
[ccnpusc20000i8:85439] [15] python3(+0x252292)[0x563e5d130292]
[ccnpusc20000i8:85439] [16] python3(+0x25542b)[0x563e5d13342b]
[ccnpusc20000i8:85439] [17] python3(PyRun_SimpleFileExFlags+0x1bf)[0x563e5d13360f]
[ccnpusc20000i8:85439] [18] python3(Py_RunMain+0x3a9)[0x563e5d133ae9]
[ccnpusc20000i8:85439] [19] python3(Py_BytesMain+0x39)[0x563e5d133ce9]
[ccnpusc20000i8:85439] [20] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b120da24555]
[ccnpusc20000i8:85439] [21] python3(+0x1f7847)[0x563e5d0d5847]
[ccnpusc20000i8:85439] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ccnpusc20000i8 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Does it fail without MPI? (I see you're using just one rank.) I think it should, given the error message.
Does it fail on CPUs, or only on GPUs?
Try again with all performance optimizations disabled (pass opt='noop' to Operator).
Are you using padding by any chance?
Anyway, this is an out-of-bounds array access. The only way I can practically help is if you write an MFE. Can you follow the instructions here and write one? Note that the MFE really needs to be minimal -- ie just a bunch of Python lines (typically 10-15 are enough) that trigger the issue once the Operator is run. Ideally, this is reproducible without GPUs and without MPI.
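A sketch of what such an MFE might look like, assuming a simple placeholder update rather than the TTI equations from this issue. The function name run_mfe, the grid shape, and the stencil are hypothetical. The Devito import sits inside the function so that nothing is compiled until the function is called.

```python
def run_mfe(shape=(16, 16, 16), opt="noop"):
    """Build and run a tiny Operator; returns the TimeFunction that was updated."""
    # Imported here so the snippet does nothing until it is called.
    from devito import Grid, TimeFunction, Eq, Operator

    grid = Grid(shape=shape)

    # space_order=8 mirrors the 8-point halo visible in the generated code above.
    u = TimeFunction(name="u", grid=grid, space_order=8)
    u.data[:] = 1.0

    # Placeholder update (hypothetical), NOT the TTI equations from this issue.
    eq = Eq(u.forward, u + 0.1 * u.laplace)

    # opt='noop' disables the performance optimizations, as suggested above.
    op = Operator([eq], opt=opt)
    op.apply(time_M=2)
    return u
```

Calling run_mfe() with the default arguments, then again with opt='advanced', is one way to check whether the optimization passes are involved.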
Thanks Fabio!
Yes, I am using padding. What is "weird" but telling is that running the same models with the same parameters sometimes works and other times crashes as above. This hints at memory corruption at run time, likely due to the code accessing memory locations that it is not supposed to.
I will try to put together an MFE.
I also recently started getting these messages, even though I haven't touched these parts of Devito:
OMP_NUM_THREADS=4 mpirun -x UCX_NET_DEVICES=mlx5_ib0:1 -n 1 --map-by L3cache:PE=4 python3 ./examples/FWI/example_tti.py 10 550 550 210 1
nt,nx,ny,nz,block; 10 550 550 210 1
nt,nx,ny,nz; 10 550 550 210
Operator `assign` ran in 0.04 s
Operator `assign` ran in 0.02 s
Operator `assign` ran in 0.02 s
Operator `assign` ran in 0.02 s
Operator `assign` ran in 0.02 s
Operator `assign` ran in 0.02 s
Operator `assign` ran in 0.02 s
Operator `assign` ran in 0.02 s
Traceback (most recent call last):
File "./examples/FWI/example_tti.py", line 56, in <module>
src = RickerSource(name='src', grid=grid, f0=fpeak, npoint=1, time_range=time_axis)
File "/home/mtml/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/devito/types/basic.py", line 705, in __new__
newobj.__init_finalize__(*args, **kwargs)
File "/home/mtml/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/examples/seismic/source.py", line 216, in __init_finalize__
self.data[:, p] = self.wavelet
File "/home/mtml/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/devito/finite_differences/differentiable.py", line 149, in __getattr__
raise AttributeError("%r object has no attribute %r" % (self.__class__, name))
AttributeError: src object has no attribute 'data'
Thanks!
The above issue is purely on CPUs: AMD Zen2 and Zen3 (all 120-core Azure VMs).
Sorry to piggy-back on this :), but how can I make Devito use a different toolchain to compile things? By default it picks up GNU, but I could use Intel (e.g. 2021.03) or AOCC from AMD.
thanks ...
Running outside the MPI stack it generates the following:
$ python3 ./examples/FWI/example_tti_miket.py 10 550 550 210 1
nt,nx,ny,nz,block; 10 550 550 210 1
## nt,nx,ny,nz; 10 550 550 210
## AFID=TEST
Operator `assign` ran in 0.20 s
Operator `assign` ran in 0.26 s
Operator `assign` ran in 0.24 s
Operator `assign` ran in 0.24 s
Operator `assign` ran in 0.26 s
Operator `assign` ran in 0.26 s
Operator `assign` ran in 0.24 s
Operator `assign` ran in 0.24 s
Traceback (most recent call last):
File "./examples/FWI/example_tti_miket.py", line 80, in <module>
src = RickerSource(name='src', grid=grid, f0=fpeak, npoint=1, time_range=time_axis)
File "/home/U/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/devito/types/basic.py", line 705, in __new__
newobj.__init_finalize__(*args, **kwargs)
File "/home/U/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/examples/seismic/source.py", line 216, in __init_finalize__
self.data[:, p] = self.wavelet
File "/home/U/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/devito/finite_differences/differentiable.py", line 149, in __getattr__
raise AttributeError("%r object has no attribute %r" % (self.__class__, name))
AttributeError: src object has no attribute 'data'
Padding doesn't work with the linearize pass yet, which you are using AFAICT. That should be the cause of your SIGSEGV.
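To illustrate why a mismatch between padded storage and flattened (linearized) indexing can end in an out-of-bounds access, here is a minimal pure-Python sketch. It is not Devito's generated code: the shapes, the 4-point padding in z, and the helper flat_index are hypothetical, and the actual failure mode in the linearize pass may differ.

```python
def flat_index(i, j, k, shape):
    """Row-major flat offset of element (i, j, k) in an array of `shape`."""
    _, ny, nz = shape
    return (i * ny + j) * nz + k


logical_shape = (4, 4, 4)   # hypothetical: the points the user asked for
padded_shape = (4, 4, 8)    # hypothetical: same array with 4 extra points in z

# Suppose the buffer was sized from the logical shape...
buffer_len = logical_shape[0] * logical_shape[1] * logical_shape[2]

# ...but offsets are computed with strides taken from the padded shape.
last = (3, 3, 3)
good = flat_index(*last, logical_shape)
bad = flat_index(*last, padded_shape)

print(good, good < buffer_len)   # offset stays inside the buffer
print(bad, bad < buffer_len)     # offset lands past the end of the buffer
```

Whether a read past the end actually faults depends on what happens to be mapped there, which is consistent with a crash that shows up only on some runs.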
Sorry to piggy-back on this :), but how can I make Devito use a different toolchain to compile things? By default it picks up GNU, but I could use Intel (e.g. 2021.03) or AOCC from AMD.
Take a look at this: https://github.com/devitocodes/devito/wiki/FAQ#devito_arch
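As a rough sketch of what that FAQ entry describes, the compiler is selected through the DEVITO_ARCH environment variable. The helper env_with_compiler and the value "icc" are assumptions for illustration; check the FAQ for the values your Devito version accepts. The sketch also assumes the variable is read when Devito is imported, so it has to be set before that.

```python
import os


def env_with_compiler(arch, base=None):
    """Return a copy of an environment mapping with DEVITO_ARCH set to `arch`."""
    env = dict(os.environ if base is None else base)
    env["DEVITO_ARCH"] = arch
    return env


# Hypothetical example: start from the logging setting used in this thread
# and request the Intel compiler ("icc" is an assumed value, see the FAQ).
env = env_with_compiler("icc", base={"DEVITO_LOGGING": "INFO"})
print(env)
```

The resulting mapping can be passed as env= to subprocess.run when launching the example script, or the variable can simply be exported in the shell before running python3.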
Let me know if the segfault disappears once you remove linearize, in which case I'll close this issue.
I set linearize to False:
...
./examples/FWI/example_tti_miket.py:op = Operator([stencil_p, stencil_m, src_term], opt=('advanced', {'min-storage': True, 'linearize': False}), subs=spacing_map, name='OpExampleTti')
...
but it is unfortunately giving me
"/home/U/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/devito/finite_differences/differentiable.py", line 149, in __getattr__
raise AttributeError("%r object has no attribute %r" % (self.__class__, name))
AttributeError: src object has no attribute 'data'
This means that, for some reason, it could not allocate or initialize that object, correct?
See below.
$ python3 ./examples/FWI/example_tti.py 10 550 550 210 1
nt,nx,ny,nz,block; 10 550 550 210 1
nt,nx,ny,nz; 10 550 550 210
Operator `assign` ran in 0.20 s
Operator `assign` ran in 0.26 s
Operator `assign` ran in 0.26 s
Operator `assign` ran in 0.24 s
Operator `assign` ran in 0.24 s
Operator `assign` ran in 0.24 s
Operator `assign` ran in 0.24 s
Operator `assign` ran in 0.24 s
Traceback (most recent call last):
File "./examples/FWI/example_tti.py", line 56, in <module>
src = RickerSource(name='src', grid=grid, f0=fpeak, npoint=1, time_range=time_axis)
File "/home/U/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/devito/types/basic.py", line 705, in __new__
newobj.__init_finalize__(*args, **kwargs)
File "/home/U/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/examples/seismic/source.py", line 216, in __init_finalize__
self.data[:, p] = self.wavelet
File "/home/U/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/devito/finite_differences/differentiable.py", line 149, in __getattr__
raise AttributeError("%r object has no attribute %r" % (self.__class__, name))
AttributeError: src object has no attribute 'data'
Can you suggest any way to get the code past this error? I cannot put together the MFE. Should I clear any Python caching? Thanks ...
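As a general Python note on the message above: when a class defines __getattr__ and a property getter raises AttributeError internally, Python falls back to __getattr__ and reports the property itself as missing, which hides the original error. Whether that is what happens inside RickerSource here is an assumption. The classes Base and Source below are hypothetical stand-ins, not Devito code.

```python
class Base:
    def __getattr__(self, name):
        # Called only after normal attribute lookup raised AttributeError.
        raise AttributeError("%r object has no attribute %r" % (self.__class__, name))


class Source(Base):
    @property
    def data(self):
        # Hypothetical internal failure while setting up the data.
        raise AttributeError("internal setup failed")


try:
    Source().data
except AttributeError as exc:
    message = str(exc)

# The message names 'data'; the original "internal setup failed" text is lost.
print(message)
```

In this sketch, calling the getter directly with type(obj).data.fget(obj) surfaces the original exception instead of the fallback message.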
I am attaching a .TXT file with the stderr/stdout from running one of the standard example scripts and a ZIP file with the two JIT *.c files Devito generated.
Note that this is not a TTI code, but it should be enough to demonstrate the issues.
Please let me know if this MFE is sufficient.
Thank you! Michael
The fact that you've set linearize=False definitely doesn't imply this error:
"/home/U/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/devito/finite_differences/differentiable.py", line 149, in __getattr__
raise AttributeError("%r object has no attribute %r" % (self.__class__, name))
AttributeError: src object has no attribute 'data'
You must have introduced some other changes somewhere else. Or, at least, I can't really see how that would be possible.
I'm afraid that without an MFE there's virtually nothing I can do. You should penetrate the abstractions (that is, look inside the examples), change the equations/operator, and try writing that MFE.
#1771 might be the cause of the original segfault
@drmichaelt7777 Is this still an issue with the latest version of Devito? If yes, would it be possible to get an MFE?
Thank you for your response. I am in the process of switching my Python environment out of Conda. Do you have a recommendation for a non-Conda Python environment?
thanks! Michael
Can we close this? I think this should have been fixed by now @drmichaelt7777
Yes, please close it.
Thanks!