Devito SIGSEGVs with TTI code

Open drmichaelt7777 opened this issue 2 years ago • 11 comments

Hello,

I am using OpenMPI 4.1.2a1 and GCC 10.2.0 on a 120-core Zen2 system. The run receives SIGSEGV in what appears to be the C code generated by Devito (see below). Can you suggest ways to address this?

Any help would be much appreciated! Thank you

$ python -V
Python 3.8.11

$ env | grep DEVIT
DEVITO_LOGGING=INFO
DEVITO_PROFILING=advanced
DEVITO_MPI=1
DEVITO_DEVELOP=1

$ mpirun -np 1 python3 ./examples/FWI/example_tti_miket.py  20 1201 1201 601 1
...
[ccnpusc20000i8:85439:0:85439] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b176ad8bf74)

/tmp/devito-jitcache-uid244579/aa8ae30b7609109f512fe5ca97ce79dd8e8f79fd.c: [ OpExampleTti() ]
      ...
      248             #pragma omp simd aligned(b,eps,eta,f,m0,p0:32)
      249             for (int z = z_m - 4; z <= z_M + 3; z += 1)
      250             {
==>   251               float r64 = -fL0(x + 8, y + 8, z + 8);
      252               float r63 = -m0L0(t0, x + 8, y + 8, z + 8);
      253               float r62 = -p0L0(t0, x + 8, y + 8, z + 8);
      254               float r61 = etaL0(x + 8, y + 8, z + 8)*etaL0(x + 8, y + 8, z + 8);

==== backtrace (tid:  85439) ====
 0 0x0000000000007b1e OpExampleTti()  /tmp/devito-jitcache-uid244579/aa8ae30b7609109f512fe5ca97ce79dd8e8f79fd.c:251
 1 0x00000000000069dd ffi_call_unix64()  :0
 2 0x0000000000006067 ffi_call_int()  ffi64.c:0
 3 0x0000000000012d39 _call_function_pointer()  /usr/local/src/conda/python-3.8.11/Modules/_ctypes/callproc.c:921
 4 0x0000000000012d39 _ctypes_callproc()  /usr/local/src/conda/python-3.8.11/Modules/_ctypes/callproc.c:1264
 5 0x0000000000013708 PyCFuncPtr_call()  /usr/local/src/conda/python-3.8.11/Modules/_ctypes/_ctypes.c:4201
 6 0x0000000000137c5d PyObject_Call()  /tmp/build/80754af9/python-split_1628000493704/work/Objects/call.c:246
 7 0x00000000001d7abe do_call_core()  /tmp/build/80754af9/python-split_1628000493704/work/Python/ceval.c:5010
 8 0x00000000001d7abe _PyEval_EvalFrameDefault()  /tmp/build/80754af9/python-split_1628000493704/work/Python/ceval.c:3559
 9 0x00000000001ccf72 PyEval_EvalFrameEx()  /tmp/build/80754af9/python-split_1628000493704/work/Python/ceval.c:741
10 0x00000000001cda44 _PyFunction_Vectorcall()  /tmp/build/80754af9/python-split_1628000493704/work/Objects/call.c:436
11 0x000000000016690e _PyObject_Vectorcall()  /tmp/build/80754af9/python-split_1628000493704/work/Include/cpython/abstract.h:127
12 0x000000000016690e _Py_CheckFunctionResult()  /tmp/build/80754af9/python-split_1628000493704/work/Objects/call.c:25
13 0x000000000016690e _PyObject_Vectorcall()  /tmp/build/80754af9/python-split_1628000493704/work/Include/cpython/abstract.h:128
14 0x000000000016690e method_vectorcall()  /tmp/build/80754af9/python-split_1628000493704/work/Objects/classobject.c:60
15 0x00000000001d7159 _PyObject_Vectorcall()  /tmp/build/80754af9/python-split_1628000493704/work/Include/cpython/abstract.h:127
16 0x00000000001d7159 _Py_CheckFunctionResult()  /tmp/build/80754af9/python-split_1628000493704/work/Objects/call.c:25
17 0x00000000001d7159 _PyObject_Vectorcall()  /tmp/build/80754af9/python-split_1628000493704/work/Include/cpython/abstract.h:128
18 0x00000000001d7159 call_function()  /tmp/build/80754af9/python-split_1628000493704/work/Python/ceval.c:4963
19 0x00000000001d7159 _PyEval_EvalFrameDefault()  /tmp/build/80754af9/python-split_1628000493704/work/Python/ceval.c:3515
20 0x00000000001cc480 PyEval_EvalFrameEx()  /tmp/build/80754af9/python-split_1628000493704/work/Python/ceval.c:741
21 0x00000000001cdd33 PyEval_EvalCodeEx()  /tmp/build/80754af9/python-split_1628000493704/work/Python/ceval.c:4327
22 0x00000000002414a2 run_eval_code_obj()  /tmp/build/80754af9/python-split_1628000493704/work/Python/pythonrun.c:1166
23 0x0000000000252292 run_mod()  /tmp/build/80754af9/python-split_1628000493704/work/Python/pythonrun.c:1188
24 0x0000000000252292 run_mod()  /tmp/build/80754af9/python-split_1628000493704/work/Python/pythonrun.c:1189
25 0x000000000025542b pyrun_file()  /tmp/build/80754af9/python-split_1628000493704/work/Python/pythonrun.c:1085
26 0x000000000025560f pyrun_simple_file()  /tmp/build/80754af9/python-split_1628000493704/work/Python/pythonrun.c:439
27 0x000000000025560f PyRun_SimpleFileExFlags()  /tmp/build/80754af9/python-split_1628000493704/work/Python/pythonrun.c:472
28 0x0000000000255ae9 pymain_run_file()  /tmp/build/80754af9/python-split_1628000493704/work/Modules/main.c:391
29 0x0000000000255ae9 _Py_XDECREF()  /tmp/build/80754af9/python-split_1628000493704/work/Include/object.h:541
30 0x0000000000255ae9 pymain_run_file()  /tmp/build/80754af9/python-split_1628000493704/work/Modules/main.c:392
31 0x0000000000255ae9 pymain_run_python()  /tmp/build/80754af9/python-split_1628000493704/work/Modules/main.c:616
32 0x0000000000255ae9 Py_RunMain()  /tmp/build/80754af9/python-split_1628000493704/work/Modules/main.c:695
33 0x0000000000255ce9 Py_BytesMain()  /tmp/build/80754af9/python-split_1628000493704/work/Modules/main.c:1127
34 0x0000000000022555 __libc_start_main()  ???:0
35 0x00000000001f7847 _start()  ???:0
=================================
[ccnpusc20000i8:85439] *** Process received signal ***
[ccnpusc20000i8:85439] Signal: Segmentation fault (11)
[ccnpusc20000i8:85439] Signal code:  (-6)
[ccnpusc20000i8:85439] Failing at address: 0x3bb6300014dbf
[ccnpusc20000i8:85439] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2b120d7f5630]
[ccnpusc20000i8:85439] [ 1] /tmp/devito-jitcache-uid244579/aa8ae30b7609109f512fe5ca97ce79dd8e8f79fd.so(OpExampleTti+0x69be)[0x2b1921588b1e]
[ccnpusc20000i8:85439] [ 2] /home/U/software/Anaconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd)[0x2b120cef39dd]
[ccnpusc20000i8:85439] [ 3] /home/U/software/Anaconda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067)[0x2b120cef3067]
[ccnpusc20000i8:85439] [ 4] /home/U/software/Anaconda/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319)[0x2b120d0b9d39]
[ccnpusc20000i8:85439] [ 5] /home/U/software/Anaconda/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13708)[0x2b120d0ba708]
[ccnpusc20000i8:85439] [ 6] python3(PyObject_Call+0x45d)[0x563e5d015c5d]
[ccnpusc20000i8:85439] [ 7] python3(_PyEval_EvalFrameDefault+0x1f0e)[0x563e5d0b5abe]
[ccnpusc20000i8:85439] [ 8] python3(_PyEval_EvalCodeWithName+0xd52)[0x563e5d0aaf72]
[ccnpusc20000i8:85439] [ 9] python3(_PyFunction_Vectorcall+0x594)[0x563e5d0aba44]
[ccnpusc20000i8:85439] [10] python3(+0x16690e)[0x563e5d04490e]
[ccnpusc20000i8:85439] [11] python3(_PyEval_EvalFrameDefault+0x15a9)[0x563e5d0b5159]
[ccnpusc20000i8:85439] [12] python3(_PyEval_EvalCodeWithName+0x260)[0x563e5d0aa480]
[ccnpusc20000i8:85439] [13] python3(PyEval_EvalCode+0x23)[0x563e5d0abd33]
[ccnpusc20000i8:85439] [14] python3(+0x2414a2)[0x563e5d11f4a2]
[ccnpusc20000i8:85439] [15] python3(+0x252292)[0x563e5d130292]
[ccnpusc20000i8:85439] [16] python3(+0x25542b)[0x563e5d13342b]
[ccnpusc20000i8:85439] [17] python3(PyRun_SimpleFileExFlags+0x1bf)[0x563e5d13360f]
[ccnpusc20000i8:85439] [18] python3(Py_RunMain+0x3a9)[0x563e5d133ae9]
[ccnpusc20000i8:85439] [19] python3(Py_BytesMain+0x39)[0x563e5d133ce9]
[ccnpusc20000i8:85439] [20] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b120da24555]
[ccnpusc20000i8:85439] [21] python3(+0x1f7847)[0x563e5d0d5847]
[ccnpusc20000i8:85439] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ccnpusc20000i8 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

drmichaelt7777 avatar Sep 28 '21 20:09 drmichaelt7777

Does it fail without MPI? (I see you're using just one rank.) I think it should, given the error message.

Does it fail on CPUs, or does it only fail on GPUs?

Try again with all performance optimizations disabled, by passing opt='noop' to Operator.

Are you using padding by any chance?

Anyway, this is an out-of-bounds array access. The only way I can practically help is if you write an MFE. Can you follow the instructions here and write one? Note that the MFE really needs to be minimal, i.e. just a handful of Python lines (typically 10-15 are enough) that trigger the issue once the Operator is run. Ideally, it is reproducible without GPUs and without MPI.

FabioLuporini avatar Sep 29 '21 07:09 FabioLuporini

Thanks Fabio!

Yes, I am using padding. What is "weird" but telling is that running the same models with the same parameters sometimes works and other times crashes as above. This hints at memory corruption at run time, likely due to the code accessing memory locations that it is not supposed to.

I will try to put together an MFE.

I also recently started getting these messages, even though I haven't touched these parts of Devito:

OMP_NUM_THREADS=4 mpirun -x UCX_NET_DEVICES=mlx5_ib0:1  -n  1  --map-by L3cache:PE=4 python3 ./examples/FWI/example_tti.py  10 550 550 210  1
nt,nx,ny,nz,block;    10   550   550   210 1
nt,nx,ny,nz;  10 550 550 210
Operator `assign` ran in 0.04 s
Operator `assign` ran in 0.02 s
Operator `assign` ran in 0.02 s
Operator `assign` ran in 0.02 s
Operator `assign` ran in 0.02 s
Operator `assign` ran in 0.02 s
Operator `assign` ran in 0.02 s
Operator `assign` ran in 0.02 s
Traceback (most recent call last):
  File "./examples/FWI/example_tti.py", line 56, in <module>
    src = RickerSource(name='src', grid=grid, f0=fpeak, npoint=1, time_range=time_axis)
  File "/home/mtml/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/devito/types/basic.py", line 705, in __new__
    newobj.__init_finalize__(*args, **kwargs)
  File "/home/mtml/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/examples/seismic/source.py", line 216, in __init_finalize__
    self.data[:, p] = self.wavelet
  File "/home/mtml/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/devito/finite_differences/differentiable.py", line 149, in __getattr__
    raise AttributeError("%r object has no attribute %r" % (self.__class__, name))
AttributeError: src object has no attribute 'data'

Thanks!

drmichaelt7777 avatar Sep 29 '21 15:09 drmichaelt7777

The above issue is purely on CPUs: AMD Zen2 and Zen3 (all 120-core Azure VMs).

Sorry to piggy-back on this :), but how can I let Devito use a different toolchain to compile things? By default it picks up GNU, but I could use Intel (e.g. 2021.03) or AOCC from AMD.

thanks ...

drmichaelt7777 avatar Sep 29 '21 15:09 drmichaelt7777

Running outside the MPI stack it generates the following:

$ python3 ./examples/FWI/example_tti_miket.py  10 550 550 210  1

nt,nx,ny,nz,block;    10   550   550   210 1
## nt,nx,ny,nz;  10 550 550 210
## AFID=TEST

Operator `assign` ran in 0.20 s
Operator `assign` ran in 0.26 s
Operator `assign` ran in 0.24 s
Operator `assign` ran in 0.24 s
Operator `assign` ran in 0.26 s
Operator `assign` ran in 0.26 s
Operator `assign` ran in 0.24 s
Operator `assign` ran in 0.24 s
Traceback (most recent call last):
  File "./examples/FWI/example_tti_miket.py", line 80, in <module>
    src = RickerSource(name='src', grid=grid, f0=fpeak, npoint=1, time_range=time_axis)
  File "/home/U/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/devito/types/basic.py", line 705, in __new__
    newobj.__init_finalize__(*args, **kwargs)
  File "/home/U/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/examples/seismic/source.py", line 216, in __init_finalize__
    self.data[:, p] = self.wavelet
  File "/home/U/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/devito/finite_differences/differentiable.py", line 149, in __getattr__
    raise AttributeError("%r object has no attribute %r" % (self.__class__, name))
AttributeError: src object has no attribute 'data'

drmichaelt7777 avatar Sep 29 '21 15:09 drmichaelt7777

Padding doesn't work with the linearize pass yet, which you are using AFAICT. That should be the cause of your SIGSEGV.
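For readers unfamiliar with the interaction: linearization turns multi-dimensional accesses into flat index arithmetic, while padding enlarges a dimension of the allocated buffer. The sketch below is a general illustration in plain Python, not Devito's actual implementation. The shapes and the helpers `strides` and `flat_index` are made up for the example. It shows how an index computed with one set of strides can fall outside a buffer sized with another.

```python
from math import prod


def strides(shape):
    """Row-major strides, in elements, for the given shape."""
    out = []
    step = 1
    for n in reversed(shape):
        out.append(step)
        step *= n
    return tuple(reversed(out))


def flat_index(idx, shape):
    """Flat offset of the multi-dimensional index `idx` in an array of `shape`."""
    return sum(i * s for i, s in zip(idx, strides(shape)))


# Hypothetical extents: the same (x, y, z) array without and with z padding.
logical = (4, 4, 6)
padded = (4, 4, 8)

# Assume the buffer was sized from the unpadded shape.
allocated = prod(logical)

# Last valid point of the logical array.
last = tuple(n - 1 for n in logical)

ok_index = flat_index(last, logical)   # strides agree with the allocation
bad_index = flat_index(last, padded)   # strides assume padding that is not there

print(ok_index, ok_index < allocated)    # in bounds
print(bad_index, bad_index < allocated)  # past the end of the buffer
```

Whether the mismatch in the run above goes in this direction or the opposite one is not established in this thread. In both cases the access no longer matches the allocation.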

Sorry to piggy-back on this :), but how can I let Devito use a different toolchain to compile things? By default it picks up GNU, but I could use Intel (e.g. 2021.03) or AOCC from AMD.

Take a look at this: https://github.com/devitocodes/devito/wiki/FAQ#devito_arch
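As a rough sketch of what that FAQ entry is about: the compiler is chosen through the `DEVITO_ARCH` environment variable, which has to be set before Devito is imported. The helper `devito_env` below is hypothetical, and `"gcc"` is only a placeholder value; the accepted names for Intel or AMD toolchains depend on the Devito version, so check the FAQ rather than trusting this example.

```python
import os


def devito_env(arch, base=None):
    """Return a copy of an environment mapping with DEVITO_ARCH set to `arch`.

    `base` defaults to the current process environment. This helper is
    hypothetical and only illustrates the mechanism.
    """
    env = dict(os.environ if base is None else base)
    env["DEVITO_ARCH"] = arch
    return env


# Start from the settings shown at the top of this issue and pick a compiler.
env = devito_env("gcc", base={"DEVITO_LOGGING": "INFO", "DEVITO_MPI": "1"})
print(env["DEVITO_ARCH"])
```

The resulting mapping can be handed to `subprocess.run(..., env=env)` when launching the example script, or the same variable can simply be exported in the shell before running `python3`.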

Let me know if the segfault disappears once you remove linearize, in which case I'll close this issue

FabioLuporini avatar Sep 29 '21 15:09 FabioLuporini

I set 'linearize': False in the Operator options:

...
./examples/FWI/example_tti_miket.py:op = Operator([stencil_p, stencil_m, src_term], opt=('advanced', {'min-storage': True, 'linearize': False}), subs=spacing_map, name='OpExampleTti')
...

but it is unfortunately giving me

"/home/U/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/devito/finite_differences/differentiable.py", line 149, in __getattr__
    raise AttributeError("%r object has no attribute %r" % (self.__class__, name))
AttributeError: src object has no attribute 'data'

Does it mean that, for some reason, it could not allocate / initialize that object?

see below

$ python3 ./examples/FWI/example_tti.py  10 550 550 210  1
nt,nx,ny,nz,block;    10   550   550   210 1
nt,nx,ny,nz;  10 550 550 210
Operator `assign` ran in 0.20 s
Operator `assign` ran in 0.26 s
Operator `assign` ran in 0.26 s
Operator `assign` ran in 0.24 s
Operator `assign` ran in 0.24 s
Operator `assign` ran in 0.24 s
Operator `assign` ran in 0.24 s
Operator `assign` ran in 0.24 s
Traceback (most recent call last):
  File "./examples/FWI/example_tti.py", line 56, in <module>
    src = RickerSource(name='src', grid=grid, f0=fpeak, npoint=1, time_range=time_axis)
  File "/home/U/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/devito/types/basic.py", line 705, in __new__
    newobj.__init_finalize__(*args, **kwargs)
  File "/home/U/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/examples/seismic/source.py", line 216, in __init_finalize__
    self.data[:, p] = self.wavelet
  File "/home/U/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/devito/finite_differences/differentiable.py", line 149, in __getattr__
    raise AttributeError("%r object has no attribute %r" % (self.__class__, name))
AttributeError: src object has no attribute 'data'
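One possible reading of this message, which nothing in this thread confirms: in Python, when a property getter raises `AttributeError` for any reason, the interpreter falls back to the class's `__getattr__`, and the error raised there replaces the original one. The traceback above ends in a `__getattr__`, which is consistent with that pattern. The class below is a hypothetical stand-in, not Devito's `RickerSource`, and the "real cause" string is invented for the demonstration.

```python
class Source:
    """Hypothetical stand-in used only to demonstrate the masking effect."""

    @property
    def data(self):
        # Any AttributeError raised inside a property getter triggers the
        # __getattr__ fallback below.
        raise AttributeError("real cause: something failed while building data")

    def __getattr__(self, name):
        # Only called after normal attribute lookup has failed.
        raise AttributeError(
            "%r object has no attribute %r" % (type(self).__name__, name)
        )


try:
    Source().data
except AttributeError as exc:
    message = str(exc)

# The message names 'data' as missing, and the real cause is not shown.
print(message)
```

If this is what is happening, the "no attribute 'data'" text hides an earlier failure, and temporarily logging or re-raising inside the getter would be one way to surface it.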

drmichaelt7777 avatar Sep 29 '21 16:09 drmichaelt7777

Can you suggest a way to get the code past this behavior? I have not been able to put together the MFE. Should I clear any Python caches? Thanks ...

drmichaelt7777 avatar Sep 30 '21 19:09 drmichaelt7777

I am attaching a .TXT file with the stderr/stdout from running one of the standard examples.py codes and a ZIP file with the two JIT *.c files Devito generated.

Note that this is not a TTI code, but it does demonstrate the issues.

Devito-crash.txt JIT_Code.ZIP

Please let me know if this MFE is sufficient.

Thank you! Michael

drmichaelt7777 avatar Oct 04 '21 20:10 drmichaelt7777

The fact that you've set linearize=False definitely doesn't explain this error:

"/home/U/cs691/performance/analysis/systems/ccNUMA/Intel64/ESD/devito/devito/finite_differences/differentiable.py", line 149, in __getattr__
    raise AttributeError("%r object has no attribute %r" % (self.__class__, name))
AttributeError: src object has no attribute 'data'

You must have introduced some other changes somewhere else. Or, at least, I can't really see how that would be possible.

I'm afraid that without an MFE there's virtually nothing I can do. You should penetrate the abstractions (that is, look inside the examples), change the equations/operator, and try writing that MFE.

#1771 might be the cause of the original segfault

FabioLuporini avatar Oct 07 '21 07:10 FabioLuporini

@drmichaelt7777 is this still an issue with the latest version of Devito? If yes, would it be possible to get an MFE?

FabioLuporini avatar Jan 10 '22 08:01 FabioLuporini

Thank you for your response. I am in the process of switching my Python environment away from Conda. Do you have a recommendation for a non-Conda Python environment?

thanks! Michael

drmichaelt7777 avatar Jan 10 '22 17:01 drmichaelt7777

Can we close this? I think this should have been fixed by now, @drmichaelt7777.

FabioLuporini avatar Nov 08 '22 08:11 FabioLuporini

Yes, please close it.

Thanks !

drmichaelt7777 avatar Nov 09 '22 19:11 drmichaelt7777