simsopt icon indicating copy to clipboard operation
simsopt copied to clipboard

surface change_resolution method causes segmentation fault on subsequent evaluation of CPP methods

Open smiet opened this issue 10 months ago • 7 comments

When a Surface object's resolution is changed by the change_resolution method its sopp.Curve methods will access unallocated memory and cause segmentation faults.

MWE:

from simsopt.geo import SurfaceRZFourier
NFP = 5
s = SurfaceRZFourier.from_nphi_ntheta(mpol=5, ntor=5, stellsym=True, nfp=NFP)

s.darea_by_dcoeff() # successful evaluation of darea_by_dcoeff

s.change_resolution(7, 7)  # increase the resolution. 

s.darea_by_dcoeff()   # evaluation causes segmentation fault

causing the following error on my system:

munmap_chunk(): invalid pointer
[laika:07397] *** Process received signal ***
[laika:07397] Signal: Aborted (6)
[laika:07397] Signal code:  (-6)
[laika:07397] [ 0] /usr/lib/libc.so.6(+0x3dcd0) [0x7247d8153cd0]
[laika:07397] [ 1] /usr/lib/libc.so.6(+0x97624) [0x7247d81ad624]
[laika:07397] [ 2] /usr/lib/libc.so.6(gsignal+0x20) [0x7247d8153ba0]
[laika:07397] [ 3] /usr/lib/libc.so.6(abort+0x26) [0x7247d813b582]
[laika:07397] [ 4] /usr/lib/libc.so.6(+0x263bf) [0x7247d813c3bf]
[laika:07397] [ 5] /usr/lib/libc.so.6(+0xa1765) [0x7247d81b7765]
[laika:07397] [ 6] /usr/lib/libc.so.6(+0xa1abc) [0x7247d81b7abc]
[laika:07397] [ 7] /usr/lib/libc.so.6(__libc_free+0x158) [0x7247d81bca08]
[laika:07397] [ 8] /home/smiet/.pyenv/versions/3.11.11/lib/python3.11/site-packages/simsoptpp.cpython-311-x86_64-linux-gnu.so(+0x67a16) [0x7247cc467a16]
[laika:07397] [ 9] /home/smiet/.pyenv/versions/3.11.11/lib/python3.11/site-packages/simsoptpp.cpython-311-x86_64-linux-gnu.so(+0x61d5f) [0x7247cc461d5f]
[laika:07397] [10] /home/smiet/.pyenv/versions/3.11.11/lib/python3.11/site-packages/simsoptpp.cpython-311-x86_64-linux-gnu.so(+0x54fed) [0x7247cc454fed]
[laika:07397] [11] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(+0x1aab73) [0x7247d85aab73]
[laika:07397] [12] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(_PyObject_MakeTpCall+0x7d) [0x7247d8553d2d]
[laika:07397] [13] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x3079) [0x7247d865e539]
[laika:07397] [14] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(PyEval_EvalCode+0x230) [0x7247d8664e20]
[laika:07397] [15] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(+0x256210) [0x7247d8656210]
[laika:07397] [16] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(+0x1ab1d2) [0x7247d85ab1d2]
[laika:07397] [17] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(PyObject_Vectorcall+0x33) [0x7247d8554363]
[laika:07397] [18] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x3079) [0x7247d865e539]
[laika:07397] [19] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(+0x16d7a5) [0x7247d856d7a5]
[laika:07397] [20] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x7262) [0x7247d8662722]
[laika:07397] [21] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(+0x16d7a5) [0x7247d856d7a5]
[laika:07397] [22] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x7262) [0x7247d8662722]
[laika:07397] [23] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(+0x16e975) [0x7247d856e975]
[laika:07397] [24] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(+0x162004) [0x7247d8562004]
[laika:07397] [25] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(PyObject_Vectorcall+0x33) [0x7247d8554363]
[laika:07397] [26] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x3079) [0x7247d865e539]
[laika:07397] [27] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(+0x264f80) [0x7247d8664f80]
[laika:07397] [28] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(+0x156d12) [0x7247d8556d12]
[laika:07397] [29] /home/smiet/.pyenv/versions/3.11.11/lib/libpython3.11.so.1.0(_PyObject_Call+0x16d) [0x7247d8553b3d]
[laika:07397] *** End of error message ***
[1]    7397 IOT instruction (core dumped)  ipython

Likely the sopp.Curve's __init__ should be called again in the change_resolution method, but my initial attempt did not work, so leaving this for someone more knowledgeable in the CPP side of things to weigh in.

smiet avatar Mar 18 '25 13:03 smiet

It's working for me on a Mac and not crashing. What system are you using, so I can test similar one?

Image

mbkumar avatar May 07 '25 01:05 mbkumar

Hi Bharat,

I am using linux, compilation with GCC. The crash is intermittent, but seems to be related to out-of-boud memory access on the C++ side. Try this:

from simsopt.geo import SurfaceRZFourier
import numpy as np

a=[np.random.random(int(1e8)) for _ in range(10)] # fill memory to increase chance of collision

NFP = 5

s = SurfaceRZFourier.from_nphi_ntheta(mpol=5, ntor=5, stellsym=True, nfp=NFP)

s.darea_by_dcoeff() # successful evaluation of darea_by_dcoeff

s.change_resolution(70, 70)  # increase the resolution by much more

s.darea_by_dcoeff()

smiet avatar May 08 '25 08:05 smiet

I understand the source of the error now. I need to implement new functionality on the C++ side to fix this issue.

mbkumar avatar May 09 '25 22:05 mbkumar

I implemented the new functionality that I thought would fix this issue. But the issue still persists. My current understanding of the issue is that after the new functionality is added, pybind11 interface is causing issues. Some objects are being managed by both Python and C++ and are getting deleted by both. This results in different kinds of errors depending on how you structure your code. I have seen similar errors reported in other codes that use pybind11. I'll try to get a better understanding of the pybind11 interface and then see if I can find a fix for this problem.

mbkumar avatar May 13 '25 04:05 mbkumar

An alternative that can be used is copy the original surface object. The copy can be created by passing the new mpol and ntor values to SurfaceRZFourier.copy. If you can recreate the Optimizable graph with the newly created Surface object, then it can be used in place of the old one.

mbkumar avatar May 13 '25 04:05 mbkumar

I tried different pybind11 return value policies for SurfaceRZFourier.darea_by_dcoeff method, but none of them worked. I am not sure where the source of the error is, but I have a general idea of the issue. I'll keep pursuing a fix, but my recommendation for now is to use SurfaceRZFourier.copy in place of SurfaceRZFourier.change_resolution whenever derivatives are needed before and after change of resolution.

So the following code works

from simsopt.geo import SurfaceRZFourier

s = SurfaceRZFourier.from_nphi_ntheta(mpol=5, ntor=5, stellsym=True, nfp=5)
s.darea_by_dcoeff()
# s.change_resolution(mpol=70, ntor=70 # This is not working at the moment
s1 = s.copy(mpol=70, ntor=70)
s1.darea_by_coeff()

mbkumar avatar May 13 '25 16:05 mbkumar

@mbkumar I think that returning a copy with change_resolution is a perfectly good workaround, is consistent with SurfaceRZPseudospectral.

I made the necessary changes in PR #519

smiet avatar May 15 '25 15:05 smiet