geogram icon indicating copy to clipboard operation
geogram copied to clipboard

CSG operations sometimes crash under MacOS

Open BrunoLevy opened this issue 1 year ago • 11 comments

CSG operations sometimes crash under MacOS, not always.

This probably comes from the smaller stack size under MacOS, and some operations involving arithmetic expansions allocated on the stack (I guess that what we have is a stack overflow). Remember: stack is much smaller under MacOS than with other OSes.

This mostly happens in one of the CSG tests, which lets me think that one of the radial sort predicates still allocates expansions on the stack. Once this happened in one of the Intersect tests (three_cubes).

Note: this issue is mostly a duplicate of #112

BrunoLevy avatar Feb 05 '24 14:02 BrunoLevy

Ooops, did not want to close the issue, needs to be tested before. expansion::assign_product() allocates subproducts, on stack if number of components are smaller than MAX_CAPACITY_ON_STACK, that was set to 1024 -> decreased it to 512.

BrunoLevy avatar Feb 11 '24 06:02 BrunoLevy

Seems to fix the problem (at least for last run). Since the problem is not reproducible, let us wait for a couple of days before closing the issue...

BrunoLevy avatar Feb 11 '24 06:02 BrunoLevy

Launched the tests three times, and still got a crash... Will need finer analysis of the crash.

BrunoLevy avatar Feb 11 '24 10:02 BrunoLevy

Lowered max expansion size to 256

  • it seems that it no longer crashes (to be tested with multiple runs)
  • it assertion-fails (trying to allocate an expansion on stack with more than 256 components) on example_014.csg --> testing on my machine, with everything configured as if it was a mac.

OK, it was in expansion::compare() that computes an expansion difference and always stores it on the stack -> now stores the temporary expansion on the heap if the required capacity is larger than MAX_EXPANSION_ON_HEAP

BrunoLevy avatar Feb 12 '24 11:02 BrunoLevy

CSG seems to be OK on mac now, but we got another test that crashed on mac in debug mode, compute_RVD with tordu.mesh (still assertion failure capa <= MAX_CAPACITY_ON_STACK). I guess it is in one of the side1(), side2(), side3() predicates. It can be probably safely ignored, but let us investigate...

But it is weird we never had this one before ? -> it is because MAX_CAPACITY_ON_STACK was 512 rather than 256 !

OK still using my Linux box with all defines as if we were running with a small stack, it is in side3_exact_SOS() (not a big surprise), and we compute the product between an expansion of length 8 and one of length 19, which requires 2819=304 components. So 256 is not enough for all the predicates with doubles as input. Let us see now what happens with MAX_CAPACITY_ON_STACK set back to 512 (but I'm afraid we will still have some stack overflows coming back, let us see...).

BrunoLevy avatar Feb 12 '24 12:02 BrunoLevy

Nope, 512 is too much.

Well in fact we need to make a difference between:

  • when we need to switch from stack to heap
    • in expansion functions (product and compare)
    • in constructions in exact_geometry.h
  • when we throw an assertion fail because there was an unreasonable allocation of an expansion on the stack

BrunoLevy avatar Feb 12 '24 14:02 BrunoLevy

  • Back to 256 on Macs
  • Made assertion in bytes_on_stack() less stringent, because standard predicates can use expansions as long as 400-500 components (without any problem because there is a small number of them, they do not fill the stack)

Now testing...

BrunoLevy avatar Feb 12 '24 14:02 BrunoLevy

All light greens.... Let us test several times before closing the issue.

BrunoLevy avatar Feb 12 '24 15:02 BrunoLevy

Ran the test three times, seems to be OK, Let us wait for nightly builds before closing the issue...

BrunoLevy avatar Feb 12 '24 16:02 BrunoLevy

All lights green !

BrunoLevy avatar Feb 13 '24 05:02 BrunoLevy

Got a crash on mac with compute_CSG example_018.csg, Debug, Gargantua, this build

This one will be difficult to catch because it is not reproducible (re-launching the job did not re-trigger the bug)

BrunoLevy avatar Mar 26 '24 12:03 BrunoLevy

Added a stacktrace in the signal handler Problem seems to occur in simplify_coplanar_facets(), in the phase that runs in parallel, that uses a CoplanarFacets object, that creates some attributes. Maybe a race condition in the attributes manager ? But one of the stack traces had a call to expansion::assign_product()...

BrunoLevy avatar May 02 '24 21:05 BrunoLevy

It seems that AttributeStore::notify() mises a spinlock acquire/release.

BrunoLevy avatar May 03 '24 14:05 BrunoLevy

Seems to be fixed by acquiring/releasing the spinlock in AttributeStore::notify(). Before closing the issue, I need to make sure there are no wrong concurrent accesses to the mesh in simplify_coplanar_facets, in particular when new facets are created. The number of new facets is bounded (re-triangulating a polygon does not create more facets than the initial number of facets), so maybe I should just reserve twice the number of existing facets.

BrunoLevy avatar May 04 '24 05:05 BrunoLevy

Added MeshFacets::reserve() function called at the beginning of simplify_coplanar_facets()

BrunoLevy avatar May 04 '24 15:05 BrunoLevy

Seems to be fixed (summary):

  • solved concurrency issue in AttributeStore notify mechanism
  • pre-allocating upper bound of number of facets in simplify_coplanar_facets()

If the problem reappears:

  • double check that no facet allocation occurs in simplify_coplanar_facets() (add a readonly mechanism to MeshElements)

BrunoLevy avatar May 06 '24 06:05 BrunoLevy