cgal icon indicating copy to clipboard operation
cgal copied to clipboard

Multi-threaded execution of operations on polygons and polyhedra

Open dpapavas opened this issue 2 years ago • 12 comments

I'm experimenting with parallel (i.e. multithreaded) evaluation of various kinds o CGAL operations. I couldn't find much specific documentation on the matter; mainly:

  • it should be possible to use different objects in different threads at the same time (of the same type or not),
  • it is not safe to access the same object from different threads at the same time, unless otherwise specified in the class documentation.

and also:

Most operations on CGAL::Exact_predicates_exact_constructions_kernel objects are now thread-safe if CGAL::Exact_rational is mpq_class (from GMPXX), boost::multiprecision::mpq_rational or CGAL::QuotientCGAL::MP_Float. The objects are not atomic though, so the usual restrictions on avoiding race conditions apply. For users who do not use threads, this can be disabled with CGAL_HAS_NO_THREADS.

I have defined CGAL_HAS_THREADS and have tried using both a filtered Cartesian kernel and an EPEC kernel. I'm using GCC 10.2.1 on Linux with CGAL from an up-to-date master branch. Each operation is carried out in one of N threads, and accesses its operands (say a pair of Nef polyhedra to be joined) through const references. I'll list the various operations I've tried and the issues arising for some of them:

  1. Plain (i.e. segment traits) and circle-segment traits polygon operations. These include creation, transformation (in the sense of translation, etc.), regularized boolean operation via General_polygon_set_2 and convex hulls. These operations fail to execute properly when accessed from multiple threads in parallel, even through const references, as the documentation seems to imply (although I'll note in passing that I would interpret "the objects are not atomic though, so the usual restrictions on avoiding race conditions apply" to mean that read-only access, as through a const reference should be fine).

  2. Conic traits polygon operations. These are the same kind of operations as above, and seem to be executed properly when not accessed from multiple threads concurrently, as above (meaning that they produce the expected results) but they always crash when collected from the main process. The crash happens when the destructor is called and what you see below is printed. Is this a bug, or am I doing something wrong?

N4CORE12Realbase_forINS_6BigIntEEE
terminate called after throwing an instance of 'CGAL::Assertion_exception'
  what():  CGAL ERROR: assertion violation!
Expr: ! blocks.empty()
File: /home/dimitris/source/gc/cgal/CGAL_Core/include/CGAL/CORE/MemoryPool.h
Line: 125
Aborted
  1. Polyhedron operations, including creation, transformation, conversion between Polyhedron_3, Nef_polyhedron_3, Surface_mesh, boolean operations on Nefs, convex hulls and subdivisions (where the operand is first copied through a const reference and then subdivided). These seem to work fine even if the same object is accessed from different threads (through const references). I've tried various ways to get the multi-threaded execution to fail, with no success. I wonder therefore, whether these operations can in fact be safely accessed through const references from different threads at the same time. If so, it would be helpful to state this explicitly in the documentation, as the benefits can be considerable. More generally, do I understand correctly that concurrent read-only access to the same object, should always be considered unsafe, or would it be a good idea to test whether it seems to work and open an issue to inquire whether it is in fact safe and under what conditions (and possibly whether the documentation should be updated to reflect it)?

dpapavas avatar Nov 19 '21 20:11 dpapavas

Hello, thank you for experimenting with threads and CGAL. Do you think you could provide us with some testcases for 1 and 2 so we can reproduce the issue and investigate? For 3, well, we started with some basic things in this release (number, kernel), and we expect some higher-level packages will progressively test, fix and document their multi-thread behavior. It seems useful to me to test and document (through a github issue) what fails.

mglisse avatar Nov 20 '21 07:11 mglisse

Hi Marc, thanks for the quick response. Just a few questions to make sure I understand you (and the expected theading behavior of CGAL) correctly: You mention test responses for 1 and 2, but although 2, does indeed mention something that could be a bug (the failed assertion), 1, from what I understand is the expected behavior. Should I then understand that accessing two objects through const references from different threads concurrently should be expected to work wihtout problems?

Also in your response concerning 3, am I interpreting it correctly in that it mainly addresses my last question so that again, concurrent const reference access should indeed be expected to work (as I observe that it does for anything mesh/polyhedron-related) and anything that fails is worth bringing up?

dpapavas avatar Nov 20 '21 08:11 dpapavas

Should I then understand that accessing two objects through const references from different threads concurrently should be expected to work wihtout problems?

I think we are aiming for that (except in cases where it would be too hard), although I don't want to make promises in other people's name.

mglisse avatar Nov 20 '21 09:11 mglisse

Great. I understand and, of course, I don't expect any guarantees or promises. My main aim in asking is to inform further discussion and investigation and also to guide the design of my multi-threaded execution scheduler.

Regarding test cases for points 1 and 2 above, my current experimentation is part of a project that, essentially, is a wrapper around (parts of) CGAL, that allows execution of CGAL operations via a scripting language frontend - a sort of CGAL compiler if you will. I have set up test cases to test the functionality and have tried to extend them to cover threaded execution (i.e. try to make threaded execution fail, which they do in the case of 1 and 2 above, unless the operations are locked behind mutexes). My code is not published yet, but I wouldn't mind publishing it, if you wanted to try it out as is (it's going to be free software anyway). It's not particularly big, (a few thousand lines of code at the moment) and I could direct you in executing the tests and where the relevant CGAL-related stuff is.

Alternatively, I could try to tease out the relevant parts into a self-contained bare minimum test case. Although that'd be arguably best with respect to facilitating debugging, and hence my default choice, in this case it would be a relatively complex task as I'd have to reproduce the CGAL operations involving various kinds of polygons and the multi-threaded execution scheduler, so I wouldn't mind avoiding that. Let me know what you'd prefer.

dpapavas avatar Nov 20 '21 10:11 dpapavas

I may not be the one debugging those, so we can wait for others to chime in. For the assertion in CORE/MemoryPool.h, it seems this file may expect threads to work on different objects, and in particular it may not be happy with memory allocated in one thread and deallocated in another, but I have no idea if that's what's happening.

mglisse avatar Nov 21 '21 15:11 mglisse

We at GeometryFactory will probably be the ones ending debugging those issues.

@dpapavas Of course a self-contained minimal test case would be better for us, but we understand it can be difficult to set up. If you have complex code that can be compiled on our machines, and can be executed (with your instructions) to reproduce the issue then that will be sufficient for us: we will run the complex program in the debugger, and try and find the bugs using that.

lrineau avatar Nov 22 '21 11:11 lrineau

@lrineau I'll have a go at making self-contained test cases. Excuse the reiteration, but to avoid making tests for things that weren't bugs in the first place and since Marc wasn't entirely sure: the tests will try to show that polygon-related operations fail to produce correct results (at least consistently) when the same object is read, through const references from multiple threads concurrently and also that conic traits polygons (and presumably others that are based on CORE number types) additionally throw exceptions when the objects are allocated from one thread and freed from another. (That is, operations on conic traits polygons do seem to produce the correct results when the same object is not accessed concurrently in any way and only then, but they still throw an exception when they are collected.)

Do I understand correctly that this is not the expected behavior?

dpapavas avatar Nov 22 '21 16:11 dpapavas

I do not think anything was done, so far, to make CORE number types thread-safe. That is probably something we want, but it it expected to have issues with CORE and threads.

But, if you take CGAL::Exact_predicates_exact_constructions_kernel, then all its geometric objects and number types should be thread-safe and reference counted: they will be destroyed by whatever thread hold the last handle to them.

lrineau avatar Nov 23 '21 14:11 lrineau

I think the attached test code manifests all the failure modes I've seen in my application. (There are also some failures connected to more complex stuff like hulls made out of polygon points, transformations of circle and conic polygons etc., but I'm fairly confident, or at least hopeful, that the underlying problems are the same). The code contains 4 different test cases and I've also added some preprocessor machinery that allows you to easily switch on/off particular cases and also to serialize the threads, meaning that if CONCURRENT is defined to be 1, the code will still run in separate threads, but the threads will execute serially, so that no two theads ever run concurrently.

The test cases are:

  1. Segment polygon transformation and boolean joins. Mostly meant to test the joins, making sure that both the first and the second operands are concurrently accessed in turn. It is probably unimportant, but not that if only join operations that share the same first operand are run concurrently (by commenting out, say, threads tc and te), the failure rate drops to amost zero.
  2. Similar to the above but meant to test the translations, which are executed concurrenlty, while the joins then follow in the main thread. The result is tested by edge count, but it can also be drawn, to reveal that most of the time, some of the transformed polygons vanish.
  3. Boolean joins of circle-segment polygons. The result is tested via edge count.
  4. Same as above but implemented with conic polygons.

Note the following points:

  • All cases, execute properly if the threads are run sequentially. Case 4 still fails, as discussed, but it consistently fails in the same way and after the correct result has been asserted, when the conic polygons are destroyed.
  • Case 4 does not fail when destroying the polygons, if the initial polygons (p - t) are created in the main thread, even though the result polygons (t[]) are created in separate threads. (It still fails along with all others when run concurrently.)
  • When run concurrently, all cases fail in various ways. They can throw CGAL exceptions, produce segmentation faults, or simply fail to produce the correct results. Rarely, they may run correctly. A convenient way to see the full gamut of failure modes, (or make sure that there are none,) is with something like while true; do ./test && echo ok; done.

dpapavas avatar Nov 25 '21 21:11 dpapavas

I don't mean to hurry this or anything, but if anyone has had a chance to take a look at the test cases, I'd be interested in whether they can confirm that this is indeed abnormal behavior and whether it will be fixed at some point. My immediate concern and the reason I'm asking, is that this determines the necessary scheduling logic for multithreaded evaluation, which is at the heart of my application (and I suppose most multithreaded CGAL applications) and therefore largely affects the overall design.

dpapavas avatar Dec 12 '21 19:12 dpapavas

A reminder, in case this has been overlooked: any feedback on whether this is actually buggy behavior and if so, whether fixing it would be considered a priority, or not, would be very much appreciated.

dpapavas avatar Feb 25 '22 10:02 dpapavas

Pinging this, to let you know that I'm still very much interested in this. Please let me know if this is a bother.

dpapavas avatar May 26 '22 10:05 dpapavas