oce icon indicating copy to clipboard operation
oce copied to clipboard

Grand Dispatch Central for OSX multi threading

Open tpaviot opened this issue 11 years ago • 35 comments

Istvan recently posted the following topic on the official OCCT dev portal forum:

Apple released GCD about two years ago. GCD is a basically parallel programming toolkit, with super-efficient, lightweight asynchronous tools, like dispatch queues, semaphores, asynchronous I/O tools etc. Here is the reference: https://developer.apple.com/library/Mac/DOCUMENTATION/Performance/Reference/GCD_libdispatch_Ref/Reference/reference.html

I have created a patch, that uses dispatch_semaphore_t instead of pthread_mutex_t in Standard_Mutex, and it seems to give some performance improvement over the phtread_mutex implementation. I have attached the patch for further examination.

The other thing where it can be useful, is the replacement of Intel TBB parallel_for_each, with dispatch queues, they can be easily implemented, reducing the 3rd party dependencies of OCCT. I also attached a simple implementation of it. It uses std::function, which is a C++11 feature, but C++11 is supported by Apple since 10.7+ and iOS 5.0+, so this should not be a problem. Otherwise, it can be rewritten work just like TBB's parallel_for_each.

These should not be considered as tested, and reviewed codes, these rather ideas, and suggestions for using this state of the art technology where it is available.

Code is under the ic/osx-gcd-multithreading (https://github.com/tpaviot/oce/tree/ic/osx-gcd-multithreading) dev branch.

tpaviot avatar Dec 21 '13 11:12 tpaviot

It would be nice to add an option in CMakeLists.txt to enable using GCD (only on Apple, and enabled by default) so that people can easily build with and without GCD and compare performance.

dbarbier avatar Dec 23 '13 11:12 dbarbier

@dbarbier I agree. Enabling GCD by default on APPLE shoud however break compilation for older OSX versions. I don't know how to detect OSX Maverick with CMake, which would be the only way to safely enable GCD multithreading by default.

tpaviot avatar Dec 24 '13 09:12 tpaviot

Modifications to Standard_Mutex.cxx miss some changes to Standard_Mutex.hxx otherwise compilation fails. It's necessary to:

  1. include <dispatch/dispatch.h> to benefit from the dispatch_semaphore* functions/types
  2. include <libkern/OSAtomic.h> so that OSAtomicCompareAndSwap32Barrier is found
  3. define myMutex and myCount as private members of the Standard_Mutex class ;
  4. rewrite the UnLock method.

With commit 6fcd293, branch compiles fine. However, I don't know if it does the job as expected.

tpaviot avatar Dec 24 '13 13:12 tpaviot

Sorry I probably missed some patches. myCount should be volatile.

istvan-csanady avatar Dec 24 '13 13:12 istvan-csanady

GCD is part of OSX since 10.6, and it is the oldest supported version, isn't it?

istvan-csanady avatar Dec 24 '13 13:12 istvan-csanady

In the unlock function the semaphore has to be decremented not released, I will submit the correct version when I arrive home.

istvan-csanady avatar Dec 24 '13 13:12 istvan-csanady

@istvan-csanady since 10.6 right, I thought it was older (maybe some recent major update confused me about that). We will check your parallel_for_each function, if it works multithreading can be enabled by default on OSX.

tpaviot avatar Dec 24 '13 14:12 tpaviot

Btw the mutex implementation is based on this: http://www.mr-edd.co.uk/blog/sad_state_of_osx_pthread_mutex_t

However the attached source is not entirely correct as far is I remember, because the ciunter variable is not declared as volatile.

istvan-csanady avatar Dec 24 '13 14:12 istvan-csanady

Here are the updated files: https://www.cubby.com/pl/OCCT/_917a934c23ef42b0a1de1adedd637747

istvan-csanady avatar Dec 24 '13 21:12 istvan-csanady

@istvan-csanady thank you, I merged modified Standard_Mutex.cxx and hxx files into one single commit cda5aa5a7309f271f8ded473ed5b24b7b65120c4.

The next step are to:

  • add your MTAPIParallel cpp and header files to the Standard folder. I think we should rename those files to something more explicit ;
  • modify CMake files in order to compile and use, by default, this multi threading implementation on OSX ;
  • add one or more test cases into our testing suite to be able to check everything is going fine (and measure performance enhancements).

tpaviot avatar Dec 25 '13 09:12 tpaviot

Indeed, we should rename it, eg. Standard_ParallelFor? MTAPI is just a prefix that I use in a part of my project.

istvan-csanady avatar Dec 26 '13 15:12 istvan-csanady

@istvan-csanady I updated the branch with your MT* files and modified the cmake files. I patched FastDiscret/IncrmentalMesh with your parallel_for_each static method but I have the following compilation error for both BRepMesh_FastDiscret and BRepMesh_IncrementalMesh:

[ 30%] Building CXX object adm/cmake/TKMesh/CMakeFiles/TKMesh.dir/__/__/__/src/BRepMesh/BRepMesh_FastDiscret.cxx.o
/Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:270:54: error: 
      call to implicitly-deleted copy constructor of 'BRepMesh_FastDiscret'
    parallel_for_each (aFaces.begin(), aFaces.end(), *this);
                                                     ^~~~~
/Users/thomas/Devel/oce/inc/BRepMesh_FastDiscret.hxx:234:33: note: copy
      constructor of 'BRepMesh_FastDiscret' is implicitly deleted because field
      'myVertices' has an inaccessible copy constructor
BRepMesh_DataMapOfVertexInteger myVertices;

(I run the clang version provided with OSX Maverick).

tpaviot avatar Dec 26 '13 21:12 tpaviot

@tpaviot the usage of my parallel for each is slightly different from Intel TBBs, it uses a modern, C++11 based approach, the last argument is a lambda function. So to rewrite the code in BRepMesh_FastDiscret, you will have to write:

parallel_for_each(aFaces.begin(), aFaces.end(), [ this ](TopoDS_Face aFace) { (*this)(aFace); });

istvan-csanady avatar Dec 26 '13 21:12 istvan-csanady

However, it could be rewritten, to only take an object with an operator() defined. But this approach fits much better to C++'s current state.

istvan-csanady avatar Dec 26 '13 21:12 istvan-csanady

And just another comment on the parallel_for_each: with C++11 it is absolutely unnecessary to pull the whole Intel TBB library only because of the parallel_for_each. With std::thread such a parallel_for could be easily implemented. I will consider to write a prototype, and if I will, then parallelism could be turned on by default on every platform where C++11 is available. The only difficulty is to manage the lifetime of threads, and preventing the creation of new threads every time when a parallel_for_each is called. Probably we will need a thread pool.

istvan-csanady avatar Dec 26 '13 22:12 istvan-csanady

One can already build with openmp if TBB dependency is not wanted, I am not sure that we want to provide our own thread pool.

dbarbier avatar Dec 26 '13 22:12 dbarbier

@dbarbier Thats right, I forgot that OCC has an OpenMP based implementation as well. But since OpenMP is not supported by all the supported compilers, it would be reasonable to provide a parallel_for_each implementation. std::async could be a good choice, if we don't want to write a thread pool (that is not trivial of course).

istvan-csanady avatar Dec 26 '13 22:12 istvan-csanady

@istvan-csanady Your suggestion for the parallel for each call from BRepMesh_FastDiscret does not compile on my machine:

parallel_for_each(aFaces.begin(), aFaces.end(), [ this ](TopoDS_Face aFace) {(*this)(aFace);});

This produces the following output:

In file included from /Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:102:
/Users/thomas/Devel/oce/src/BRepMesh/MTAPIParallel.hxx:28:13: error: no matching
      function for call to object of type 'const <lambda at
      /Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:270:53>'
            function(iterator);
            ^~~~~~~~
/Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:270:5: note: in
      instantiation of function template specialization
      'parallel_for_each<std::__1::__wrap_iter<TopoDS_Face *>, <lambda at
      /Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:270:53> >'
      requested here
    parallel_for_each(aFaces.begin(), aFaces.end(), [ this ](TopoDS_Fac...
    ^
/Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:270:53: note: 
      candidate function not viable: no known conversion from 'const
      std::__1::__wrap_iter<TopoDS_Face *>' to 'TopoDS_Face' for 1st argument
    parallel_for_each(aFaces.begin(), aFaces.end(), [ this ](TopoDS_Fac...
                                                    ^
1 error generated.

tpaviot avatar Dec 27 '13 08:12 tpaviot

Sorry, here is the correct one... The argument of the lambda is an iterator In BRepMesh_FastDiscret: parallel_for_each(aFaces.begin(), aFaces.end(), [ this ](std::vector<TopoDS_Face>::iterator iter){ Process (*iter); });

Similarly in BRepMesh_IncrementalMesh: parallel_for_each(aFaces.begin(), aFaces.end(), [this](std::vector<TopoDS_Face>::iterator iter){ myMesh->Process (*iter); });

Of course the CreateMutexesForSubShapes call is necessary in both cases.

istvan-csanady avatar Dec 27 '13 11:12 istvan-csanady

@istvan-csanady I had to slightly modify what you suggest in order to compile TKMesh. Can you please review ff53abe ? Not tested yet.

tpaviot avatar Dec 27 '13 13:12 tpaviot

@tpaviot Seems to be OK.

istvan-csanady avatar Dec 27 '13 14:12 istvan-csanady

@istvan-csanady ok, that works. I added a simple test case (BRepMesh_IncrementalMesh for the famous cylinder_head geometry), see 5b31e63dc9cc7fb1469772f79a8891a15918cbb0.

Results of this test are reproduced below:

$ ./MultiThread_test 
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from MultiThreadTestSuite
[ RUN      ] MultiThreadTestSuite.testMeshCylinderHeadSingleThread
[       OK ] MultiThreadTestSuite.testMeshCylinderHeadSingleThread (26665 ms)
[ RUN      ] MultiThreadTestSuite.testMeshCylinderHeadMultiThread
[       OK ] MultiThreadTestSuite.testMeshCylinderHeadMultiThread (22577 ms)
[----------] 2 tests from MultiThreadTestSuite (49242 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (49242 ms total)
[  PASSED  ] 2 tests.

Your GCD MT implementation results in an speed increase of 15.3%. This test was conducted on an old 2009 MacBoocPro 2.53GHz Intel Core 2 Duo/8Gb RAM. I'll try to do other tests and compare with Intel TBB performances. Note that both FastDiscret and IncrementalMesh are single threaded by default (the myInParallel boolean is set to Standard_False by default). Maybe we should set it to Standard_True by default on OSX/GCD. We may notice speed improvements in boolean operations.

tpaviot avatar Dec 27 '13 16:12 tpaviot

@tpaviot If I understand right, this test case calculates the mesh of a cylinder. I suggest to add a test case where a shape, that has many faces is triangulated, because the performance improvement could be measured better in such a shape.

istvan-csanady avatar Dec 27 '13 17:12 istvan-csanady

@istvan-csanady No, cylinder head is not just a simple cylinder: it's usually part of an engine (see http://www.opencascade.org/showroom/shapegallery/gal4/). It's a complex shape made of several faces, that's why I chose it for the use case.

tpaviot avatar Dec 27 '13 21:12 tpaviot

I ran the same test with Intel TBB:

$ ./MultiThread_test 
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from MultiThreadTestSuite
[ RUN      ] MultiThreadTestSuite.testMeshCylinderHeadSingleThread
[       OK ] MultiThreadTestSuite.testMeshCylinderHeadSingleThread (26109 ms)
[ RUN      ] MultiThreadTestSuite.testMeshCylinderHeadMultiThread
[       OK ] MultiThreadTestSuite.testMeshCylinderHeadMultiThread (22301 ms)
[----------] 2 tests from MultiThreadTestSuite (48410 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (48410 ms total)
[  PASSED  ] 2 tests.

14.6% in time reduction. This is very close to the previous result, I need more tests to conclude about the compared benefits from using GCB or TBB.

tpaviot avatar Dec 27 '13 22:12 tpaviot

@istvan-csanady I finished merging your patch with current master branch. On Appel, MT is enabled by default for BRepMesh incremental and fast discrete meshes. If you don't want to use parallel meshing, you have to explicitely SetParallel your mesh to Standard_False. The test case use 2 complex shapes from the official occt gallery (cylinder head and F1). Do you have any other experimental code that could be added to the MT test case (not only meshing of course) ?

tpaviot avatar Jan 03 '14 14:01 tpaviot

@tpaviot No, I don't.

istvan-csanady avatar Jan 07 '14 12:01 istvan-csanady

@tpaviot Shouldn't we rename the MT* files to something that fits to the naming convention of OCC?

istvan-csanady avatar Jan 07 '14 12:01 istvan-csanady

@istvan-csanady I don't really know about OCC naming conventions, and so far I left your files as you provided them.

tpaviot avatar Jan 07 '14 14:01 tpaviot

Hello, I suggest to rename them into src/Standard/Standard_GrandCentralDispatch.[ch]xx

There are other problems with this branch:

  • Files contain this copyright notice:
  Copyright (c) 2013 icsanady. All rights reserved.

We are thus not allowed to distribute them. @istvan-csanady: if you wrote these files, can you please put a less restrictive copyright notice?

  • Declarations in MTAPIParallel.hxx depend on macro USE_GCD. If this header file is installed, users may want to use it, and they will have to define USE_GCD in their project too. It would be better to remove conditions
#if(defined(__APPLE__) && defined(USE_GCD))

from the hxx file, and provide a dummy sequential alternative in cxx when compiling without defining USE_GCD.

dbarbier avatar Jan 07 '14 15:01 dbarbier