Added permutation parallelisation
This PR implements the factorization and parallelization of the computation of permutation-based p-values for the HSIC indices.
@regislebrun, do you see any reason why the parallelization of the computeWeightMatrix method would return slightly different results than the sequential alternative in a few rare cases? I have noticed slight differences when computing the conditional HSIC p-values over large numbers of permutations. After a bit of digging, it comes down to the collection of weight matrices: a few of the matrices differ in some of their elements compared to the previous non-parallel implementation.
Results can differ because of the ordering of computations. You should still get the same results when you force 1 thread, though; otherwise it's a bug.
As @jschueller said, floating-point addition is not associative, so depending on the order in which the values are aggregated in sums, the results will differ. In well-conditioned cases, this is harmless.
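A minimal illustration of this non-associativity, in plain Python (not tied to the OpenTURNS code):

```python
# Floating-point addition is not associative: grouping the same three
# values differently changes the rounding performed at each step.
x = (0.1 + 0.2) + 0.3   # 0.6000000000000001
y = 0.1 + (0.2 + 0.3)   # 0.6

# A parallel reduction may aggregate in either order, so the last bits
# of the result can legitimately differ between runs.
print(x == y)  # False
```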
I understand that the order of computation can vary, but given that I'm parallelizing over indices, and not over the computations themselves, I should still obtain the same collection of weight matrices in the parallel and sequential cases for a given collection of samples, no?
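The intuition here can be sketched in plain Python (a hypothetical stand-in for the C++ TBB loop, not the actual implementation): if each task computes one independent entry with exactly the same per-entry arithmetic as the sequential code, the results agree bit-for-bit, because reordering only matters when partial results are combined across tasks.

```python
from concurrent.futures import ThreadPoolExecutor

def weight_entry(i):
    # placeholder for the per-permutation weight-matrix computation;
    # each index is computed independently, with no shared accumulator
    return (i * 0.1 + 1.0) ** 0.5

sequential = [weight_entry(i) for i in range(100)]
with ThreadPoolExecutor(max_workers=8) as pool:
    # map preserves input order, and each slot is written exactly once
    parallel = list(pool.map(weight_entry, range(100)))

assert sequential == parallel  # exact equality, not approximate
```

So if the parallel loop only distributes independent index-wise computations, any element-wise difference points at something else (shared state, a non-thread-safe function, or an inner reduction whose order changed).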
Also, given that the computations within the parallelized routine involve matrix products (which I think rely on LAPACK), do I need to deactivate the LAPACK parallelization flag for this specific case? If so, could you tell me how to do so? The tests I ran, performing parallel matrix products with 500x500 matrices over a few hundred TBB parallel threads, seem to work fine and decently fast (which suggests that the script does not create hundreds of LAPACK threads for each TBB thread associated with a permutation), but I would still rather make sure this does not create any weird issue on different setups.
You mean OpenBLAS threads (LAPACK itself is not parallelized). That's already taken care of, see the TBBContext class.
do you have some benchmarking results ?
> do you have some benchmarking results ?
Do you mean the computation time (compared to the previous implementation), or the difference in actual numerical results? Either way I do, but apart from providing some raw numbers, they are hard to show, as they require compiling and running two builds: one with the old implementation and one with the new.
However, I think I might have a possible explanation for the aforementioned issue. The tests that I've been running rely on a SymbolicFunction, which, if I remember correctly, is not thread safe. Running other tests with a similar weight function defined as a PythonFunction instead seems to produce correct results. However, trying to combine PythonFunctions and ComposedFunctions within a TBB routine makes the Python kernel crash, so I have a new mystery to solve.
Then it also depends on the parallel flag of your weight function.
I have created 2 separate 'isParallel()' methods: one for the actual permutation parallelization, and one for the parallelization of the weight matrices computation. This allows checking independently whether the covariance model and the weight function are thread safe.
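The split described above might look roughly like this sketch (class and method names are illustrative, not the actual OpenTURNS API):

```python
class HSICEstimatorSketch:
    """Hypothetical sketch of the two separate thread-safety checks."""

    def __init__(self, covariance_models, weight_function=None):
        self.covariance_models = covariance_models
        self.weight_function = weight_function

    def is_permutation_parallel(self):
        # permutations can be distributed over threads only if every
        # covariance model evaluation is thread safe
        return all(m.isParallel() for m in self.covariance_models)

    def is_weight_parallel(self):
        # the weight-matrix computation additionally requires a
        # thread-safe weight function (e.g. not a SymbolicFunction)
        return self.weight_function is None or self.weight_function.isParallel()
```

Keeping the two flags separate means a non-thread-safe weight function only forces the weight-matrix loop to run sequentially, without disabling the permutation parallelization.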
Edit: however, it seems I still have to master the art of the rebase. I apologize for the poorly named and ordered batch of commits.
It starts to look good. Do the tests still fail?
For now they surely will: by factorizing the generation of a single set of permutations for all output dimensions, the new version of the code will always produce slightly different results (due to the randomness of the permutation generation). I am currently validating the latest version against a hybrid solution to check whether all tests would pass, and I will then update the new tests on this branch.
I updated the tests, which now run without errors. However, they appear to time out on the macOS integration machine. Any idea what might be causing this? These tests should take a fraction of a second when everything works smoothly.
I don't know. Some kind of deadlock? Maybe something else is not thread-safe here?
Apart from the weight function and the covariance model methods, for which we check whether they are thread safe, the only other 'non-trivial' computations contained in the parallel loop are the computeHSICIndex methods of the HSICUStat and HSICVStat classes, which again should only involve the covariance model methods as well as linear algebra computations.
Moreover, it seems weird that some non-thread-safe operation we missed would only cause trouble on the macOS tests.
After some digging, only 1 of the 6 tests on HSICEstimator, pyinstallcheck_HSICEstimatorGlobalSensitivity_std, actually passes on macOS (GitHub). However, it takes 88 seconds. The exact same Python test takes roughly 2 seconds on Windows (GitHub) and roughly 1 second on Linux (CircleCI), so there clearly is some issue there. Could it be related to a different handling of the TBB/OpenBLAS conflicts?
@sofianehaddad, maybe you have some insight regarding parallelization on macOS?
Cheers
So not a deadlock but only a timeout? It might be related to the way TBBContext disables OpenBLAS threading in a nested TBB parallel region, since the new loops wrap CovarianceModel::discretize, which is already parallelized with TBB. Maybe we should spare the innermost OpenBLAS calls?
> so not a deadlock but only a timeout ?
Yes, I believe it is an actual timeout due to a parallelization malfunction.
@sofianehaddad @jschueller @regislebrun @mbaudin47 If this works on all platforms except macOS, can it be merged? Is there a way to deactivate this on macOS specifically?
This is likely not macOS-specific but rather due to the OpenBLAS code (we'll see this on Linux too); see my previous comment.
Agreed with @jschueller, as the macOS build uses OpenBLAS in the GitHub Actions CI. In your parallelization process, you embed 2 covariance models, and the discretize method of each is already parallel!
Could you suggest a way of fixing this? Thanks.
I'd try to disable OpenBLAS threading at the global level in the C++ test main function, and comment out the TBBContext instantiation in the TBB for loop, to see whether the timeout comes from that.
```
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ctest ...
```
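The same can be done from the Python side, with one caveat worth keeping in mind: most BLAS implementations read these variables only once, at library load time, so they must be set before the numerical library is imported. A minimal sketch:

```python
import os

# Must happen *before* the BLAS/LAPACK-backed library is loaded,
# otherwise most implementations silently ignore the values.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

# ...then import the numerical library, e.g.:
# import openturns as ot
```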