Added permutation parallelisation
This PR implements the factorization and parallelization of the computation of permutation-based p-values for the HSIC indices.
@regislebrun, do you see any reason why the parallelization of the computeWeightMatrix method would return slightly different results than the sequential alternative in a few rare cases? I have noticed slight differences when computing the conditional HSIC p-values over large numbers of permutations. After a bit of digging, it comes down to the collection of weight matrices: a few of the matrices differ in some of their elements compared to the previous non-parallel implementation.
Results can differ because of the ordering of computations. You should still get the same results when you force 1 thread, though; otherwise it's a bug.
As @jschueller said, floating-point addition is not associative, so depending on the order in which the values are aggregated in sums, the results will differ. In well-conditioned cases, this is harmless.
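A minimal illustration of this non-associativity, in plain Python (not tied to the OpenTURNS code):

```python
# Floating-point addition is not associative: grouping the same three
# values differently changes the rounding performed at each step.
x = (0.1 + 0.2) + 0.3   # 0.6000000000000001
y = 0.1 + (0.2 + 0.3)   # 0.6

# A parallel reduction may aggregate in either order, so the last bits
# of the result can legitimately differ between runs.
print(x == y)  # False
```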
I understand that the order of computation can vary, but given that I'm parallelizing over indices, and not over the computations themselves, I should still obtain the same collection of weight matrices in the parallel and sequential cases for a given collection of samples, no?
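The intuition here can be sketched in plain Python (a hypothetical stand-in for the C++ TBB loop, not the actual implementation): if each task computes one independent entry with exactly the same per-entry arithmetic as the sequential code, the results agree bit-for-bit, because reordering only matters when partial results are combined across tasks.

```python
from concurrent.futures import ThreadPoolExecutor

def weight_entry(i):
    # placeholder for the per-permutation weight-matrix computation;
    # each index is computed independently, with no shared accumulator
    return (i * 0.1 + 1.0) ** 0.5

sequential = [weight_entry(i) for i in range(100)]
with ThreadPoolExecutor(max_workers=8) as pool:
    # map preserves input order, and each slot is written exactly once
    parallel = list(pool.map(weight_entry, range(100)))

assert sequential == parallel  # exact equality, not approximate
```

So if the parallel loop only distributes independent index-wise computations, any element-wise difference points at something else (shared state, a non-thread-safe function, or an inner reduction whose order changed).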
Also, given that the computations within the parallelized routine involve matrix products (which I think rely on LAPACK), do I need to deactivate the LAPACK parallelization flag for this specific case? If so, could you tell me how to do so? The tests I ran, performing parallel matrix products with 500x500 matrices over a few hundred TBB parallel threads, seem to work fine and decently fast (which suggests that the script does not create hundreds of LAPACK threads for each TBB thread associated with a permutation), but I would still rather make sure this does not create any weird issue on different setups.
You mean OpenBLAS threads (LAPACK itself is not parallelized). That's already taken care of, see the TBBContext class.
do you have some benchmarking results ?
> do you have some benchmarking results ?
Do you mean the computation time (compared to the previous implementation), or the difference in actual numerical results? Either way I do, but apart from providing some raw numbers, they are hard to show, as they require compiling and running two builds: one with the old implementation and one with the new.
However, I think I might have a possible explanation for the aforementioned issue. The tests that I've been running rely on a SymbolicFunction, which, if I remember correctly, is not thread safe. Running other tests with a similar weight function defined as a PythonFunction instead seems to produce correct results. However, trying to combine PythonFunctions and ComposedFunctions within a TBB routine makes the Python kernel crash, so I have a new mystery to solve.
Then it also depends on the parallel flag of your weight function.
I have created 2 separate 'isParallel()' methods: one for the actual permutation parallelization, and one for the parallelization of the weight matrices computation. This allows checking independently whether the covariance model and the weight function are thread safe.
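The split described above might look roughly like this sketch (class and method names are illustrative, not the actual OpenTURNS API):

```python
class HSICEstimatorSketch:
    """Hypothetical sketch of the two separate thread-safety checks."""

    def __init__(self, covariance_models, weight_function=None):
        self.covariance_models = covariance_models
        self.weight_function = weight_function

    def is_permutation_parallel(self):
        # permutations can be distributed over threads only if every
        # covariance model evaluation is thread safe
        return all(m.isParallel() for m in self.covariance_models)

    def is_weight_parallel(self):
        # the weight-matrix computation additionally requires a
        # thread-safe weight function (e.g. not a SymbolicFunction)
        return self.weight_function is None or self.weight_function.isParallel()
```

Keeping the two flags separate means a non-thread-safe weight function only forces the weight-matrix loop to run sequentially, without disabling the permutation parallelization.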
Edit: however, it seems I still have to master the art of the rebase. I apologize for the poorly named and ordered batch of commits.
It starts to look good. Do the tests still fail?
For now they surely will: by factorizing the generation of a single set of permutations for all output dimensions, the new version of the code will always produce slightly different results (due to the randomness of the permutation generation). I am currently validating the latest version against a hybrid solution to check whether all tests would pass, and I will then update the new tests on this branch.
I updated the tests, which now run without errors. However, they appear to time out on the macOS integration machine. Any idea what might be causing this? These tests should take a fraction of a second when everything works smoothly.
I don't know. Some kind of deadlock? Maybe something else is not thread-safe here?
Apart from the weight function and the covariance model methods, for which we check whether they are thread safe, the only other 'non-trivial' computations contained in the parallel loop are the computeHSICIndex methods of the HSICUStat and HSICVStat classes, which again should only involve the covariance model methods as well as linear algebra computations.
Moreover, it seems weird that some non-thread-safe operation we missed would only cause trouble on the macOS tests.
After some digging, only 1 of the 6 tests on HSICEstimator, pyinstallcheck_HSICEstimatorGlobalSensitivity_std, actually passes on macOS (GitHub). However, it takes 88 seconds. The exact same Python test takes roughly 2 seconds on Windows (GitHub) and roughly 1 second on Linux (CircleCI), so there clearly is some issue there. Could it be related to a different handling of the TBB/OpenBLAS conflicts?
@sofianehaddad, maybe you have some insight regarding parallelization on macOS?
Cheers
So not a deadlock but only a timeout? It might be related to the way TBBContext disables OpenBLAS threading in a nested TBB parallel region, since the new loops wrap CovarianceModel::discretize, which is already parallelized with TBB. Maybe we should spare the innermost OpenBLAS calls?
> so not a deadlock but only a timeout ?
Yes, I believe it is an actual timeout due to a parallelization malfunction.
@sofianehaddad @jschueller @regislebrun @mbaudin47 If this works on all platforms except macOS, can it be merged? Is there a way to deactivate this on macOS specifically?
This is likely not macOS-specific but rather due to the OpenBLAS code (we'll see this on Linux too); see my previous comment.
Agreed with @jschueller, as the macOS build uses OpenBLAS in the GitHub Actions CI. In your parallelization process, you embed 2 covariance models, and the discretize method of each is already parallel!
Could you suggest a way of fixing this? Thanks.
I'd try to disable OpenBLAS threading at the global level in the C++ test main function, and comment out the TBBContext instantiation in the TBB for loop, to see whether the timeout comes from that.
```
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ctest ...
```
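The same can be done from the Python side, with one caveat worth keeping in mind: most BLAS implementations read these variables only once, at library load time, so they must be set before the numerical library is imported. A minimal sketch:

```python
import os

# Must happen *before* the BLAS/LAPACK-backed library is loaded,
# otherwise most implementations silently ignore the values.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

# ...then import the numerical library, e.g.:
# import openturns as ot
```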