openturns
openturns copied to clipboard
HSIC : p-values permutation factorization
This PR aims at factorizing part of the computations necessary to estimate the permutation-based HSIC p-values
recompile debian package: https://wiki.debian.org/BuildingTutorial#Get_the_source_package
Below are some useful links :
I found many more or less important ways to speed-up things with respect to your code. I started to implement it as a complement of your branch, but as I am really dumb with git I lost many things in the way, so I restarted a branch in my own repository... Here are the tricks, in increasing order of impact on the p-value computation by permutations:
- use std::accumulate to sum all the elements of a Matrix instead of a loop
- don't divide matrices twice by scalars, but divide once by their product
- don't store all the permuted versions of objects before to aggregate them, but create them and aggregate them on the fly
- don't discretize a permuted sample, but permute the discretization of the initial sample
- reorder the loops with the outer loop being the loop over the permutations, then parallelize the outer loop
This way, you transform the 18.3s to get the p-values by permutations (master) into a 2.3s (my branch) with 8 cores. I used this script for the benchmark:
import openturns as ot
from openturns.usecases.wingweight_function import WingWeightModel
from time import time
NREPS = 10
# To test the effect of TBB
#ot.TBB.Disable()
ot.Log.Show(ot.Log.NONE)
m = WingWeightModel()
# %%
# HSIC indices
# ------------
# %%
# We then estimate the HSIC indices using a data-driven approach.
sizeHSIC = 500
inputDesignHSIC = m.distributionX.getSample(sizeHSIC)
outputDesignHSIC = m.model(inputDesignHSIC)
covarianceModelCollection = []
# %%
for i in range(m.dim):
Xi = inputDesignHSIC.getMarginal(i)
inputCovariance = ot.SquaredExponential(1)
inputCovariance.setScale(Xi.computeStandardDeviation())
covarianceModelCollection.append(inputCovariance)
# %%
# We define a covariance kernel associated to the output variable.
outputCovariance = ot.SquaredExponential(1)
outputCovariance.setScale(outputDesignHSIC.computeStandardDeviation())
covarianceModelCollection.append(outputCovariance)
# %%
# In this paragraph, we perform the analysis on the raw data: that is
# the global HSIC estimator.
estimatorType = ot.HSICUStat()
# %%
# We now build the HSIC estimator:
t00 = time()
globHSIC = ot.HSICEstimatorGlobalSensitivity(
covarianceModelCollection, inputDesignHSIC, outputDesignHSIC, estimatorType
)
# %%
# We get the R2-HSIC indices:
R2HSICIndices = globHSIC.getR2HSICIndices()
print("\n Global HSIC analysis")
print("R2-HSIC Indices: ", R2HSICIndices)
# %%
# and the HSIC indices:
HSICIndices = globHSIC.getHSICIndices()
print("HSIC Indices: ", HSICIndices)
# %%
# The p-value by permutation, sequential
t0 = time()
for rep in range(NREPS):
ot.RandomGenerator.SetSeed(0)
globHSIC.setPermutationSize(globHSIC.getPermutationSize())
pvperm = globHSIC.getPValuesPermutation()
t1 = time()
print("p-value (permutation, seq):", pvperm)
print("t (perm)=", (t1 - t0) / NREPS, "s")
# %%
# The p-value by permutation, parallel
try:
t0 = time()
for rep in range(NREPS):
ot.RandomGenerator.SetSeed(0)
globHSIC.setPermutationSize(globHSIC.getPermutationSize())
pvperm = globHSIC.getPValuesPermutationParallel()
t1 = time()
print("p-value (permutation, par):", pvperm)
print("t (perm)=", (t1 - t0) / NREPS, "s")
except:
pass
# %%
# We have an asymptotic estimate of the value for this estimator.
pvas = globHSIC.getPValuesAsymptotic()
print("p-value (asymptotic): ", pvas)
t1 = time()
print("t (total)=", t1 - t00, "s")
Many other optimizations could be done, such as:
- Provide specialized algorithms for scalar matrix weights (HSICEstimatorGlobalSensitivity, HSICEstimatorTargetSensitivity), diagonal matrix weights (HSICEstimatorConditionalSensitivity) and symmetrical matrix weights (are there non symmetrical weights on the Earth?)
- Reorder some linear algebra
- Other things I will find using flamegraphs on the new implementation