p-value for estimator vs p-value for refutation of estimator with bootstrap
Hello,
Can you explain the difference between the two tests? They look very similar to me, and they give me the same results when I run the functions locally with the estimate_value hardcoded (in the causal estimator's case, I of course only run the code from # Processing the null hypothesis estimates onwards).
https://py-why.github.io/dowhy/v0.8/_modules/dowhy/causal_estimator.html#CausalEstimator
```python
def _test_significance_with_bootstrap(self, estimate_value, num_null_simulations=None):
    """Test statistical significance of an estimate using the bootstrap method.

    :param estimate_value: Obtained estimate's value
    :param num_null_simulations: Number of simulations for the null hypothesis
    :returns: p-value of the statistical significance test.
    """
    # Use existing params, if new user defined params are not present
    if num_null_simulations is None:
        num_null_simulations = self.num_null_simulations
    do_retest = self._bootstrap_null_estimates is None or CausalEstimator.is_bootstrap_parameter_changed(
        self._bootstrap_null_estimates.params, locals())
    if do_retest:
        null_estimates = np.zeros(num_null_simulations)
        for i in range(num_null_simulations):
            new_outcome = np.random.permutation(self._outcome)
            new_data = self._data.assign(dummy_outcome=new_outcome)
            # self._outcome = self._data["dummy_outcome"]
            new_estimator = type(self)(
                new_data,
                self._target_estimand,
                self._target_estimand.treatment_variable,
                ("dummy_outcome",),
                test_significance=False,
                evaluate_effect_strength=False,
                confidence_intervals=False,
                target_units=self._target_units,
                effect_modifiers=self._effect_modifier_names,
                **self.method_params
            )
            new_effect = new_estimator.estimate_effect()
            null_estimates[i] = new_effect.value
        self._bootstrap_null_estimates = CausalEstimator.BootstrapEstimates(
            null_estimates,
            {'num_null_simulations': num_null_simulations, 'sample_size_fraction': 1})

    # Processing the null hypothesis estimates
    sorted_null_estimates = np.sort(self._bootstrap_null_estimates.estimates)
    self.logger.debug("Null estimates: {0}".format(sorted_null_estimates))
    median_estimate = sorted_null_estimates[int(num_null_simulations / 2)]
    # Doing a two-sided test
    if estimate_value > median_estimate:
        # Being conservative with the p-value reported
        estimate_index = np.searchsorted(sorted_null_estimates, estimate_value, side="left")
        p_value = 1 - (estimate_index / num_null_simulations)
    if estimate_value <= median_estimate:
        # Being conservative with the p-value reported
        estimate_index = np.searchsorted(sorted_null_estimates, estimate_value, side="right")
        p_value = (estimate_index / num_null_simulations)
    # If the estimate_index is 0, it depends on the number of simulations
    if p_value == 0:
        p_value = (0, 1 / len(sorted_null_estimates))  # a tuple determining the range.
    elif p_value == 1:
        p_value = (1 - 1 / len(sorted_null_estimates), 1)
    signif_dict = {
        'p_value': p_value
    }
    return signif_dict
```
https://py-why.github.io/dowhy/v0.8/_modules/dowhy/causal_refuter.html#CausalRefuter.test_significance
```python
def perform_bootstrap_test(self, estimate, simulations):
    # Get the number of simulations
    num_simulations = len(simulations)
    # Sort the simulations
    simulations.sort()
    # Obtain the median value
    median_refute_values = simulations[int(num_simulations / 2)]
    # Performing a two sided test
    if estimate.value > median_refute_values:
        # np.searchsorted tells us the index if it were a part of the array
        # We select side to be left as we want to find the first value that matches
        estimate_index = np.searchsorted(simulations, estimate.value, side="left")
        # We subtract from 1 as we are finding the value from the right tail
        p_value = 1 - (estimate_index / num_simulations)
    else:
        # We take the side to be right as we want to find the last index that matches
        estimate_index = np.searchsorted(simulations, estimate.value, side="right")
        # We get the probability with respect to the left tail.
        p_value = estimate_index / num_simulations
    # return twice the determined quantile as this is a two sided test
    return 2 * p_value
```
They are the same :) Sometimes it is useful to differentiate them, though. For example, some estimators (e.g., the linear regression estimator) may use parametric confidence intervals, which is fast, and then we may want to refute the analysis using a bootstrap p-value, which makes no parametric assumptions.
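To make that concrete, here is a minimal standalone sketch (made-up data and hypothetical variable names, not the DoWhy API): the parametric p-value comes from an ordinary OLS t-test on the treatment coefficient, while the non-parametric one is built from permuted outcomes using the same median-split tail logic as the snippets above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Made-up data: treatment t with a true effect of 0.3 on outcome y.
n = 500
t = rng.normal(size=n)
y = 0.3 * t + rng.normal(size=n)
X = sm.add_constant(t)

# Parametric route: the usual t-test p-value on the treatment coefficient,
# which a linear regression estimator can report essentially for free.
ols_fit = sm.OLS(y, X).fit()
estimate = ols_fit.params[1]
parametric_p = ols_fit.pvalues[1]

# Non-parametric route: build a null distribution by re-estimating the
# effect against permuted outcomes, as in the quoted snippets.
num_simulations = 1000
null_effects = np.sort([
    sm.OLS(rng.permutation(y), X).fit().params[1]
    for _ in range(num_simulations)
])

# Same median-split, two-sided tail logic as perform_bootstrap_test.
if estimate > null_effects[num_simulations // 2]:
    idx = np.searchsorted(null_effects, estimate, side="left")
    tail = 1 - idx / num_simulations
else:
    idx = np.searchsorted(null_effects, estimate, side="right")
    tail = idx / num_simulations
bootstrap_p = 2 * tail

print("parametric p-value:", parametric_p)
print("permutation/bootstrap p-value:", bootstrap_p)
```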
But yeah, if you are already using the bootstrap method for testing significance, then the refutation by bootstrap is redundant.
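And to see why it is redundant: both quoted functions perform the same quantile computation on the sorted null estimates; the refuter only adds the factor of two for its two-sided correction. A minimal sketch with a made-up null distribution and a hardcoded estimate_value (again not the DoWhy API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up null distribution standing in for
# self._bootstrap_null_estimates.estimates / `simulations`.
null_estimates = np.sort(rng.normal(size=1000))
estimate_value = 1.8  # hardcoded estimate, as in the question
num_simulations = len(null_estimates)

# Shared core of both functions: where does the estimate fall in the
# sorted null distribution, relative to its median?
if estimate_value > null_estimates[num_simulations // 2]:
    idx = np.searchsorted(null_estimates, estimate_value, side="left")
    tail_prob = 1 - idx / num_simulations   # right tail
else:
    idx = np.searchsorted(null_estimates, estimate_value, side="right")
    tail_prob = idx / num_simulations       # left tail

print("estimator-style p-value:", tail_prob)      # reported directly (with the 0/1 range handling)
print("refuter-style p-value:  ", 2 * tail_prob)  # doubled for the two-sided test
```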
Closing this as the question seems to be answered. Please re-open if not. Thanks.