p-value for estimator vs p-value for refutation of estimator with bootstrap

Hello,

Can you explain the difference between the two tests? They look very similar to me, and they give me the same results when I run the functions locally with a hardcoded estimate_value (in the causal estimator's case I of course only run the code from # Processing the null hypothesis estimates onwards).

https://py-why.github.io/dowhy/v0.8/_modules/dowhy/causal_estimator.html#CausalEstimator

    def _test_significance_with_bootstrap(self, estimate_value, num_null_simulations=None):
        """ Test statistical significance of an estimate using the bootstrap method.

        :param estimate_value: Obtained estimate's value
        :param num_null_simulations: Number of simulations for the null hypothesis
        :returns: p-value of the statistical significance test.
        """
        # Use existing params, if new user defined params are not present
        if num_null_simulations is None:
            num_null_simulations = self.num_null_simulations
        do_retest = self._bootstrap_null_estimates is None or CausalEstimator.is_bootstrap_parameter_changed(
            self._bootstrap_null_estimates.params, locals())
        if do_retest:
            null_estimates = np.zeros(num_null_simulations)
            for i in range(num_null_simulations):
                new_outcome = np.random.permutation(self._outcome)
                new_data = self._data.assign(dummy_outcome=new_outcome)
                # self._outcome = self._data["dummy_outcome"]
                new_estimator = type(self)(
                    new_data,
                    self._target_estimand,
                    self._target_estimand.treatment_variable,
                    ("dummy_outcome",),
                    test_significance=False,
                    evaluate_effect_strength=False,
                    confidence_intervals=False,
                    target_units=self._target_units,
                    effect_modifiers=self._effect_modifier_names,
                    **self.method_params
                )
                new_effect = new_estimator.estimate_effect()
                null_estimates[i] = new_effect.value
            self._bootstrap_null_estimates = CausalEstimator.BootstrapEstimates(
                null_estimates,
                {'num_null_simulations': num_null_simulations, 'sample_size_fraction': 1})

        # Processing the null hypothesis estimates
        sorted_null_estimates = np.sort(self._bootstrap_null_estimates.estimates)
        self.logger.debug("Null estimates: {0}".format(sorted_null_estimates))
        median_estimate = sorted_null_estimates[int(num_null_simulations / 2)]
        # Doing a two-sided test
        if estimate_value > median_estimate:
            # Being conservative with the p-value reported
            estimate_index = np.searchsorted(sorted_null_estimates, estimate_value, side="left")
            p_value = 1 - (estimate_index / num_null_simulations)
        if estimate_value <= median_estimate:
            # Being conservative with the p-value reported
            estimate_index = np.searchsorted(sorted_null_estimates, estimate_value, side="right")
            p_value = (estimate_index / num_null_simulations)
        # If the estimate_index is 0, it depends on the number of simulations
        if p_value == 0:
            p_value = (0, 1 / len(sorted_null_estimates))  # a tuple determining the range.
        elif p_value == 1:
            p_value = (1 - 1 / len(sorted_null_estimates), 1)
        signif_dict = {
            'p_value': p_value
        }
        return signif_dict
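
For reference, here is a minimal, self-contained sketch of the p-value logic above, with the permutation loop replaced by toy draws standing in for the null estimates (the rng seed, the normal draws, and estimate_value = 2.5 are illustrative stand-ins, not DoWhy code):

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-in for the permutation-based null distribution of the estimate;
    # in DoWhy this comes from re-estimating the effect on permuted outcomes.
    null_estimates = rng.normal(loc=0.0, scale=1.0, size=1000)
    estimate_value = 2.5  # hardcoded estimate, as in the question

    sorted_null = np.sort(null_estimates)
    n = len(sorted_null)
    median_null = sorted_null[n // 2]

    # Same conservative tail-quantile logic as above
    if estimate_value > median_null:
        idx = np.searchsorted(sorted_null, estimate_value, side="left")
        p_value = 1 - idx / n  # right-tail quantile
    else:
        idx = np.searchsorted(sorted_null, estimate_value, side="right")
        p_value = idx / n  # left-tail quantile

    # Extreme values are reported as a range rather than exactly 0 or 1
    if p_value == 0:
        p_value = (0, 1 / n)
    elif p_value == 1:
        p_value = (1 - 1 / n, 1)

    print(p_value)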

https://py-why.github.io/dowhy/v0.8/_modules/dowhy/causal_refuter.html#CausalRefuter.test_significance

    def perform_bootstrap_test(self, estimate, simulations):

        # Get the number of simulations
        num_simulations = len(simulations)
        # Sort the simulations
        simulations.sort()
        # Obtain the median value
        median_refute_values = simulations[int(num_simulations/2)]

        # Performing a two sided test
        if estimate.value > median_refute_values:
            # np.searchsorted tells us the index if it were a part of the array
            # We select side to be left as we want to find the first value that matches
            estimate_index = np.searchsorted(simulations, estimate.value, side="left")
            # We subtract from 1 as we are looking at the right tail
            p_value = 1 - (estimate_index / num_simulations)
        else:
            # We take the side to be right as we want to find the last index that matches
            estimate_index = np.searchsorted(simulations, estimate.value, side="right")
            # We get the probability with respect to the left tail.
            p_value = estimate_index / num_simulations
        # return twice the determined quantile as this is a two sided test
        return 2*p_value
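
Given the same sorted null estimates, the two computations differ only in the edge-case handling and in the final factor of 2 that the refuter applies for the two-sided test. A minimal sketch comparing them side by side (again with toy stand-in values; the rng seed and estimate_value = 2.5 are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy null distribution standing in for the simulated estimates
    simulations = np.sort(rng.normal(size=1000))
    num_simulations = len(simulations)
    estimate_value = 2.5
    median_simulations = simulations[num_simulations // 2]

    # Shared tail quantile, as computed by both functions
    if estimate_value > median_simulations:
        idx = np.searchsorted(simulations, estimate_value, side="left")
        tail_quantile = 1 - idx / num_simulations
    else:
        idx = np.searchsorted(simulations, estimate_value, side="right")
        tail_quantile = idx / num_simulations

    print("estimator-style p-value:", tail_quantile)      # reported directly (with 0/1 range handling)
    print("refuter-style p-value:  ", 2 * tail_quantile)  # doubled for the two-sided test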

itsoum, Aug 11 '22 21:08

They are the same :) Sometimes it is useful to differentiate them, though. For example, some estimators (e.g., the linear regression estimator) may use parametric confidence intervals, which are fast to compute, and then we may want to refute the analysis using a bootstrap p-value, which makes no parametric assumptions.

But yeah, if you are already using the bootstrap method for testing significance, then the refutation by bootstrap is redundant.
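
For concreteness, a minimal sketch of that workflow with the v0.8 high-level API, assuming the synthetic dowhy.datasets.linear_dataset helper and the bootstrap_refuter method name from the DoWhy tutorials (all parameter values below are illustrative):

    import dowhy.datasets
    from dowhy import CausalModel

    # Synthetic data from the standard DoWhy helper (illustrative sizes)
    data = dowhy.datasets.linear_dataset(
        beta=10, num_common_causes=3, num_samples=1000, treatment_is_binary=True
    )

    model = CausalModel(
        data=data["df"],
        treatment=data["treatment_name"],
        outcome=data["outcome_name"],
        graph=data["gml_graph"],
    )
    identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)

    # Linear regression estimator; its significance test and confidence
    # intervals can be computed parametrically, which is fast
    estimate = model.estimate_effect(
        identified_estimand,
        method_name="backdoor.linear_regression",
        test_significance=True,
    )
    print(estimate)

    # Non-parametric check of the same estimate via the bootstrap refuter
    refutation = model.refute_estimate(
        identified_estimand, estimate, method_name="bootstrap_refuter"
    )
    print(refutation)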

amit-sharma, Aug 29 '22 04:08

Closing this as the question seems to be answered. Please re-open if not. Thanks.

petergtz, Oct 14 '22 13:10