dowhy icon indicating copy to clipboard operation
dowhy copied to clipboard

Vectorize operations for propensity score matching

Open rahulbshrestha opened this issue 1 year ago • 2 comments

This PR addresses this issue by introducing vectorized operations instead of the existing for-loops. This should speed up operations for large datasets.

This PR is a work in progress, and the remaining tasks include:

  • [x] Add test cases to verify matching still works properly
  • [x] Further vectorize one of the arrays
  • [x] Cleanup

rahulbshrestha avatar May 12 '24 21:05 rahulbshrestha

Thanks for starting this, @rahulbshrestha . Let us know once the PR is ready for review.

amit-sharma avatar May 13 '24 16:05 amit-sharma

I ran some tests to check if the values of att and atc are the same before and after changes made in this PR:


### PREVIOUS IMPLEMENTATION
        att = 0
        numtreatedunits = treated.shape[0]
        treated_outcomes_old = []
        control_outcomes_old = []

        for i in range(numtreatedunits):

            treated_outcome = treated.iloc[i][self._target_estimand.outcome_variable[0]].item()
            control_outcome = control.iloc[indices[i]][self._target_estimand.outcome_variable[0]].item()
            treated_outcomes_old.append(treated_outcome)
            control_outcomes_old.append(control_outcome)
            att += treated_outcome - control_outcome

        att /= numtreatedunits


        print('Checking values of ATT: ')
        print('ATT (before): ', att)

        outcome_variable = self._target_estimand.outcome_variable[0]
        treated_outcomes = treated[outcome_variable]
        control_outcomes = list(control.iloc[indices.flatten()][outcome_variable])

        att = (treated_outcomes - control_outcomes).mean()

        print('ATT (after): ', att)
        print('Treated outcomes ', treated_outcomes_old == treated_outcomes)
        print('Control outcomes', control_outcomes_old == control_outcomes)

and the results when running on some test data:

Checking values of ATT: 
ATT (before):  10.923190922091228
ATT (after):  10.923190922091242
Treated outcomes  True
Control outcomes True
Checking values of ATC: 
ATC (before):  10.506587873468016
ATC (after):  10.506587873468012
Treated outcomes  True
Control outcomes True

Both lists, treated outcomes and control outcomes are the same before and after the changes I made. The ATT and ATC seems to be off by a couple digits after averaging (check last 3 digits in the example above), which is probably a rounding error. Is this a problem @amit-sharma?

rahulbshrestha avatar May 14 '24 12:05 rahulbshrestha

Hey @amit-sharma! I think this PR is ready to be merged :)

rahulbshrestha avatar May 31 '24 13:05 rahulbshrestha

@all-contributors please add @rahulbshrestha for code.

amit-sharma avatar Jun 04 '24 10:06 amit-sharma

@amit-sharma

I've put up a pull request to add @rahulbshrestha! :tada:

allcontributors[bot] avatar Jun 04 '24 10:06 allcontributors[bot]