pymatch

Match optimization

Open tammymendt opened this issue 5 years ago • 3 comments

Additions:

  • Optimised the runtime of the match function: instead of sorting on every iteration over test_scores, the ctrl_scores variable is now sorted once, which speeds up the search for values close to the score of the current test_scores iteration.
  • Added an optional with_replacement argument to the match function, making it possible to match without replacement.
  • The threshold argument of the match function is now also used by the 'min' or "nearest_neighbours" method, to discard matches that are not similar enough. This plays the role of a "caliper".
  • Added a find_nearest_n function to encapsulate the logic for searching for nearest neighbours, along with a test for this function.
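
To make the points above concrete, here is a minimal sketch of the idea, not the code in this PR. The names find_nearest_n_sketch and match_sketch, their signatures, and the example scores are all assumptions made for illustration. It sorts the control scores once, uses a binary search to locate the neighbourhood of each test score, applies the threshold as a caliper, and optionally removes used controls when matching without replacement.

```python
import bisect


def find_nearest_n_sketch(sorted_scores, target, n=1, threshold=None):
    """Return indices into sorted_scores of up to n values closest to target.

    Hypothetical helper: sorted_scores must be sorted ascending. If threshold
    is given, candidates farther away than threshold are discarded (caliper).
    """
    pos = bisect.bisect_left(sorted_scores, target)
    lo, hi = pos - 1, pos  # nearest candidates on each side of target
    picked = []
    while len(picked) < n and (lo >= 0 or hi < len(sorted_scores)):
        d_lo = target - sorted_scores[lo] if lo >= 0 else float("inf")
        d_hi = sorted_scores[hi] - target if hi < len(sorted_scores) else float("inf")
        if d_lo <= d_hi:
            idx, dist = lo, d_lo
            lo -= 1
        else:
            idx, dist = hi, d_hi
            hi += 1
        if threshold is not None and dist > threshold:
            break  # the closest remaining candidate is already too far away
        picked.append(idx)
    return picked


def match_sketch(test_scores, ctrl_scores, n=1, threshold=None, with_replacement=True):
    """Map each test index to the list of matched control scores."""
    pool = sorted(ctrl_scores)  # sorted once, not once per test score
    matches = {}
    for i, score in enumerate(test_scores):
        idxs = find_nearest_n_sketch(pool, score, n, threshold)
        matches[i] = [pool[j] for j in idxs]
        if not with_replacement:
            # Delete from the highest index down so earlier indices stay valid.
            for j in sorted(idxs, reverse=True):
                del pool[j]
    return matches


# Made-up scores, for illustration only.
print(match_sketch([0.30, 0.52], [0.50, 0.10, 0.31, 0.90]))
print(match_sketch([0.30, 0.52], [0.50, 0.10, 0.31, 0.90], threshold=0.015))
print(match_sketch([0.30, 0.30], [0.31, 0.50], with_replacement=False))
```

Deleting from a Python list is linear in its length, so a real implementation would probably track used controls differently. The sketch only aims to show how the single sort, the caliper, and the with_replacement flag interact.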

tammymendt • Apr 05 '19 09:04

Hey @benmiroglio, could you please have a look at this PR?

tammymendt • May 07 '20 07:05

Somehow this broke the matching when there are no exact matches.

from pymatch.Matcher import Matcher
import pandas as pd

cases_ages = [23, 21, 26, 25, 23, 44, 24, 22, 46, 26]
controls_ages = [34, 30, 24, 25, 25, 27, 30, 33, 53, 27, 26, 28, 23, 23, 28, 23, 24, 22, 23, 25]
cases_group = [1 for _ in range(len(cases_ages))]
controls_group = [0 for _ in range(len(controls_ages))]

df_cases = pd.DataFrame(list(zip(cases_ages, cases_group)), columns=['age', 'group'])
df_controls = pd.DataFrame(list(zip(controls_ages, controls_group)), columns=['age', 'group'])

m = Matcher(df_cases, df_controls, yvar='group')
m.fit_scores(balance=True, nmodels=100)
m.match(method='min', nmatches=1, with_replacement=False)
print(m.matched_data)

Only exact matches are printed now, which should not be the case.

skjerns • May 17 '20 10:05

@skjerns thanks for the feedback. This is caused by the threshold argument, which defaults to 0.001. If you set it to a larger value, you will get matches that are not exact. For example, m.match(method='min', nmatches=1, with_replacement=False, threshold=0.1) should return non-exact matches.

However, you are right that the behavior should not break: by default, when using the 'min' matching method, the threshold should be None. The only way I can think of to do this is something like https://stackoverflow.com/questions/14749328/how-to-check-whether-optional-function-parameter-is-set/58166804#58166804, that is, checking whether the threshold was set explicitly and, if it was not, passing None as the threshold to the find_nearest_n function. It's not very elegant, though. I would rather break the old behavior and bump the version, so that users know the old behavior has changed.
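
For reference, one common way to tell "not passed" apart from an explicit value is a sentinel default. The sketch below is only an illustration of that pattern. resolve_threshold, _UNSET and DEFAULT_THRESHOLD are hypothetical names, not part of pymatch, and the rule that every method other than 'min' keeps the 0.001 default is an assumption.

```python
_UNSET = object()  # unique sentinel: distinguishes "not passed" from an explicit None

DEFAULT_THRESHOLD = 0.001  # the default value mentioned in this thread


def resolve_threshold(method, threshold=_UNSET):
    """Return the caliper to use, or None for "no caliper" (hypothetical helper)."""
    if threshold is not _UNSET:
        # The caller chose a value (possibly None), so respect it.
        return threshold
    if method == "min":
        # Old behaviour: plain nearest neighbour, no caliper.
        return None
    # Assumption: any other method keeps the existing default.
    return DEFAULT_THRESHOLD


print(resolve_threshold("min"))       # threshold not passed
print(resolve_threshold("min", 0.1))  # threshold passed explicitly
```

The value returned here could then be handed to find_nearest_n. The cost is an extra module-level object and a default that no longer shows up as 0.001 in the function signature, which is part of why this feels inelegant.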

tammymendt • May 18 '20 15:05