pymatch
Match optimization
Additions:
- Optimised the runtime of the match function to avoid sorting on every iteration over test_scores. Instead, the ctrl_scores variable is sorted once, which speeds up the search for values close to the test_scores value of the current iteration.
- Added an optional with_replacement argument to the match function to allow matching without replacement.
- The threshold argument of the match function is now also used by the 'min' or "nearest_neighbours" method to discard matches that are not similar enough. This plays the role of a "caliper".
- Added a find_nearest_n function to encapsulate the logic for searching for nearest neighbours, together with a test for this function.
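As a rough illustration of the ideas in the list above (sort the control scores once, search outward from the insertion point, apply the threshold as a caliper, optionally match without replacement), here is a minimal sketch. It is not the code from this PR: the names nearest_n_sorted and greedy_match, their signatures, and the example scores are all hypothetical, and the actual find_nearest_n may work differently.

```python
from bisect import bisect_left


def nearest_n_sorted(sorted_scores, value, n=1, threshold=None):
    """Return indices into sorted_scores of up to n values closest to value.

    sorted_scores must be in ascending order. If threshold is given,
    candidates farther than threshold from value are discarded (a caliper).
    Hypothetical helper, not the PR's find_nearest_n.
    """
    pos = bisect_left(sorted_scores, value)
    lo, hi = pos - 1, pos  # nearest candidate below value / at or above value
    picked = []
    while len(picked) < n and (lo >= 0 or hi < len(sorted_scores)):
        d_lo = value - sorted_scores[lo] if lo >= 0 else float("inf")
        d_hi = sorted_scores[hi] - value if hi < len(sorted_scores) else float("inf")
        if d_lo <= d_hi:
            idx, dist, lo = lo, d_lo, lo - 1
        else:
            idx, dist, hi = hi, d_hi, hi + 1
        if threshold is not None and dist > threshold:
            break  # all remaining candidates are at least this far away
        picked.append(idx)
    return picked


def greedy_match(test_scores, ctrl_scores, n=1, threshold=None, with_replacement=True):
    """Map each test index to a list of matched control indices (original positions)."""
    # Sort the controls once, remembering where each value came from.
    order = sorted(range(len(ctrl_scores)), key=ctrl_scores.__getitem__)
    values = [ctrl_scores[i] for i in order]
    matches = {}
    for t_idx, score in enumerate(test_scores):
        found = nearest_n_sorted(values, score, n, threshold)
        matches[t_idx] = [order[i] for i in found]
        if not with_replacement:
            # Delete from the highest index down so the lower indices stay valid.
            for i in sorted(found, reverse=True):
                del values[i]
                del order[i]
    return matches


# Made-up scores, for illustration only.
ctrl = [0.40, 0.20, 0.25, 0.10]
test = [0.22, 0.21]
print(greedy_match(test, ctrl))                          # both cases may share one control
print(greedy_match(test, ctrl, with_replacement=False))  # each control is used at most once
print(greedy_match(test, ctrl, threshold=0.001))         # a tight caliper rejects inexact matches
```

Note that matching without replacement in this greedy form depends on the order in which the test scores are visited, and deleting from a Python list is linear in its length; both are simplifications chosen to keep the sketch short.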
Hey @benmiroglio, could you please have a look at this PR?
Somehow this broke the matching when there are no exact matches.
from pymatch.Matcher import Matcher
import pandas as pd

cases_ages = [23, 21, 26, 25, 23, 44, 24, 22, 46, 26]
controls_ages = [34, 30, 24, 25, 25, 27, 30, 33, 53, 27, 26, 28, 23, 23, 28, 23, 24, 22, 23, 25]
cases_group = [1 for _ in range(len(cases_ages))]
# One label per control; sizing this list by len(cases_ages) would make zip() silently drop half of the controls.
controls_group = [0 for _ in range(len(controls_ages))]
df_cases = pd.DataFrame(list(zip(cases_ages, cases_group)), columns=['age', 'group'])
df_controls = pd.DataFrame(list(zip(controls_ages, controls_group)), columns=['age', 'group'])

# Fit the propensity score models, then match each case to its nearest control.
m = Matcher(df_cases, df_controls, yvar='group')
m.fit_scores(balance=True, nmodels=100)
m.match(method='min', nmatches=1, with_replacement=False)
print(m.matched_data)
Only exact matches are printed now, which should not be the case.
@skjerns thanks for the feedback. This is caused by the threshold argument, which defaults to 0.001. If you set it to a larger value, you will get matches that are not exact. For example, try
m.match(method='min', nmatches=1, with_replacement=False, threshold=0.1)
and you will then get non-exact matches.
However, you are right that the existing behavior should not break (that is, when using the 'min' matching method, the threshold should default to None). The only way I can think of doing this is something like https://stackoverflow.com/questions/14749328/how-to-check-whether-optional-function-parameter-is-set/58166804#58166804: check whether the threshold was set explicitly and, if it was not, pass None as the threshold to the find_nearest_n function. It's not very elegant, though. I would rather break the old behavior and bump the version so that users know the behavior has changed.
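For reference, one common way to tell "threshold was not passed" apart from "threshold was passed as None" is a module-level sentinel object. The sketch below shows only that pattern; resolve_threshold, _UNSET and DEFAULT_THRESHOLD are hypothetical names that do not exist in pymatch, and the approach in the linked answer may differ.

```python
_UNSET = object()  # sentinel: distinguishes "not passed" from an explicit None

DEFAULT_THRESHOLD = 0.001  # the default value quoted in this thread


def resolve_threshold(method, threshold=_UNSET):
    """Return the threshold that should actually be applied for this method.

    Hypothetical helper: an explicit value (including None) always wins;
    otherwise 'min' gets no caliper and any other method keeps the default.
    """
    if threshold is not _UNSET:
        return threshold  # the caller chose a value, so respect it
    if method == "min":
        return None  # no caliper unless one was asked for
    return DEFAULT_THRESHOLD


print(resolve_threshold("min"))       # None: old 'min' behavior preserved
print(resolve_threshold("min", 0.1))  # 0.1: explicit caliper is applied
print(resolve_threshold("random"))    # 0.001: default kept for other methods
```

The trade-off is the one raised above: the default no longer appears in the function signature, so it has to be documented separately.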