
limit unexpected changes in Mahal. covariances

Open josherrickson opened this issue 5 years ago • 2 comments

This is a bit confusing, so bear with me. I came across this discrepancy:

> fullmatch(pr ~ cost, data = nuclearplants, 
            within = exactMatch(pr ~ pt, data = nuclearplants))[27:32]
  d   e   f   a   b   c 
1.2 1.3 1.1 1.1 1.2 1.3 
> fullmatch(pr ~ cost, data = nuclearplants[nuclearplants$pt == 1,])
  d   e   f   a   b   c 
1.2 1.1 1.3 1.1 1.2 1.3 

So in the first case I've matched within an exactMatch, and in the second case I've just subset the data (on the same variable involved in the exactMatch). Note that e and f have swapped matches.

> em <- match_on(pr ~ cost, data = nuclearplants, 
                 within = exactMatch(pr ~ pt, data = nuclearplants))
> fullmatch(em)[27:32]
  a   b   c   d   e   f 
1.1 1.2 1.3 1.2 1.3 1.1 
> fullmatch(subproblems(em)[[2]])
  a   b   c   d   e   f 
1.1 1.2 1.3 1.2 1.1 1.3 

So the same distance produces different matches depending on whether we consider the problem with its subproblems or manually extract a single subproblem.

I believe this is related to the tolerance.

> fullmatch(em, tol = .001001)[27:32]
  a   b   c   d   e   f 
1.1 1.2 1.3 1.2 1.1 1.3 

Note that in the Euclidean sense, both choices of matches are equivalent:

> match_on(pr ~ cost, data = nuclearplants[nuclearplants$pt == 1,], 
           method = "euclidean")
         control
treatment     d     e     f
        a 72.85  8.12  4.52
        b  9.87 71.10 67.50
        c 63.20 17.77 14.17
> 14.17+8.12
[1] 22.29
> 4.52+17.77
[1] 22.29

I suspect this lies at the intersection of tolerance and Mahalanobis distance calculated with subproblems:

> findSubproblems(em)[[2]]
       control
treated          d          e          f
      a 0.42339969 0.04719294 0.02626996
      b 0.05736383 0.41322880 0.39230582
      c 0.36731449 0.10327814 0.08235516
> match_on(pr ~ cost, data = nuclearplants[nuclearplants$pt == 1,])
         control
treatment         d         e         f
        a 1.8090931 0.2016450 0.1122457
        b 0.2451029 1.7656352 1.6762359
        c 1.5694535 0.4412846 0.3518854
  1. Should a Mahalanobis distance on an exactMatch'd problem compute the distance across all observations, or within each subproblem? I was surprised to see those two distances being different.
  2. Why does a subproblem from a match on subproblems produce different results than a match on that subproblem alone (e.g. fullmatch(em)[27:32] vs fullmatch(subproblems(em)[[2]]))?
  3. It seems surprising to me that the default tolerance is so close to the inflection point for this problem. A tolerance of around 0.001000843 is where e & f switch (i.e. all tolerances above that see one match, all tolerances below it see another; a quick scan for this is sketched after this list). Is this just a coincidence?
  4. Separately, not having messed much with tolerance before, I found the documentation in fullmatch somewhat confusing; should we add a sentence in the Arguments section (as opposed to the already extremely verbose Details) explaining the default tolerance and the impact of raising or lowering it, as well as any limits (e.g. does a tolerance > 1 have some meaning)?
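
Just a rough sketch of scanning for that inflection point (assuming em as constructed above, and that indices 27:32 are the pt == 1 units as in the earlier output):

tols <- 10^seq(-4, -2, length.out = 25)   # grid of tolerances bracketing the default
flips <- sapply(tols, function(tl) {
  paste(fullmatch(em, tol = tl)[27:32], collapse = " ")  # matched-set labels for the pt == 1 units
})
data.frame(tol = tols, match = flips)     # read off where the labels change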

josherrickson avatar Jul 06 '19 00:07 josherrickson

The first example seems to be explained by calculating different covariances when using all samples versus just those in the subgroup. I think this is sensible as a policy, if a little surprising from a user's perspective. The within argument does double duty, both to express subproblems and to handle other kinds of restrictions (e.g., calipers on different variables). I don't know if there is a better way to handle this.
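
A minimal sketch of the covariance point: with a single matching variable, the Mahalanobis distance is just an absolute difference scaled by a standard deviation, so the scale depends on which rows feed the variance estimate. (optmatch's actual estimator may pool within treatment groups; take this as illustrative only.)

data(nuclearplants, package = "optmatch")
sd_all <- sd(nuclearplants$cost)                          # all 32 plants
sd_pt1 <- sd(nuclearplants$cost[nuclearplants$pt == 1])   # only the pt == 1 subgroup
c(all = sd_all, pt1_only = sd_pt1)                        # different scales -> differently scaled distances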

I agree that example two is more surprising. We do some things to tweak parameters over different subproblems. I wonder if that is happening here?

markmfredrickson avatar Jul 08 '19 13:07 markmfredrickson

  1. I agree w/ Mark that the Mahalanobis discrepancies are a function of using 2 different data frames to calculate the relevant covariances. This issue reared its head in #168 also, as the "bigger issue" I noted in this comment.
  2. With any distance, small changes in tolerance can lead to rather different matches. Also, we've seen different matches being selected on different architectures, even from the very same inputs. However, in my experience (e.g. during our recent work on the node prices project), these rather different matches typically had very similar values on the optimization objective. This seems to be the case in Josh's example two (the subproblem from a match on subproblems producing different results than a match on the subproblem alone) as well. Mimicking his comparison of Euclidean distances between the matches, but using the common Mahalanobis distance instead:
> sum( c(bd= 0.05736383, af= 0.02626996, ce= 0.10327814)  )
[1] 0.187
> sum( c(bd= 0.05736383, cf= 0.08235516, ae=0.04719294)  )
[1] 0.187
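
For a more general check, a small helper (just a sketch, not an optmatch function; it assumes the matched sets here are pairs and that the distance object coerces cleanly with as.matrix()):

total_matched_dist <- function(match, dist) {
  dist <- as.matrix(dist)
  sets <- split(names(match), match)                  # units grouped by matched set
  sum(vapply(sets, function(units) {
    sum(dist[intersect(units, rownames(dist)),        # treated rows in this set
             intersect(units, colnames(dist))])       # control columns in this set
  }, numeric(1)))
}
total_matched_dist(fullmatch(em)[27:32], findSubproblems(em)[[2]])
total_matched_dist(fullmatch(subproblems(em)[[2]]), findSubproblems(em)[[2]])

If the coercion behaves as expected, both should come out to the same total (about 0.187), matching the hand sums above.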

While I'm totally open to clarifications or enhancements of this piece of the docs, I don't see those as likely to make this source of confusion go away. (I hope to get more benefit in this regard from #173.) OTOH, this report adds to my sense that we should deal with the issue of the Mahalanobis calculation changing, in a fashion that's very difficult to exert control over, because of seemingly irrelevant changes to the problem. Accordingly I'd like to change the title of this issue; see below.

To my mind a next step in this direction is to give users a way to control how the Mahalanobis covariance is figured. I don't see this as self-explanatory; suggest a ftf among the 3 of us contributing to this thread. In the meantime I'm going to go ahead and change the issue's title from "Inconsistent matching" to "limit unexpected changes in Mahalanobis covariances" or similar; we can discuss this too as needed.

benthestatistician avatar Jul 08 '19 16:07 benthestatistician