optmatch icon indicating copy to clipboard operation
optmatch copied to clipboard

subdim() fails when given an empty BlockedInfinitySparseMatrix

Open benthestatistician opened this issue 7 years ago • 13 comments

Current problem: optmatch::subdim() fails w/ non-informative error when presented with a BISM having no eligible matches in any subclass. (See example in comments.) This in turn breaks fullmatch.matrix(), at line out.mean.controls <- mapply(<...>).

After some discussion, we've converged on a solution that changes subdim()'s signature in a way that makes it more friendly to certain usage scenarios; remains to implement this change.

Original issue statement: It can happen that findSubproblems filters out a subproblem, due it all of its distances being Inf, whereas subdim sees that subproblem as legitimate, due to there being positive numbers of treatment and control subjects in it. This throws a warning inside of fullmatch.matrix, here:

if (any(mean.controls > lapply(subdim(x), function(x) x[2]/x[1]), na.rm=TRUE))

benthestatistician avatar Jul 07 '17 17:07 benthestatistician

The warning in itself is easy to fix, going to something like

if (any(mean.controls > lapply(problems, function(p) {x <- subdim(p) ;  x[2]/x[1]}), na.rm=TRUE))

I.e., make sure subdim doesn't see the subproblems that subproblems has already deemed ignorable.

But there's another call to subdim a few lines following:

  if (any(!is.na(mean.controls) & is.na(omit.fraction))) {
    user.input.mean.controls <- TRUE
    omit.fraction <- 1 - mapply(function(x,y) x*y[1]/y[2], mean.controls, subdim(x))
  }

Should it get touched up in the same way?

benthestatistician avatar Jul 07 '17 18:07 benthestatistician

Any thoughts, @markmfredrickson @josherrickson ? (Wasn't so easy for me to track down which of you had written the relevant code; it was certainly a while back.)

benthestatistician avatar Jul 07 '17 18:07 benthestatistician

Started a topic branch, issue129, for further work on this. Good chance that what's there now can be pulled in more or less immediately, but there's an indirect relationship to another point I'd like to hear from Mark on (see comment to [master 403b9278]), so I think I'll post a merge request to him too.

benthestatistician avatar Jul 07 '17 19:07 benthestatistician

I've merged the commit and added the requested comment. Ready to close?

markmfredrickson avatar Jul 11 '17 14:07 markmfredrickson

I should add that all tests pass.

markmfredrickson avatar Jul 11 '17 14:07 markmfredrickson

Thanks, @markmfredrickson. Closing.

benthestatistician avatar Jul 11 '17 14:07 benthestatistician

Would anyone object to my changing the signature of subdim so that instead of spitting out a list, of length 2 integer vectors corresponding to subproblems, it were to spit out a data frame with 2 named columns and with rows corresponding to subproblems? (With compensatory adjustments wherever string "subdim" appears in R/ and tests/testthat/, of course.) The change is intended to make the function more friendly to users of the package.

I may implement something like this on the issue129 branch, as well as pulling in other recent changes from elsewhere. In the meantime, [issue129 eba2750] gives and example of the sort of end-user usage I would be hoping to support with this change. @markmfredrickson @josherrickson

benthestatistician avatar Jul 13 '17 21:07 benthestatistician

I have no overall objections. It might make sense to transpose your idea though? So it enables something like this:

> sd <- subdim(ISM)
> sd$prob1
[1] 5 2
> sd$prob2
[1] 8 8

Each column would be the output of dim(matrix). Seems to me that pulling out each individual dim will be more common than pulling out all row/columns across all subproblems. This would also mean your example code would not need modification.

One issue I see with this suggestion is what to do when there's one subproblem. Do you return a 2x1 dataframe? Or a vector? For your variation, you'd return a 1x2 dataframe which more closely mimics the vector returned by dim.

josherrickson avatar Jul 14 '17 11:07 josherrickson

Interesting. Relative to my proposal, Josh's has the advantages that

  1. Subproblem info is neatly extracted w "$", e.g. "sd$prob1" ;
  2. It's a smaller change to the function signature -- essentially returning the same things as before, but typed as data.frame rather than list.

Its disadvantages relative to my proposal are that 3. Info about numbers of controls by subclass, or numbers of treatments by subclass, can't be neatly extracted w/ "$"; 4. in order to calculate a vector of control-to-treatment ratios by subclass, the user needs an sapply() construction, instead of e.g. sd$num_treatments/sd$num_controls.

Because of (4), the usage scenario of my example favors my original proposal. But not by a lot.

Thanks for the comment, @josherrickson .

benthestatistician avatar Jul 14 '17 13:07 benthestatistician

Toying with Josh's idea a little: maybe it could slip into my usage scenario better if given suitable row names? I'm not sure if these are the best ones, but supposing it was c("treatments", "controls"),

subdims_em <- subdim(em)

m2 <- fullmatch(pr ~ t1 + t2, within=em, data=nuclearplants,
                mean.controls=pmin(1.5, subdims_em["controls",] / subdims_em["treatments",])
                 )

As compared to a $ -based syntax, there's still less user convenience and more room for user error than I'd like. But better than the sapply based construction in [issue129 eba2750], I think.

benthestatistician avatar Jul 14 '17 14:07 benthestatistician

The version of subdim() that we implemented can't handle BISMs with no eligible matches in any subclass.

 Z <- rep(c(0,1), 4)
  B <- rep(c("a", "b"), each=4)
  x <- c((1L:4L)/10, (1L:4L) *10)
  m <- exactMatch(Z ~ B)
  m <- match_on(Z ~ x, within=m, method="euclidean")
 m2 <- caliper(m, width=0.05)
 subdim(m2)
## Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length

the problem arises b/c this data frame has no columns, there being no valid nontrivial subproblems.

This wouldn't have come about had subdim() returned the transpose of what it now gives back, i.e. a data frame with a row for each valid subproblem. Then subdim() could simply have returned a data frame with no rows. I had suggested this above in the comment thread, then instead transposed at suggestion of @josherrickson . Now that I'm seeing this issue, I propose to revert to the form that I had proposed. In the meantime, feel free to speak up w/ any reservations or comments.

benthestatistician avatar Sep 15 '19 03:09 benthestatistician

@josherrickson any comments on my proposal just above? (It goes somewhat against a suggestion you made further up in the same comment thread, a couple years back.)

benthestatistician avatar Oct 03 '19 14:10 benthestatistician

I think the transpose makes sense.

josherrickson avatar Oct 03 '19 16:10 josherrickson