optmatch
optmatch copied to clipboard
subdim() fails when given an empty BlockedInfinitySparseMatrix
Current problem: optmatch::subdim()
fails w/ non-informative error when presented with a BISM having no eligible matches in any subclass. (See example in comments.) This in turn breaks fullmatch.matrix()
, at line out.mean.controls <- mapply(<...>)
.
After some discussion, we've converged on a solution that changes subdim()
's signature in a way that makes it more friendly to certain usage scenarios; remains to implement this change.
Original issue statement:
It can happen that findSubproblems
filters out a subproblem, due it all of its distances being Inf, whereas subdim sees that subproblem as legitimate, due to there being positive numbers of treatment and control subjects in it. This throws a warning inside of fullmatch.matrix
, here:
if (any(mean.controls > lapply(subdim(x), function(x) x[2]/x[1]), na.rm=TRUE))
The warning in itself is easy to fix, going to something like
if (any(mean.controls > lapply(problems, function(p) {x <- subdim(p) ; x[2]/x[1]}), na.rm=TRUE))
I.e., make sure subdim
doesn't see the subproblems that subproblems
has already deemed ignorable.
But there's another call to subdim a few lines following:
if (any(!is.na(mean.controls) & is.na(omit.fraction))) {
user.input.mean.controls <- TRUE
omit.fraction <- 1 - mapply(function(x,y) x*y[1]/y[2], mean.controls, subdim(x))
}
Should it get touched up in the same way?
Any thoughts, @markmfredrickson @josherrickson ? (Wasn't so easy for me to track down which of you had written the relevant code; it was certainly a while back.)
Started a topic branch, issue129, for further work on this. Good chance that what's there now can be pulled in more or less immediately, but there's an indirect relationship to another point I'd like to hear from Mark on (see comment to [master 403b9278]), so I think I'll post a merge request to him too.
I've merged the commit and added the requested comment. Ready to close?
I should add that all tests pass.
Thanks, @markmfredrickson. Closing.
Would anyone object to my changing the signature of subdim
so that instead of spitting out a list, of length 2 integer vectors corresponding to subproblems, it were to spit out a data frame with 2 named columns and with rows corresponding to subproblems? (With compensatory adjustments wherever string "subdim" appears in R/ and tests/testthat/, of course.) The change is intended to make the function more friendly to users of the package.
I may implement something like this on the issue129 branch, as well as pulling in other recent changes from elsewhere. In the meantime, [issue129 eba2750] gives and example of the sort of end-user usage I would be hoping to support with this change. @markmfredrickson @josherrickson
I have no overall objections. It might make sense to transpose your idea though? So it enables something like this:
> sd <- subdim(ISM)
> sd$prob1
[1] 5 2
> sd$prob2
[1] 8 8
Each column would be the output of dim(matrix)
. Seems to me that pulling out each individual dim
will be more common than pulling out all row/columns across all subproblems. This would also mean your example code would not need modification.
One issue I see with this suggestion is what to do when there's one subproblem. Do you return a 2x1 dataframe? Or a vector? For your variation, you'd return a 1x2 dataframe which more closely mimics the vector returned by dim
.
Interesting. Relative to my proposal, Josh's has the advantages that
- Subproblem info is neatly extracted w "
$
", e.g. "sd$prob1
" ; - It's a smaller change to the function signature -- essentially returning the same things as before, but typed as
data.frame
rather thanlist
.
Its disadvantages relative to my proposal are that
3. Info about numbers of controls by subclass, or numbers of treatments by subclass, can't be neatly extracted w/ "$
";
4. in order to calculate a vector of control-to-treatment ratios by subclass, the user needs an sapply()
construction, instead of e.g. sd$num_treatments/sd$num_controls
.
Because of (4), the usage scenario of my example favors my original proposal. But not by a lot.
Thanks for the comment, @josherrickson .
Toying with Josh's idea a little: maybe it could slip into my usage scenario better if given suitable row names? I'm not sure if these are the best ones, but supposing it was c("treatments", "controls")
,
subdims_em <- subdim(em)
m2 <- fullmatch(pr ~ t1 + t2, within=em, data=nuclearplants,
mean.controls=pmin(1.5, subdims_em["controls",] / subdims_em["treatments",])
)
As compared to a $
-based syntax, there's still less user convenience and more room for user error than I'd like. But better than the sapply
based construction in [issue129 eba2750], I think.
The version of subdim() that we implemented can't handle BISMs with no eligible matches in any subclass.
Z <- rep(c(0,1), 4)
B <- rep(c("a", "b"), each=4)
x <- c((1L:4L)/10, (1L:4L) *10)
m <- exactMatch(Z ~ B)
m <- match_on(Z ~ x, within=m, method="euclidean")
m2 <- caliper(m, width=0.05)
subdim(m2)
## Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
the problem arises b/c this data frame has no columns, there being no valid nontrivial subproblems.
This wouldn't have come about had subdim() returned the transpose of what it now gives back, i.e. a data frame with a row for each valid subproblem. Then subdim() could simply have returned a data frame with no rows. I had suggested this above in the comment thread, then instead transposed at suggestion of @josherrickson . Now that I'm seeing this issue, I propose to revert to the form that I had proposed. In the meantime, feel free to speak up w/ any reservations or comments.
@josherrickson any comments on my proposal just above? (It goes somewhat against a suggestion you made further up in the same comment thread, a couple years back.)
I think the transpose makes sense.