optmatch
optmatch copied to clipboard
dedicated structure for node prices & related info
This structure might incorporate elements and functionality from the Adam Rauh's Optmatch S4 class, but in contrast to that class it won't attempt also to encode the factor type encoding of matches as in the current s3 optmatch class. (These structures might eventually be incorporated into a class like that.)
Road map: vignettes/dual-MCF-solution.Rmd in branch issue54-hinting.
As of 34e2474 , this compiles into dual-MCF-solution.pdf.
For MCFSolutions
objects I don't think we actually need object@matchables
: the info in there is available as
object@nodes %>% dplyr::filter(kind!="bookkeeping") %>% dplyr::select(name, kind,subproblem)
Unless we might need a separate table to keep track of units present in the originating matching problem but flagged as unmatchable en route to the MCF solver. (Not the original motivation for this table, that being no longer relevant, but potentially worthwhile.) However, in this case the slot should be called "unmatchables
", not "matchables
".
Another adjustment we'll need: the user's treatment/control dichotomy may differ from the parallel dichotomy as understood by the solver, i.e. "T(f)" vs "C(f)" in this picture:
That's b/c this part of .fullmatch()
in some cases flips the roles of the two before passing down toward the solver:
# if omf is specified (i.e. not NA), see if is non-negative
# if omf is not specified, check to see if mxctl is > .5
if (switch(1 + is.na(omf), omf >= 0, mxctl > .5)) {
maxc <- min(mxctl, ncol)
minc <- max(mnctl, 1/nrow)
omf.calc <- omf
} else {
maxc <- min(1/mnctl, ncol)
minc <- max(1/mxctl, 1/nrow)
omf.calc <- -1 * omf
d <- t(d)
}
temp <- SubDivStrat(rownames = rownames(d),
colnames = colnames(d),
distspec = d,
max.cpt = maxc,
min.cpt = minc,
tolerance = TOL * tol.frac,
omit.fraction = if(!is.na(omf)) { omf.calc }) # passes NULL for NA
return(temp)
}
For purposes of restarting a min-cost flow solver, of course, the original orientation will need to be preserved. In practice I think this means that the SubProbInfo
class will have to carry information about the mapping of treatment/control to "upstream" vs "downstream" non-bookkeeping nodes, i.e. T vs C in above picture.
As Mark noted on #166, joining of NodeInfo and ArcInfo tables (e.g. functions in R/complementarySlackness.R) along a factor would be aided by having the two factors have a common set of levels. Perhaps MCFSolutions objects could observe rules that:
- The
groups
column of@subproblems
(a SubProbInfo object) serves as the level set for@nodes$groups
,@arcs@matches$groups
, and@arcs@bookkeeping$groups
. - The
name
column of@nodes
serves as the level set for@arcs@matches$upstream
,@arcs@matches$downstream
,@arcs@bookkeeping$start
and@arcs@bookkeeping$end
. In practice this might be implemented by changing the type of@nodes$groups
, ...,@arcs@bookkeeping$end
to integer or factor, then defining methods for them in such a way that when they live together in an MCFSolutions object they're integer variables but when you extract just the ArcInfo or NodeInfo component they get coerced to factor, with level set as just described.
Variation on (2):
2'. The name
column of @nodes
becomes index
, a column of integers pointing to elements of @matchables$name
. Also @matchables$name
serves as the level set for @arcs@matches$upstream
, ..., @arcs@bookkeeping$end
.
@markmfredrickson does it seem to you that this would make dplyr::left_join()
happier (cf #166)?
Currently SubProbInfo doesn't attempt to describe subproblems' structural restrictions (min.controls etc). That was intentional, to avoid binding the class too tightly to full matching, as opposed to other matching schemes implemented with min cost flow solvers. However, it would help #76 to be able to quickly compare specific rows of 2 SubProbInfo tables to determine whether they're describing the same subproblem.
Perhaps the FullmatchMCFSolutions class should add columns to its SubProbInfo that describe these restrictions, along with a new logical column the flags whether restrictions stated in the table describe all of the constraints used to define that subproblem. When the answer to that question is no, then we'd know that one has to compare @nodes
and @arcs
entries in order to decide whether two subproblems were the same.
I no longer agree w/ my previous (5/28/2019) comment on this thread: subproblems of problems A and B should be treated as the same iff it's a context where A and B are expected to have the same subproblems, and the subproblems have the same groups
and hashed_dist
entries.
As a result, I think this is all but done. Two things I think merit additional attention:
- [ ] The class names
MCFSolutions
andFullmatchMCFSolutions
should be revised.
The structures code networks and corresponding flows, but the encoded flows needn't solve any implicit min-cost-flow problem, and I'd like to be able to use them without that presumption. Perhaps BipMatchNF
/FullMatchNF
instead, where "BipMatch" means "bipartite match" and "NF" denotes "network flow"?
- [ ] These classes should have a slot for ~~the name(s) of the variable(s) that were used to define~~ a lookup table associating combinations of blocking variables with levels of
groups
. A slot for this lookup table should be added to the BISM definition.
To facilitate eventual joins involving whatever variables stand behind groups
. The original names of this/these variable(s) can be recorded as column headers in the lookup table. Whatever slot of this nature may be added to MCFSolutions
/BipMatchNF
class should be added to the BlockedInfinitySparseMatrix
class in the same stroke, as that will be the place to store this information when it's extracted from a match_on.formula()
or exactMatch()
call.
- I'm ok with either naming scheme, though I think longer is better for something we will rarely have to type but could benefit from the additional information.
- I don't see any harm in the second idea if it provides functionality.
Three more things:
1. I suspect the SubProblemInfo's should store dual and primal values, not dual and Lagrangian values: it's the dual-primal gap that bounds the regret. So, unless we think better of this, there's one more to-do:
- [ ] ~~replace "
lagrangian
" with~~ add "primal
" to columns of SubProbInfo
(It's unclear whether the Lagrangian column should stay, or be replaced with the dual calculation. When a problem was found to be infeasible, "primal value" arguably doesn't make sense, while the Lagrangian does, and might conceivable provide a basis for insights into how far from feasible the problem was. But this is conjectural.)
I'm still not seeing how the Lagrangian and the dual objective, without the primal value, could give us information about regret that could be used for regret bounds (#164) and/or prioritizing subproblems for updating (#167). (Of course "regret" isn't meaningful for subproblems found to be infeasible. )
2. For the reasons noted in this comment above from 2 years back, to facilitate #167 we should be recording problem restrictions in the subproblem table. In contrast the musing in that comment, however, it's simpler to
- [ ] Allow SubProbInfo tables to have arbitrary additional columns, while insisting on specific additional columns for SubProbInfo components of an MCFSolutions object.
Those additional columns would include min.controls=
, max.controls=
, and omit.fraction=
, as well as another column to flag whether all of the problem restrictions were encoded here.
3. Minor: The current MCFSolutions.Rmd
vignette creates issues upon make check due to its use of the magrittr pipe. Since R 4.1.0 and later offer |>
as a native forward pipe, I should try it out here once I've updated my R.
- [x] Update MCFSolutions.Rmd vignette
@adamrauh I'm inviting your contributions here, toward closing the four open to-do's noted in the comments of this issue. The first of the four, about renaming classes, should probably come last, after handling the others and then using the comments field here, to inform the other contributors about the proposed new names and solicit comments.
For ease of reference, here's the current (last touched June 2022) version of the roadmap document of this project, as pdf: MCFSolutions-d1bb5c0.pdf. It won't be entirely self-explanatory, and I'll be happy to answer questions and/or walk you through it; but do have a look through first, if you haven't already done so.
I've been toying with some solutions aimed at dealing with tasks 3 and 4, which you can see in full here. My thought was that the storage, extraction and general management of subproblem data is likely to be a non-trivial component of future work related to warm starts, hints, etc, so it's worth trying to modularize/separate out this functionality as we develop related features.
-
I just implemented a new function in a new file here for convenience -- but I could easily see some kind of
update()
method for theSubProbInfo
class offering a cleaner implementation of the same thing. -
In this version, subproblem constraints are recorded right after being passed back from the solver.
min.cpt
,max.cpt
,omit.fraction
might not correspond to what the user specified originally, but this strikes me as the right level at which to be recording these. It looks like the originally specifiedmax.controls
,min.controls
, andomit.fraction
values are being stored as attributes already on the object returned byfullmatch()
orpairmatch()
Would welcome any and all thoughts on this
- I like the modularization you've added.
1'. Rather than writing aupdate.SubProbInfo()
method, I'd encourage coming up with a name that doesn't suggest that it's an SPI update method, this would suggest that it takes in an SPI and spits one out, whereas this is really taking in and operating on an MCFSoln. If I'm reading it correctly, I also think it would be cleaner for the new function to to operate ontemp[["MCFSolution"]]
rather than justtemp
, and to spit out an MCFSoln. 2. I agree with you about which subprob constraints to grab and record.
To document my experiments and their status on this front:
Regarding:
- [x] add "
primal
" to columns of SubProbInfo
This is done easily enough, implemented here. As mentioned previously, this work is now being wrapped into a separate function. It seems very plausible that a set of functions for getting, setting, modifying etc. subproblem data is likely to developed in the future. So, this is a potential step in that direction.
- [x] Allow SubProbInfo tables to have arbitrary additional columns, while insisting on specific additional columns for SubProbInfo components of an MCFSolutions object.
This is handled within the same function, at least for the time being. The current implementation should make it easy for additional parameters to be tracked, so long as they are passed correctly to the line here. Specifically, the code adds a column to the SubProbInfo
table for each named argument passed in to trackSubProblemInfoConstraints()
via ...
.
- [ ] These classes should have a slot for ~~the name(s) of the variable(s) that were used to define~~ a lookup table associating combinations of blocking variables with levels of
groups
. A slot for this lookup table should be added to the BISM definition.
Commits here and here have a hacky implementation working towards a solution for this.
The general overview of what's happening here is that the mapping between values of variables that define the internal "groups" or "subproblemids" is saved at the point it is created, at which point it needs to be passed back up the chain and returned as part of the solution. Right now, this is done by stapling on an attribute to BISMs called groupTable
-- this is then added as an attribute to the subproblems
slot within an MCFSolutions
object.
To offer a quick example:
data('nuclearplants')
m3 <- fullmatch(pr ~ t1 + t2 + strata(pt) + strata(ct), data=nuclearplants)
attr(attr(m3, "MCFSolutions")@subproblems, 'groupTable')
returns
pt ct group
1 0 0 0.0
2 0 1 0.1
3 1 0 1.0
4 1 1 1.1
This approach was relatively straightforward to implement and did not require modifying key function signatures, constructors, etc. But, a more robust approach utilizing the existing S4 classes and associated toolboxes should be implemented in the future. We should certainly think more about what attributes/slots go where and why. I was just avoiding getting bogged down in the details of such an implementation for the time being.
The current implementation is more of an experiment than a long term solution. That said, it does currently pass all tests.
I should note that it also has the virtue of recording all possible groups/subproblems that can be constructed using the specified variables, regardless of whether or not that subproblem is solved. So, note the existence of group 1.1 above is missing here:
> attr(m3, "MCFSolutions")@subproblems
Object of class "SubProbInfo"
groups flipped hashed_dist resolution lagrangian_value dual_value feasible exceedance primal_value min.cpt max.cpt omit.fraction
1 0.0 FALSE 1261.03057757336 0.001230769 11.070888 11.062077 FALSE -0.003573720 11.070888 0.2500000 10 NA
2 1.0 FALSE 1261.03057757336 0.001230769 3.892905 3.892905 FALSE -0.001864466 3.892905 0.3333333 2 NA
3 0.1 FALSE 1261.03057757336 0.001230769 17.007011 16.996267 FALSE -0.002835035 17.007011 0.3333333 9 NA
- [ ] The class names
MCFSolutions
andFullmatchMCFSolutions
should be revised.
Given the experimental nature of what I have here, I just left the names as is.