optmatch dedicated structure for node prices & related info

This structure might incorporate elements and functionality from the Adam Rauh's Optmatch S4 class, but in contrast to that class it won't attempt also to encode the factor type encoding of matches as in the current s3 optmatch class. (These structures might eventually be incorporated into a class like that.)

May 08 '19 21:05 benthestatistician

Road map: vignettes/dual-MCF-solution.Rmd in branch issue54-hinting.

As of 34e2474 , this compiles into dual-MCF-solution.pdf.

May 14 '19 20:05 benthestatistician

For MCFSolutions objects I don't think we actually need object@matchables: the info in there is available as

object@nodes %>% dplyr::filter(kind!="bookkeeping") %>% dplyr::select(name, kind,subproblem)

Unless we might need a separate table to keep track of units present in the originating matching problem but flagged as unmatchable en route to the MCF solver. (Not the original motivation for this table, that being no longer relevant, but potentially worthwhile.) However, in this case the slot should be called "unmatchables", not "matchables".

May 16 '19 17:05 benthestatistician

Another adjustment we'll need: the user's treatment/control dichotomy may differ from the parallel dichotomy as understood by the solver, i.e. "T(f)" vs "C(f)" in this picture:

That's b/c this part of .fullmatch() in some cases flips the roles of the two before passing down toward the solver:

    # if omf is specified (i.e. not NA), see if is non-negative
    # if omf is not specified, check to see if mxctl is > .5
    if (switch(1 + is.na(omf), omf >= 0,  mxctl > .5)) {
      maxc <- min(mxctl, ncol)
      minc <- max(mnctl, 1/nrow)
      omf.calc <- omf

    } else {
      maxc <- min(1/mnctl, ncol)
      minc <- max(1/mxctl, 1/nrow)
      omf.calc <- -1 * omf
      d <- t(d)
    }

    temp <- SubDivStrat(rownames = rownames(d),
                        colnames = colnames(d),
                        distspec = d,
                        max.cpt = maxc,
                        min.cpt = minc,
                        tolerance = TOL * tol.frac,
                        omit.fraction = if(!is.na(omf)) { omf.calc }) # passes NULL for NA

    return(temp)
  }

For purposes of restarting a min-cost flow solver, of course, the original orientation will need to be preserved. In practice I think this means that the SubProbInfo class will have to carry information about the mapping of treatment/control to "upstream" vs "downstream" non-bookkeeping nodes, i.e. T vs C in above picture.

May 17 '19 17:05 benthestatistician

As Mark noted on #166, joining of NodeInfo and ArcInfo tables (e.g. functions in R/complementarySlackness.R) along a factor would be aided by having the two factors have a common set of levels. Perhaps MCFSolutions objects could observe rules that:

The groups column of @subproblems (a SubProbInfo object) serves as the level set for @nodes$groups, @arcs@matches$groups, and @arcs@bookkeeping$groups.
The name column of @nodes serves as the level set for @arcs@matches$upstream, @arcs@matches$downstream, @arcs@bookkeeping$start and @arcs@bookkeeping$end. In practice this might be implemented by changing the type of @nodes$groups, ..., @arcs@bookkeeping$end to integer or factor, then defining methods for them in such a way that when they live together in an MCFSolutions object they're integer variables but when you extract just the ArcInfo or NodeInfo component they get coerced to factor, with level set as just described.

Variation on (2): 2'. The name column of @nodes becomes index, a column of integers pointing to elements of @matchables$name. Also @matchables$name serves as the level set for @arcs@matches$upstream, ..., @arcs@bookkeeping$end.

@markmfredrickson does it seem to you that this would make dplyr::left_join() happier (cf #166)?

May 28 '19 15:05 benthestatistician

Currently SubProbInfo doesn't attempt to describe subproblems' structural restrictions (min.controls etc). That was intentional, to avoid binding the class too tightly to full matching, as opposed to other matching schemes implemented with min cost flow solvers. However, it would help #76 to be able to quickly compare specific rows of 2 SubProbInfo tables to determine whether they're describing the same subproblem.

Perhaps the FullmatchMCFSolutions class should add columns to its SubProbInfo that describe these restrictions, along with a new logical column the flags whether restrictions stated in the table describe all of the constraints used to define that subproblem. When the answer to that question is no, then we'd know that one has to compare @nodes and @arcs entries in order to decide whether two subproblems were the same.

May 28 '19 16:05 benthestatistician

I no longer agree w/ my previous (5/28/2019) comment on this thread: subproblems of problems A and B should be treated as the same iff it's a context where A and B are expected to have the same subproblems, and the subproblems have the same groups and hashed_dist entries.

As a result, I think this is all but done. Two things I think merit additional attention:

[ ] The class names MCFSolutions and FullmatchMCFSolutions should be revised.

The structures code networks and corresponding flows, but the encoded flows needn't solve any implicit min-cost-flow problem, and I'd like to be able to use them without that presumption. Perhaps BipMatchNF/FullMatchNF instead, where "BipMatch" means "bipartite match" and "NF" denotes "network flow"?

[ ] These classes should have a slot for ~~the name(s) of the variable(s) that were used to define~~ a lookup table associating combinations of blocking variables with levels of groups. A slot for this lookup table should be added to the BISM definition.

To facilitate eventual joins involving whatever variables stand behind groups. The original names of this/these variable(s) can be recorded as column headers in the lookup table. Whatever slot of this nature may be added to MCFSolutions/BipMatchNF class should be added to the BlockedInfinitySparseMatrix class in the same stroke, as that will be the place to store this information when it's extracted from a match_on.formula() or exactMatch() call.

May 20 '21 20:05 benthestatistician

I'm ok with either naming scheme, though I think longer is better for something we will rarely have to type but could benefit from the additional information.
I don't see any harm in the second idea if it provides functionality.

May 20 '21 21:05 markmfredrickson

Three more things:

1. I suspect the SubProblemInfo's should store dual and primal values, not dual and Lagrangian values: it's the dual-primal gap that bounds the regret. So, unless we think better of this, there's one more to-do:

[ ] ~~replace "lagrangian" with~~ add "primal" to columns of SubProbInfo

(It's unclear whether the Lagrangian column should stay, or be replaced with the dual calculation. When a problem was found to be infeasible, "primal value" arguably doesn't make sense, while the Lagrangian does, and might conceivable provide a basis for insights into how far from feasible the problem was. But this is conjectural.)

I'm still not seeing how the Lagrangian and the dual objective, without the primal value, could give us information about regret that could be used for regret bounds (#164) and/or prioritizing subproblems for updating (#167). (Of course "regret" isn't meaningful for subproblems found to be infeasible. )

2. For the reasons noted in this comment above from 2 years back, to facilitate #167 we should be recording problem restrictions in the subproblem table. In contrast the musing in that comment, however, it's simpler to

[ ] Allow SubProbInfo tables to have arbitrary additional columns, while insisting on specific additional columns for SubProbInfo components of an MCFSolutions object.

Those additional columns would include min.controls=, max.controls=, and omit.fraction=, as well as another column to flag whether all of the problem restrictions were encoded here.

3. Minor: The current MCFSolutions.Rmd vignette creates issues upon make check due to its use of the magrittr pipe. Since R 4.1.0 and later offer |> as a native forward pipe, I should try it out here once I've updated my R.

[x] Update MCFSolutions.Rmd vignette

May 21 '21 03:05 benthestatistician

@adamrauh I'm inviting your contributions here, toward closing the four open to-do's noted in the comments of this issue. The first of the four, about renaming classes, should probably come last, after handling the others and then using the comments field here, to inform the other contributors about the proposed new names and solicit comments.

For ease of reference, here's the current (last touched June 2022) version of the roadmap document of this project, as pdf: MCFSolutions-d1bb5c0.pdf. It won't be entirely self-explanatory, and I'll be happy to answer questions and/or walk you through it; but do have a look through first, if you haven't already done so.

May 23 '24 22:05 benthestatistician

I've been toying with some solutions aimed at dealing with tasks 3 and 4, which you can see in full here. My thought was that the storage, extraction and general management of subproblem data is likely to be a non-trivial component of future work related to warm starts, hints, etc, so it's worth trying to modularize/separate out this functionality as we develop related features.

I just implemented a new function in a new file here for convenience -- but I could easily see some kind of update() method for the SubProbInfo class offering a cleaner implementation of the same thing.
In this version, subproblem constraints are recorded right after being passed back from the solver. min.cpt, max.cpt, omit.fraction might not correspond to what the user specified originally, but this strikes me as the right level at which to be recording these. It looks like the originally specified max.controls, min.controls, and omit.fraction values are being stored as attributes already on the object returned by fullmatch() or pairmatch()

Would welcome any and all thoughts on this

Jun 04 '24 19:06 adamrauh

I like the modularization you've added.
1'. Rather than writing a update.SubProbInfo() method, I'd encourage coming up with a name that doesn't suggest that it's an SPI update method, this would suggest that it takes in an SPI and spits one out, whereas this is really taking in and operating on an MCFSoln. If I'm reading it correctly, I also think it would be cleaner for the new function to to operate on temp[["MCFSolution"]] rather than just temp, and to spit out an MCFSoln. 2. I agree with you about which subprob constraints to grab and record.

Jun 05 '24 01:06 benthestatistician

To document my experiments and their status on this front:

Regarding:

[x] add "primal" to columns of SubProbInfo

This is done easily enough, implemented here. As mentioned previously, this work is now being wrapped into a separate function. It seems very plausible that a set of functions for getting, setting, modifying etc. subproblem data is likely to developed in the future. So, this is a potential step in that direction.

[x] Allow SubProbInfo tables to have arbitrary additional columns, while insisting on specific additional columns for SubProbInfo components of an MCFSolutions object.

This is handled within the same function, at least for the time being. The current implementation should make it easy for additional parameters to be tracked, so long as they are passed correctly to the line here. Specifically, the code adds a column to the SubProbInfo table for each named argument passed in to trackSubProblemInfoConstraints() via ....

[ ] These classes should have a slot for ~~the name(s) of the variable(s) that were used to define~~ a lookup table associating combinations of blocking variables with levels of groups. A slot for this lookup table should be added to the BISM definition.

Commits here and here have a hacky implementation working towards a solution for this.

The general overview of what's happening here is that the mapping between values of variables that define the internal "groups" or "subproblemids" is saved at the point it is created, at which point it needs to be passed back up the chain and returned as part of the solution. Right now, this is done by stapling on an attribute to BISMs called groupTable -- this is then added as an attribute to the subproblems slot within an MCFSolutions object.

To offer a quick example:

data('nuclearplants')

m3 <- fullmatch(pr ~ t1 + t2 + strata(pt) + strata(ct), data=nuclearplants)
attr(attr(m3, "MCFSolutions")@subproblems, 'groupTable')

returns

  pt ct group
1  0  0   0.0
2  0  1   0.1
3  1  0   1.0
4  1  1   1.1

This approach was relatively straightforward to implement and did not require modifying key function signatures, constructors, etc. But, a more robust approach utilizing the existing S4 classes and associated toolboxes should be implemented in the future. We should certainly think more about what attributes/slots go where and why. I was just avoiding getting bogged down in the details of such an implementation for the time being.

The current implementation is more of an experiment than a long term solution. That said, it does currently pass all tests.

I should note that it also has the virtue of recording all possible groups/subproblems that can be constructed using the specified variables, regardless of whether or not that subproblem is solved. So, note the existence of group 1.1 above is missing here:

> attr(m3, "MCFSolutions")@subproblems
Object of class "SubProbInfo"
  groups flipped      hashed_dist  resolution lagrangian_value dual_value feasible   exceedance primal_value   min.cpt max.cpt omit.fraction
1    0.0   FALSE 1261.03057757336 0.001230769        11.070888  11.062077    FALSE -0.003573720    11.070888 0.2500000      10            NA
2    1.0   FALSE 1261.03057757336 0.001230769         3.892905   3.892905    FALSE -0.001864466     3.892905 0.3333333       2            NA
3    0.1   FALSE 1261.03057757336 0.001230769        17.007011  16.996267    FALSE -0.002835035    17.007011 0.3333333       9            NA

[ ] The class names MCFSolutions and FullmatchMCFSolutions should be revised.

Given the experimental nature of what I have here, I just left the names as is.

Jun 14 '24 21:06 adamrauh

optmatch optmatch copied to clipboard

dedicated structure for node prices & related info

optmatch
optmatch copied to clipboard