[Help Wanted] Combining stage 2 and 3, trouble with optimization problem

Open chusloj opened this issue 4 years ago • 1 comment

In my branch stage_2+3, I've been trying to combine stages 2 and 3 of the optimization process for PUF weights. I've expanded the variable for interest income (INTS) into 12 new variables, one for each of the 12 income bins specified in the SOI update instructions. After running stage1_just_INT.py from my branch to produce Stage 2 targets based on these new variables and then running stage2.py, I get the following error from Julia's JuMP solver:

[Image: Screen Shot 2021-01-27 at 10 55 30 AM]

A similar error was thrown by the CVXOPT optimization package used previously for Stage 2. CVXOPT's error said the coefficient matrix produced in dataprep.py (I use dataprep_INT.py in my branch) does not have full rank.
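A "does not have full rank" message usually means that at least one constraint row is a linear combination of the others. The following is a minimal sketch of how such rows might be located with NumPy; it is not code from the taxdata repository. The function name `redundant_constraint_rows` and the toy matrix are hypothetical, and the sketch assumes the coefficient matrix has one row per target and one column per record.

```python
import numpy as np


def redundant_constraint_rows(A):
    """Return indices of rows that are linear combinations of earlier rows.

    Assumes A has one row per constraint (target) and one column per record.
    Rows are added one at a time; a row is flagged when adding it does not
    increase the rank of the rows kept so far.
    """
    A = np.asarray(A, dtype=float)
    kept, redundant = [], []
    for i in range(A.shape[0]):
        candidate = kept + [i]
        if np.linalg.matrix_rank(A[candidate]) == len(candidate):
            kept.append(i)
        else:
            redundant.append(i)
    return redundant


# Toy example (made-up numbers): rows 0 and 1 play the role of two
# bin-specific variables, and row 2 is their sum, like an aggregate
# variable left in alongside its own bins.
A = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 4.0],
    [1.0, 3.0, 2.0, 4.0],
])
print(redundant_constraint_rows(A))  # -> [2]
```

Which row gets flagged depends on row order: the sketch reports the later row of any dependent group, so putting the aggregate row last makes it the one reported.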

I'm trying to identify why the optimization problem is infeasible, and I've had no luck so far.

chusloj avatar Jan 27 '21 16:01 chusloj

There are a lot of degrees of freedom here. I'd expect it to solve. My guess is you either have redundant constraints and your solver does not allow them, or one or more bad targets (relative to the data), or tightly restricted record weights that are inconsistent with the world as we know it (SOI data), but there are other possibilities.

I'd suggest isolating things to determine what, exactly, is causing the problem. Julia may have tools for that. If not, I'd suggest:

  • First, make sure that you've dropped INTS if you've included INTS_1-INTS_12 (or whatever you call them). Technically, this should not be the problem - INTS simply will be redundant, but I'd make sure it's not in there nonetheless. Some solvers could have problems with this; others may eliminate redundant constraints. Your "does not have full rank" error suggests that this could be the problem. I suppose some other combination of constraints could be redundant, but I doubt it. I assume Julia or Python has tools that will allow you to figure out why it is not full rank -- R does, and I would think they are common in languages that work with matrices.
  • I assume, also, that you have not adjusted the data via stage 3. Obviously that would be bad.
  • Work with a single year - the first year that did not solve.
  • Figure out if one or more INTS[i] targets seem implausible relative to the data. (If so, it's probably the data that are implausible, not the target.)
  • For example, compare the INTS[i] targets to initial values for them computed with previously constructed weights (weights that were developed using the INTS target but not the INTS[i] targets, or some other pretty good starting weights). Look for income ranges i where sum(INTS[i] * s006[i], summed over records) minus target[i], or the proportionate difference from target[i], is large and implausible. The Julia solver may provide this information automatically. Figure out which targets are bad (relative to the data - the problem may be with the data).
  • If there are good candidates for bad targets (bad income ranges), drop those targets and see if it solves.
  • If not, next I'd check restrictions on weights. I assume you are running it with very narrow ranges allowed for the new weights, as was done with stage 2 last time I looked at it. I think it allowed weights to be +/- 50% of their initial values. To find out, first I'd get a distribution of the ratio of new (unsuccessful) to initial weights. Are a lot of them on the boundary of what is allowed? If so, expand the boundaries - for example, 0 to infinity. Rerun and examine the distribution. Does it solve, but with a lot of weights far outside what you had hoped to restrict them to? My guess is that this is what you'll find. If so, then this is partly a philosophical question: is it plausible to think that in year X (whatever year you're looking at) the weights should be extremely close to those you used as initial weights? (For example, if you are targeting 2017 and your data are from 2011, this might not be a realistic assumption or restriction.) If not, then I'd loosen the restriction on weights. But if you are wedded to those restrictions, then you have to accept that you can't represent the world as you know it with the data as you have it, with the weights as you want to restrict them, and you'll have to decide which targets to loosen.
  • OTOH if it still doesn't solve after loosening weight restrictions, search methodically for targets that in combination are bad even though none looked implausible individually.
  • Run it with just WAGE_1-WAGE_12 and INTS[i] for all i and see if it solves (i.e., drop the aggregate other targets so we focus only on those that are income-range specific).
  • If it does not solve, then there is some inconsistency between the INTS[i] targets and the WAGE[i] targets in combination even though none of them looked implausible individually (from earlier tests). Do a binary search to look for it (e.g., run with all 12 wage targets and only 6 of the INTS targets, and so on) until you find what does not work.
  • Assuming you find a combination of INTS-WAGE targets that causes the failure, and it is not because you fed it incorrect targets, then you probably have data that just don't work with the targets (I find this unlikely). Short term fix is to loosen up the tolerances around this bad INTS[i] (or set of INTS[i] targets) until it solves, then on a more leisurely schedule, try to figure out what is wrong with the data (were the growfactors bad?).
  • OTOH, if it solves with all INTS-WAGE targets, then, there is some inconsistency between the INTS[i] targets and the other aggregate (not range-specific) targets. Seems unlikely, but if so, do a binary search for what's not working (add subsets of the aggregate targets back in). Consider a larger tolerance around one or more of these targets, or around the INTS targets.
  • and so on, breaking it down until you find what is causing the problem
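
Two of the steps in the list above, comparing weighted sums against targets and the binary search over targets, could be sketched in Python roughly as follows. This is an illustrative sketch under stated assumptions, not code from taxdata: the function names, the `solves` callback (which would wrap whatever runs the stage 2 solver and reports success), and the toy data are all hypothetical.

```python
def proportional_gaps(values_by_target, weights, targets):
    """Proportionate difference between each weighted sum and its target.

    values_by_target maps a target name (e.g. "INTS_3") to per-record values;
    weights holds the starting record weights (e.g. s006).
    """
    gaps = {}
    for name, values in values_by_target.items():
        weighted_sum = sum(v * w for v, w in zip(values, weights))
        gaps[name] = (weighted_sum - targets[name]) / targets[name]
    return gaps


def find_failing_subset(fixed, candidates, solves):
    """Halve `candidates` while `fixed + candidates` still fails to solve.

    `solves` is a hypothetical callback: it takes a list of target names,
    runs the optimization with only those targets, and returns True if a
    solution was found. Returns [] if the full problem already solves.
    """
    if solves(fixed + candidates):
        return []
    current = list(candidates)
    while len(current) > 1:
        mid = len(current) // 2
        left, right = current[:mid], current[mid:]
        if not solves(fixed + left):
            current = left
        elif not solves(fixed + right):
            current = right
        else:
            # Each half solves on its own, so the conflict spans both halves;
            # stop here and inspect `current` by hand.
            break
    return current


# Toy data (made-up numbers) for the gap check.
gaps = proportional_gaps(
    {"INTS_1": [10, 0, 0], "INTS_2": [0, 20, 30]},
    weights=[2, 1, 1],
    targets={"INTS_1": 20, "INTS_2": 100},
)

# Toy stand-in for the solver: pretend INTS_7 conflicts with WAGE_3.
def fake_solves(names):
    return not ("INTS_7" in names and "WAGE_3" in names)

wage = [f"WAGE_{i}" for i in range(1, 13)]
ints = [f"INTS_{i}" for i in range(1, 13)]
culprits = find_failing_subset(wage, ints, fake_solves)
```

In the toy run, `gaps` flags INTS_2 as 50% below its target and `culprits` narrows down to `["INTS_7"]`. With a real solver each call to `solves` is a full optimization run, which is why halving the candidate set is preferable to testing targets one at a time.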

donboyd5 avatar Jan 27 '21 16:01 donboyd5