Equality constraints involving params values not always caught correctly
Bug description
The following constraint:
{
'query': "category == 'transition' & name2 == 'x' & name1 == 'y' & not (-1e-10 < value < 1e-10)",
'type': 'equality'
}
is not caught correctly because in line 62 of process_selectors.py, we have:
selected = evaluator(helper)
with helper (after applying the query):
value lower_bound
category period name1 name2
transition 0 y x 200 -inf
1 y x 216 -inf
2 y x 232 -inf
3 y x 248 -inf
That is, the value fields of helper contain the indices in internal_params (I suppose) instead of the value field of the original params DataFrame.
What I am actually trying to achieve is to exclude the rows with values 200 and 248 in the above DataFrame because they are fixed to zero. The corresponding constraints may change as I try different specifications, so I want to do so in a programmatic way -- I'd be curious to know if there were better options.
To Reproduce
I'd hope the above is clear enough, but I can try to distill an example.
Expected behavior
Rows values 200 and 248 should not appear in the result of evaluator(helper) because their value in the DataFrame passed to estimate_ml as the params argument is 0.0.
System
- Version 0.4.3
Potential solution
Around line 40 of process_selectors, define helper as:
helper = params.copy()
helper["internal_index"] = tree_converter.params_unflatten(np.arange(n_params))["value"]
and adjust _get_selection_evaluator accordingly. This would work out-of-the box for field == "query" with an appropriate adjustment of the key (value -> internal_index in line 181). However, I do not see through what it would imply for other types of fields.
Thanks for the detailed issue and debugging!
I think this is a general problem that occurs if a query or selector uses parameter values.
The reason why we replace params with the helper is that the selected helper values will be the indices of a flattened params vector.
Changing this in general is actually a non-trivial problem and it will take a while to fix this.
Thanks, I understand that. But why would you require the "value" field in the helper and can't just use an arbitrary name?
We use the "value" field because it avoids rewriting tree flattening rules. But we might be able to avoid it.
The whole problem would be trivial to fix if we treated your case of 1 params DataFrame as a special case. Since users can also use DataFrames deep inside nested dictionaries, our solution needs to be more general.
For your case, a simple workaround could be to evaluate your queries on the start params and replace them by locs. i.e.
new_constraints = []
for constr in constraints:
if "query" in constr:
new_constr = constr.copy()
q = new_constr.pop("query")
new_constr["loc"] = start_params.query(q).index
new_constraints.append(new_constr)
else:
new_constraints.append(constr)
Almost. This works well for the entire start_params, but helper only has the parameters that actually vary in the optimization problem. Hence, the index of parameters that are fixed does not exist and the estimation errors out with an IndexError.
I'll try to think through some further workaround...
Yes, I would run this on your start params before you pass the constraints to estimagic.
That is what I did. But different .loc-constraints converted this way fail because the corresponding index is not present anymore in helper.
Put differently,
>>> len(start_params)
360
>>> len(helper)
80
A query just returns an empty object, a loc-constraint yields an IndexError.
I am very surprised.
We define helper as:
n_params = len(tree_converter.params_flatten(params))
helper = tree_converter.params_unflatten(np.arange(n_params))
This is before any constraints handling is done, i.e. at this point we do not even know which parameters will be fixed later. helper should look exactly like params, except for replacing parameter values with integers ranging from 0 to n_params - 1.
To explain a bit more: We always treat parameters as a general pytree that can be flattened and unflattened. Even if params is just one DataFrame.
A solution like:
helper = params.copy()
helper["internal_index"] = tree_converter.params_unflatten(np.arange(n_params))["value"]
Assumes that params is something that has a value column and can be assigned new columns. We cannot make this assumption in general.
Thanks, I'll run it through the debugger again asap and will report back with more details. Probably my mistake somewhere.
Thinking of it: Most likely explanation is that this error occurs now during the generation of sensible starting values for a full skillmodels run, i.e. the reduction in parameters is because of that. So will have to do the conversion from query to loc before setting each different model to run!
That was it.
I'd think the use case is sufficiently rare to not support it. The way to go would be to exclude if a regex like string_to_error_out here:
delimit = r"[ \n&|><=()]"
string_to_error_out = f"{delimit}value{delimit}"
shows up in a query.
Error message should include an explanation and your workaround, I am using this function now:
def convert_queries_to_locs(start_params, constraints):
new_constraints = []
for constr in constraints:
if "query" in constr:
new_constr = constr.copy()
q = new_constr.pop("query")
q_result = start_params.query(q)
if not q_result.empty:
new_constr["loc"] = q_result.index
new_constraints.append(new_constr)
else:
new_constraints.append(constr)
return new_constraints
As you say, this is sufficiently rare that we don't need to support it. Moreover, it's made partially obsolete with #495 where we deprecate queries as a means to select parameters (they can still be used inside selectors of course, but estimagic won't know about that). Your fix will remain mostly unaffected.
The documentation of constraints will be rewritten completely in the process and I'll make sure that this case is explained clearly.