[R-package] [python-package] More convenient API for interaction_constraints
Interaction constraints are extremely useful in practice. The API in LightGBM has two aspects that could be improved:
- Currently, the Python package allows only integer positions like `[[0], [1, 2]]`. We could also allow feature names, as in XGBoost or in the R-package of LightGBM.
- Features not appearing in any interaction group are silently dropped from training altogether, unlike in scikit-learn and XGBoost, where missing features can still interact (i.e., they form their own interaction group). This applies to both the Python and R APIs. The example above could then be entered as `[[0]]`, or as `[["living_area"]]` (see the sketch below).
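For illustration, here is how the proposed API could look from the user's side. This is only a sketch: passing names currently raises an error, and the feature names are made up.

```python
import numpy as np
import lightgbm as lgb

# Toy data with three named features (names are made up for illustration)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X[:, 0] + rng.normal(size=100)
dtrain = lgb.Dataset(X, label=y, feature_name=["living_area", "rooms", "year_built"])

params = {
    "objective": "regression",
    # Proposed: feature names instead of positions; equivalent to [[0], [1, 2]].
    # With the second proposal, [["living_area"]] would keep "rooms" and
    # "year_built" in training as their own interaction group instead of
    # silently dropping them.
    "interaction_constraints": [["living_area"], ["rooms", "year_built"]],
}
bst = lgb.train(params, dtrain, num_boost_round=10)
```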
Both aspects help to reduce user mistakes and align the API with XGBoost (and partly with scikit-learn's histogram gradient boosting implementation).
I can make two PRs: one for the R-package, and one for the Python package.
@jameslamb @btrotta
@jameslamb could you also add a Python tag? There, we would need two things: translating feature names to integer positions, and adding the set of skipped features as an additional interaction group.
Question to @jameslamb and others: in Python, the interaction constraints are transformed into a string here, just like any other list-like object:
https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py
```python
def _param_dict_to_str(data: Optional[Dict[str, Any]]) -> str:
    """Convert Python dictionary to string, which is passed to C API."""
    if data is None or not data:
        return ""
    pairs = []
    for key, val in data.items():
        if isinstance(val, (list, tuple, set)) or _is_numpy_1d_array(val):
            pairs.append(f"{key}={','.join(map(_to_string, val))}")
        elif isinstance(val, (str, Path, _NUMERIC_TYPES)) or _is_numeric(val):
            pairs.append(f"{key}={val}")
        elif val is not None:
            raise TypeError(f"Unknown type of parameter:{key}, got:{type(val).__name__}")
    return " ".join(pairs)
```
Is it an option to add a snippet like this?
if key == "interaction_constraints":
val = _prep_interaction_constraints(vals, feature_names)
`_prep_interaction_constraints()` would do two things:
- translate feature names to integer positions (if strings)
- add a list with the omitted features
Both would need access to the list of feature names, so we would need to pass either `feature_names` (defaulting to `None`) or the `Dataset` object to `_param_dict_to_str()`. I'm not sure whether this is the right way to proceed?
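Here is a rough sketch of what I have in mind; the name and exact signature are just placeholders:

```python
from typing import Dict, List, Sequence, Union

def _prep_interaction_constraints(
    constraints: List[List[Union[int, str]]],
    feature_names: Sequence[str],
) -> List[List[int]]:
    """Placeholder sketch: resolve names to positions and keep leftovers.

    1. Translate feature names to integer positions (if strings).
    2. Collect features missing from all groups into one extra group,
       so they are no longer silently dropped from training.
    """
    name_to_pos: Dict[str, int] = {name: i for i, name in enumerate(feature_names)}
    resolved = [
        [name_to_pos[feat] if isinstance(feat, str) else feat for feat in group]
        for group in constraints
    ]
    covered = {pos for group in resolved for pos in group}
    leftover = [pos for pos in range(len(feature_names)) if pos not in covered]
    if leftover:
        resolved.append(leftover)
    return resolved

# _prep_interaction_constraints([[0]], ["living_area", "rooms", "year_built"])
# -> [[0], [1, 2]]
```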
> could you also add a Python tag?
We don't actually have a `python` label but probably should. I've at least edited the PR title for now.
hmmmm @btrotta @guolinke if `interaction_constraints` is provided, should features that do not appear in it be excluded from training?
At first I agreed with @mayer79's assessment that they should not be excluded, but seeing this docstring makes me think they're intentionally excluded:
https://github.com/microsoft/LightGBM/blob/28536a0a5d0c46598fbc930a63cec72399a969e4/include/LightGBM/config.h#L575
> Is it an option to add a snippet like this?
I'd prefer that that be done outside of `_param_dict_to_str()`, similar to how `categorical_feature` is resolved:
https://github.com/microsoft/LightGBM/blob/28536a0a5d0c46598fbc930a63cec72399a969e4/python-package/lightgbm/basic.py#L2110-L2130
I'd support it being a private function, so it can be shared between `cv()` and `train()`.
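Roughly something like this, building on the helper sketched above (all names here are hypothetical):

```python
from typing import Any, Dict, List, Optional

def _resolve_interaction_constraints(
    params: Dict[str, Any], feature_names: Optional[List[str]]
) -> Dict[str, Any]:
    # Hypothetical shared helper for train() and cv(): resolve
    # interaction_constraints before the params dict is stringified.
    constraints = params.get("interaction_constraints")
    if constraints is None or not feature_names:
        return params
    params = dict(params)  # don't mutate the caller's dict
    params["interaction_constraints"] = _prep_interaction_constraints(
        constraints, feature_names
    )
    return params
```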
I'm generally supportive of everything you've proposed at the top of this issue, thanks for working on it and for adding that context about consistency with XGBoost and scikit-learn!
Let's just please wait to hear from @btrotta or @guolinke about whether omitting features not mentioned in `interaction_constraints` was intentional; there might be something relevant in #2884 and #3130.
Sorry, I don't remember 🤣. I think it was not intentional. We can align with XGBoost.
😂 ok thank you, I don't remember either. Agreed then, let's move forward @mayer79
> > Is it an option to add a snippet like this?
>
> I'd prefer that that be done outside of `_param_dict_to_str()`, similar to how `categorical_feature` is resolved: https://github.com/microsoft/LightGBM/blob/28536a0a5d0c46598fbc930a63cec72399a969e4/python-package/lightgbm/basic.py#L2110-L2130
>
> I'd support it being a private function, so it can be shared between `cv()` and `train()`.
Thanks for the hint. I will draft the function soon, and then we can decide where to place it :-).