
[R-package] [python-package] More convenient API for interaction_constraints

Open mayer79 opened this issue 11 months ago • 9 comments

Interaction constraints are extremely useful in practice. The API in LightGBM has two aspects that could be improved:

  1. Currently, the Python package allows only integer positions like [[0], [1, 2]]. We could also allow feature names, as in XGBoost or in LightGBM's R-package.
  2. Features not appearing in any interaction group are silently dropped from training altogether, unlike in scikit-learn and XGBoost, where such features can still interact (i.e., form their own interaction group). This applies to both the Python and R APIs. With this change, the example above could be entered as [[0]], or as [["living_area"]].

Both aspects help reduce user mistakes and align the API with XGBoost (and partly with scikit-learn's histogram gradient boosting implementation).
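To illustrate the first aspect, the name-based form could be translated to the integer positions the Python package currently requires. A minimal sketch (the feature names here are made up for illustration; in practice they would come from the Dataset):

```python
# Hypothetical feature names; in practice these would come from the Dataset.
feature_names = ["living_area", "age", "rooms"]

# Name-based constraints, as XGBoost accepts them ...
constraints_by_name = [["living_area"], ["age", "rooms"]]

# ... translated to the integer positions LightGBM currently expects.
constraints_by_pos = [
    [feature_names.index(f) for f in group] for group in constraints_by_name
]
print(constraints_by_pos)  # [[0], [1, 2]]
```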

I can make two PRs: one for the R-package, and one for the Python package.

@jameslamb @btrotta

mayer79 avatar Mar 20 '24 09:03 mayer79

@jameslamb could you also add a Python tag? There, we would need two things: translating feature names to integer positions, and adding the set of skipped features as an additional group.

mayer79 avatar Mar 22 '24 09:03 mayer79

Question to @jameslamb and others: in Python, the interaction constraints are transformed into a string here, like any other list or list-like object:

https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py:

def _param_dict_to_str(data: Optional[Dict[str, Any]]) -> str:
    """Convert Python dictionary to string, which is passed to C API."""
    if data is None or not data:
        return ""
    pairs = []
    for key, val in data.items():
        if isinstance(val, (list, tuple, set)) or _is_numpy_1d_array(val):
            pairs.append(f"{key}={','.join(map(_to_string, val))}")
        elif isinstance(val, (str, Path, _NUMERIC_TYPES)) or _is_numeric(val):
            pairs.append(f"{key}={val}")
        elif val is not None:
            raise TypeError(f"Unknown type of parameter:{key}, got:{type(val).__name__}")
    return " ".join(pairs)

Is it an option to add a snippet like this?

if key == "interaction_constraints":
    val = _prep_interaction_constraints(val, feature_names)

_prep_interaction_constraints() would do two things:

  1. Translate feature names to integers (if strings)
  2. Add a list with omitted features

Both would need access to the list of feature names, so we would either need to pass feature_names=None or the Dataset object to _param_dict_to_str(). Not sure if this is the right way to proceed?

mayer79 avatar Mar 26 '24 16:03 mayer79

could you also add a Python tag?

We don't actually have a python label but probably should. I've at least edited the PR title for now.

jameslamb avatar Mar 27 '24 02:03 jameslamb

hmmmm @btrotta @guolinke if interaction_constraints is provided, should features that do not appear be excluded from training?

At first I agreed with @mayer79 's assessment that they should not be excluded, but seeing this docstring makes me think they're intentionally excluded:

https://github.com/microsoft/LightGBM/blob/28536a0a5d0c46598fbc930a63cec72399a969e4/include/LightGBM/config.h#L575

jameslamb avatar Mar 27 '24 02:03 jameslamb

Is it an option to add a snippet like this?

I'd prefer that that be done outside of _param_dict_to_str(), similar to how categorical_feature is resolved:

https://github.com/microsoft/LightGBM/blob/28536a0a5d0c46598fbc930a63cec72399a969e4/python-package/lightgbm/basic.py#L2110-L2130

I'd support it being a private function, so it can be shared between cv() and train().

jameslamb avatar Mar 27 '24 02:03 jameslamb

I'm generally supportive of everything you've proposed at the top of this issue, thanks for working on it and for adding that context about consistency with XGBoost and scikit-learn!

Let's just please wait to hear from @btrotta or @guolinke about whether omitting features not mentioned in interaction_constraints was intentional, there might be something from #2884 and #3130

jameslamb avatar Mar 27 '24 02:03 jameslamb

Sorry, I don't remember about that 🤣, I think it was not intentional. We can align with XGBoost.

guolinke avatar Mar 27 '24 03:03 guolinke

😂 ok thank you, I don't remember either. Agree then, let's move forward @mayer79

jameslamb avatar Mar 27 '24 03:03 jameslamb

Is it an option to add a snippet like this?

I'd prefer that that be done outside of _param_dict_to_str(), similar to how categorical_feature is resolved:

https://github.com/microsoft/LightGBM/blob/28536a0a5d0c46598fbc930a63cec72399a969e4/python-package/lightgbm/basic.py#L2110-L2130

I'd support it being a private function, so it can be shared between cv() and train().

Thanks for the hint. I will draft the function soon, and then we can determine where to place it :-).

mayer79 avatar Mar 27 '24 06:03 mayer79