GH-16524 GLM - control variables - Gaussian, Bernoulli
Issue #16524
Add a parameter that specifies control variables and removes the effects of these variables from predictions and from the calculation of model metrics.
When the GLM is fitted, control variables are fitted in the model like regular predictors. After the model is fitted, the control variables' effects are removed when predicting with the model and when calculating metrics.
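The behaviour described above can be illustrated with a minimal sketch in plain Python. It is not the H2O implementation. It assumes that "removing the effect" of a control variable means dropping its term from the linear predictor at scoring time. The function names, the coefficient values and the column names (`age`, `region_score`) are hypothetical and chosen only for illustration.

```python
import math

def linear_predictor(row, coefs, intercept, control_variables=()):
    """Intercept plus coefficient * value for every predictor not listed as control."""
    eta = intercept
    for name, beta in coefs.items():
        if name in control_variables:
            # Assumption: a removed effect means this term is left out of the sum.
            continue
        eta += beta * row[name]
    return eta

def predict(row, coefs, intercept, family="gaussian", control_variables=()):
    """Apply the inverse link for the two families covered by this PR."""
    eta = linear_predictor(row, coefs, intercept, control_variables)
    if family == "gaussian":
        return eta  # identity link
    if family == "bernoulli":
        return 1.0 / (1.0 + math.exp(-eta))  # logit link
    raise ValueError("unsupported family: " + family)

# Hypothetical coefficients, as if the model had already been fitted
# with both columns treated as regular predictors.
coefs = {"age": 0.5, "region_score": 2.0}
intercept = 1.0
row = {"age": 4.0, "region_score": 3.0}

with_control = predict(row, coefs, intercept)
without_control = predict(row, coefs, intercept,
                          control_variables=("region_score",))
```

In this sketch the coefficients are the same in both calls; only the scoring step differs, which mirrors the idea that control variables take part in fitting but not in the final predictions.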
Requirements from the customer:
1. During model training, which metrics should be used for optimization purposes (like early stopping or lambda search)?
In addition, we would prefer the same to be applied for the offset. In other words, during model training, for the metrics used for optimization purposes (like early stopping or lambda search), we would prefer the metrics to be calculated with both control variable effects and offset effects included.
2. For variable importance calculations, would you prefer (a) to include control variables in importance rankings but mark them as "control", or (b) to exclude control variables from importance rankings entirely?
We would prefer to include control variables in importance rankings, but mark them as "control".
3. When displaying model metrics in output summaries, what would you prefer?
We would prefer the same to be applied for the offset. In other words, we would prefer two sets of metrics to be displayed: (1) with both control effects and offset included, and (2) with both control effects and offset excluded.
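As a rough sketch of the two metric sets requested above, the following plain-Python example computes MSE for a Gaussian model twice: once with the control effect and the offset included, and once with both excluded. The data, the coefficients, the column names (`x`, `ctrl`, `offset`, `y`) and the helper names are all hypothetical, and the way the effects are excluded (dropping the terms from the linear predictor) is an assumption, not a description of the actual implementation.

```python
def predict_gaussian(row, coefs, intercept, include_control_and_offset,
                     control_variables=("ctrl",)):
    """Identity-link prediction with control terms and offset switched on or off together."""
    eta = intercept
    for name, beta in coefs.items():
        if name in control_variables and not include_control_and_offset:
            continue  # assumption: an excluded effect is simply left out of the sum
        eta += beta * row[name]
    if include_control_and_offset:
        eta += row["offset"]
    return eta

def mse(rows, coefs, intercept, include_control_and_offset):
    """Mean squared error of the predictions against the 'y' column."""
    errors = [
        (row["y"] - predict_gaussian(row, coefs, intercept,
                                     include_control_and_offset)) ** 2
        for row in rows
    ]
    return sum(errors) / len(errors)

# Hypothetical fitted coefficients and a tiny hypothetical data set.
coefs = {"x": 1.0, "ctrl": 2.0}
intercept = 0.0
rows = [
    {"x": 1.0, "ctrl": 1.0, "offset": 0.5, "y": 3.5},
    {"x": 2.0, "ctrl": 0.0, "offset": 1.0, "y": 3.0},
]

metrics = {
    "mse_with_control_and_offset": mse(rows, coefs, intercept, True),
    "mse_without_control_and_offset": mse(rows, coefs, intercept, False),
}
```

Reporting both values side by side corresponds to the two sets of metrics in the requirement; which set drives early stopping or lambda search is a separate decision covered by requirement 1.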
TODO:
- [x] Add new parameter control variables
- [x] Implement scoring with/without control variables for regression distributions and the binomial distribution
- [x] Calculate scoring metrics with/without control variables (early stopping metrics only)
- [x] Edit scoring history table (add new metrics)
- [x] Variable importance - mark control variables (for example variable_control)
- [ ] Implement scoring with/without control variables for the multinomial distribution (will be implemented in a separate PR)
Tests:
- [x] Validation of the new parameter in Java
- [x] Test functionality with basic data in Java
- [x] Basic test that control variables work in Python
- [x] Basic test that control variables work in R
- [x] Scoring, prediction with/without control variables
- [x] Check scoring metrics with/without control variables
- [x] Generation of the scoring history table
- [x] Variable importance
Other work (to be implemented in separate PRs):
- Grid search
- Lambda search
- Interactions