
Divergence in HistGradientBoostingClassifier's scores

Open maximilianeber opened this issue 2 years ago • 7 comments

Hi,

I am trying to build a standard pipeline for tabular data that works nicely with ONNX. Ideally, the pipeline would:

  1. Be based on boosted trees
  2. Gracefully support mixed types (categorical/numerical)
  3. Exploit boosted trees' native support for categoricals
  4. Exploit boosted trees' native support for missing values

To keep debugging simple, I have built a pipeline that covers points 1-3. Preprocessing works fine, but HistGradientBoostingClassifier returns different predictions (see gist).

Any ideas why this might happen? Are there known issues with HistGradientBoostingClassifier?

Thank you!

Package versions:

scikit-learn==1.3.*
skl2onnx==1.16.*
onnxruntime==1.16.*

maximilianeber avatar Dec 19 '23 13:12 maximilianeber

After some digging, I think this might be related to missing categorical support — everything works as expected when using one-hot encoding in the preprocessor.

@xadupre I am happy to try filing a PR if you think it's a good idea to add support for categoricals. Wdyt?

maximilianeber avatar Jan 03 '24 11:01 maximilianeber

I did not check their implementation recently, but if scikit-learn supports categories the same way lightgbm does, I guess they use a rule of the form `if x in {cat1, cat2, ...}`, which is not supported by onnx. onnxmltools deals with that case by multiplying nodes (https://github.com/onnx/onnxmltools/blob/main/onnxmltools/convert/lightgbm/operator_converters/LightGbm.py#L841), but the best way would be to update onnx to support that rule. That said, I do think it is a good idea to support categorical features.
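For context on the node-multiplication workaround: the `ai.onnx.ml` tree ensemble operators only offer single-comparison node modes (`BRANCH_LEQ`, `BRANCH_EQ`, ...), so a set-membership split has to be rewritten as a chain of equality tests, one node per category. A toy sketch of the equivalence in plain Python (not the actual onnxmltools code):

```python
def set_split(x, categories):
    """The rule a lightgbm-style categorical split encodes directly."""
    return x in categories


def expanded_split(x, categories):
    """Equivalent chain of single-category equality tests, as produced by
    node multiplication: (x == c1) or (x == c2) or ..."""
    for c in categories:
        if x == c:  # conceptually one BRANCH_EQ node per category
            return True
    return False


cats = {1, 4, 7}
assert all(set_split(v, cats) == expanded_split(v, cats) for v in range(10))
```

The expansion preserves predictions but inflates the node count by the size of each category set, which is why a native set-membership rule in the onnx spec would be preferable.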

xadupre avatar Jan 03 '24 15:01 xadupre

The right way to do it is to implement the latest onnx specification (https://github.com/onnx/onnx/pull/5874) and then to update onnxruntime to support it.

xadupre avatar Apr 04 '24 07:04 xadupre

The problem with one-hot encoding is that histogram gradient boosting might learn weird interactions between the individual one-hot encoded features during modeling. Therefore, it might not be the same as specifying that feature as categorical in the model definition with `categorical_features`: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html#sklearn.ensemble.HistGradientBoostingRegressor

ogencoglu avatar Jun 04 '24 13:06 ogencoglu

> The right way to do it is to implement the latest onnx specification (onnx/onnx#5874) and then to update onnxruntime to support it.

Sorry for being so late in replying. Sadly, we haven't found the capacity to contribute upstream this quarter. 👎

> Therefore, it might not be the same as specifying that feature as categorical in the model definition with `categorical_features`

Agreed. The other downside of one-hot encoding is that you need a lot of memory when the cardinality of the categorical feature(s) is high.
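To put rough numbers on the memory point (illustrative sizes, assuming a dense float32 matrix; a sparse one-hot output mitigates the storage cost, but the model still sees one feature per category):

```python
# Back-of-envelope: dense one-hot expansion of a single high-cardinality
# categorical column versus a single ordinal/native-categorical column.
n_rows, cardinality = 1_000_000, 10_000

dense_onehot_bytes = n_rows * cardinality * 4  # one float32 per dummy column
ordinal_bytes = n_rows * 4                     # one float32 code per row

print(f"one-hot: {dense_onehot_bytes / 1e9:.0f} GB")  # 40 GB
print(f"ordinal: {ordinal_bytes / 1e6:.0f} MB")       # 4 MB
```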

maximilianeber avatar Jun 04 '24 13:06 maximilianeber

> The right way to do it is to implement the latest onnx specification (onnx/onnx#5874) and then to update onnxruntime to support it.

I think an update to onnxruntime is pending review :)

adityagoel4512 avatar Aug 14 '24 09:08 adityagoel4512

> I think an update to onnxruntime is pending review :)

It's been merged in now.

khoover avatar Jan 23 '25 23:01 khoover