Warning when using sparse categorical values
I have a question about a warning message that I get when training a LightGBM model with lgbm.train:
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero
The reason is that I have a column, specified as categorical, that contains the following integers:
[1015, 1033, 1128, 1398, 1541, 1673, 1677]
In the documentation it says:
"All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero."
My values are not particularly large. The "consider using consecutive integers starting from zero" part sounds like a suggestion rather than a requirement: what happens if the values are not consecutive? How does the sparseness affect the performance of LightGBM? Another categorical column of my dataset has the three values
[1, 3, 4]
and this column does not cause the same warning.
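For reference, here is a minimal script that, as far as I can tell, reproduces the warning (the column name and data are made up, the sparse codes are the ones from above):

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(0)

# Hypothetical column holding the sparse category values from above.
codes = [1015, 1033, 1128, 1398, 1541, 1673, 1677]
X = pd.DataFrame({"region_id": rng.choice(codes, size=500)})
y = rng.integers(0, 2, size=500)

train_set = lgb.Dataset(X, label=y, categorical_feature=["region_id"])

# During Dataset construction / training this prints:
# [LightGBM] [Warning] Met categorical feature which contains sparse values.
#   Consider renumbering to consecutive integers started from zero
booster = lgb.train({"objective": "binary"}, train_set, num_boost_round=5)
```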
I think I found the reason. The documentation says:
Optimal Split for Categorical Features: ... LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram.
So if we use a categorical feature whose values are sparse (a few distinct values spread over a large range), a large histogram would be generated, and that can be memory consuming. Presumably the warning threshold compares the largest value against the number of distinct values, which would explain why [1015, ..., 1677] (seven distinct values with a maximum of 1677) triggers it while [1, 3, 4] does not.
Pardon me if I am wrong.
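In any case, remapping the column to consecutive integers starting from zero makes the warning go away for me. A sketch with pandas (the column name is made up):

```python
import pandas as pd

df = pd.DataFrame({"region_id": [1015, 1033, 1128, 1398, 1541, 1673, 1677]})

# astype("category").cat.codes assigns codes 0..k-1 (in sorted order for
# numeric data); keep the categories around to decode back to original ids.
as_cat = df["region_id"].astype("category")
mapping = dict(enumerate(as_cat.cat.categories))  # code -> original id
df["region_id"] = as_cat.cat.codes

print(df["region_id"].tolist())  # [0, 1, 2, 3, 4, 5, 6]
```

An sklearn OrdinalEncoder would work just as well; the important part is that the encoded values are dense integers starting from zero.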