
Support for Categorical Variables in LightGBM

chris-smith-zocdoc opened this issue • 6 comments

LightGBM supports categorical variables using an integer encoding. https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#categorical-feature-support

In the tree dump this is represented using the equals operator, with a pipe-delimited string for the categorical set.

Example

# Normal split
"threshold": 4.285125194551891,
"decision_type": "<=",


# Categorical split
"threshold": "4||6||7||8||9||22||28||32||63||64",
"decision_type": "=="

It's important to optimize the set membership check for performance reasons. At a minimum I think we'll need to

  • use a data structure like a set (or binary search an array)
  • hoist the set to the top level so it's only initialized once

I tried a naive solution (inlined ORs: feat == a || feat == b ...) in addition to hoisted sets in Java. The performance difference was ~6x on my model, which has a large number of categorical operations (~30 members in each set).
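For context, here is a minimal sketch of the two variants I compared (hypothetical categories and leaf values; the hoisted form matches the generated code further down):

import java.util.Arrays;
import java.util.HashSet;

public class SplitVariants {
    // Naive form: the OR-chain is re-evaluated comparison by comparison
    // on every call, ~30 comparisons per split in my model.
    static double naiveSplit(double[] input) {
        double feat = input[1];
        if (feat == 4 || feat == 6 || feat == 7 || feat == 8 || feat == 9
                || feat == 22 || feat == 28 || feat == 32) {
            return 0.5248953464877794;
        }
        return 0.5112928306594767;
    }

    // Hoisted form: the set is built once in a static initializer and
    // each split is a single O(1) membership check.
    static final HashSet<Integer> CATS = new HashSet<>(
            Arrays.asList(4, 6, 7, 8, 9, 22, 28, 32));

    static double hoistedSplit(double[] input) {
        return CATS.contains((int) input[1])
                ? 0.5248953464877794
                : 0.5112928306594767;
    }
}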

If my understanding of m2cgen is right, I think this should be a new operator in interpreters/interpreter.py

I have a PoC here https://github.com/BayesWitnesses/m2cgen/compare/master...Zocdoc:cs_categorical if you'd like to look at the code. It's just a hack, and I'd like your guidance on making these changes in a better way.

If it's helpful I can open a PR and we can discuss the code there (also, GitHub allows maintainer commits to PR branches now, which is nice).

chris-smith-zocdoc avatar Oct 01 '19 04:10 chris-smith-zocdoc

Hey @chris-smith-zocdoc!

This is some great investigation and PoC 👍

There are a few things I'd like to understand better:

  1. Can this just be solved with one-hot encoding? How significant is the difference in model performance?
  2. Does XGBoost support this approach of handling categorical variables?
  3. The fact that we pass a vector of doubles to the score function bothers me a little, since the following line:

     v1.contains((int) featureRef)

     may yield bad results on border values like 0.99999999, 1.9999999, etc. (see the sketch after this list).
  4. How well does this generalize to other languages?
  5. Attaching a small generated code snippet would be super helpful!
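
A minimal demo of the border-value concern (hypothetical values, just to illustrate the truncation behavior):

public class TruncationDemo {
    public static void main(String[] args) {
        // (int) casts in Java truncate toward zero, so an input that was
        // meant to be category 1 but arrives as 0.99999999 lands in
        // category 0 instead.
        System.out.println((int) 0.99999999); // 0
        System.out.println((int) 1.99999999); // 1
        // Rounding before the lookup would sidestep the border issue.
        System.out.println((int) Math.round(0.99999999)); // 1
    }
}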

Thank you!

izeigerman avatar Oct 01 '19 16:10 izeigerman

Can this just be solved with one-hot encoding? How significant is the difference in model performance?

No, doing this with one-hot encoding requires the tree to make many more splits, and each split reduces the loss less overall, making it less likely for the training process to choose it. By aggregating multiple categories into a single split there is much more information to be gained at each split. On our datasets, using the categorical encoding provides a significant improvement.

I believe this is "Exclusive Feature Bundling (EFB)" in the paper

Does XGBoost support this approach of handling categorical variables?

As far as I'm aware, XGBoost does not offer a similar feature. I believe CatBoost does offer a similar feature but I have not used that library yet.

The fact that we pass a vector of doubles to the score function bothers me a little, since the following line

It looks ugly, but I think it is fine. The values are expected to be integers in both the model format and the input vector, so users are expected to send in either 0.0d or 1.0d; providing 0.9999 shouldn't happen.

How well does this generalize on other languages?

Most languages have collection types in their standard library, as well as static variables. Doing it in C will likely be the hardest; the way treelite implements this is complex (bit shifts & masking).
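
As a more C-portable fallback, a sorted static array plus binary search could also work. A minimal sketch in Java (hypothetical category set; the same shape translates to C with bsearch or a hand-rolled loop):

import java.util.Arrays;

public class SortedArrayContains {
    // Hypothetical category set, stored sorted so binary search applies.
    static final int[] CATS_1 = {4, 6, 7, 8, 9, 22, 28, 32, 63, 64};

    static boolean contains(int[] sortedCats, double featureRef) {
        return Arrays.binarySearch(sortedCats, (int) featureRef) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(contains(CATS_1, 7.0));   // true
        System.out.println(contains(CATS_1, 10.0));  // false
    }
}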

Attaching a small generated code snippet would be super helpful!

import lightgbm as lgb
import m2cgen as m2c
import numpy as np

from lightgbm.sklearn import LGBMRegressor

np.random.seed(seed=7)
data = np.random.randint(100, size=(1000, 2))
target = np.random.random(size=(1000, 1))

estimator = LGBMRegressor(max_depth=1, num_leaves=5, n_estimators=2)
estimator.fit(data, target, categorical_feature=[0, 1])

res = m2c.export_to_java(estimator)

with open('Model.java', 'w') as f:
    f.write(res)

import java.util.HashSet;
import java.util.Arrays;
public class Model {

    public static double score(double[] input) {
        return subroutine0(input);
    }
    public static double subroutine0(double[] input) {
        return (0) + (subroutine1(input));
    }
    public static double subroutine1(double[] input) {
        double var0;
        if (contains(var_1, input[1])) {
            var0 = 0.5248953464877794;
        } else {
            var0 = 0.5112928306594767;
        }
        double var1;
        if (contains(var_2, input[1])) {
            var1 = 0.007352130979902108;
        } else {
            var1 = -0.003600226299609507;
        }
        return (var0) + (var1);
    }
    static HashSet<Integer> var_1,var_2;
    static {
        var_1 = parseIntArray("4,7,16,25,26,29,31,35,38,45,65,68,71,83,84,88,97");
        var_2 = parseIntArray("4,6,7,16,25,26,29,31,35,36,38,41,45,54,59,64,65,66,68,69,71,83,84,88,93,97");
    }
    public static double[] addVectors(double[] v1, double[] v2) {
        double[] result = new double[v1.length];
        for (int i = 0; i < v1.length; i++) {
            result[i] = v1[i] + v2[i];
        }
        return result;
    }
    public static double[] mulVectorNumber(double[] v1, double num) {
        double[] result = new double[v1.length];
        for (int i = 0; i < v1.length; i++) {
            result[i] = v1[i] * num;
        }
        return result;
    }
    public static boolean contains(HashSet<Integer> v1, double featureRef) {
        return v1.contains((int) featureRef);
    }
    public static HashSet<Integer> parseIntArray(String input) {
        String[] strings = input.split(",");
        HashSet<Integer> numbers = new HashSet<Integer>(strings.length);
        for (int i = 0; i < strings.length; i++) {
            numbers.add(Integer.parseInt(strings[i]));
        }
        return numbers;
    }
}

chris-smith-zocdoc avatar Oct 01 '19 16:10 chris-smith-zocdoc

@izeigerman did you have any more thoughts on how I should approach the implementation for this?

chris-smith-zocdoc avatar Oct 14 '19 21:10 chris-smith-zocdoc

@chris-smith-zocdoc sorry for taking so long. I'm still thinking this through. Although the proposal sounds reasonable I'm still not sure what to do about C.

izeigerman avatar Oct 17 '19 02:10 izeigerman

For reference, the original C++ implementation: https://github.com/microsoft/LightGBM/blob/9558417a12dec04b2ee9ba1e3f8be40521b33642/src/io/tree.cpp#L345-L359.

The needed info is available via model.booster_.save_model(), not .dump_model(): https://github.com/BayesWitnesses/m2cgen/blob/916552eaf5806a088803ddb3f42c35b24858b6f9/m2cgen/assemblers/boosting.py#L139

Tree=1
num_leaves=4
num_cat=3
split_feature=12 12 0
split_gain=10774.4 2219.08 1572.71
threshold=0 1 2
decision_type=9 9 9
left_child=1 -1 -2
right_child=2 -3 -4
leaf_value=0.79976605790571909 -0.17181378682228654 0.13953785120708045 -0.6826773548752747
leaf_weight=117 155 112 122
leaf_count=117 155 112 122
internal_value=0 4.96209 -4.12762
internal_weight=0 229 277
internal_count=506 229 277
cat_boundaries=0 1 2 3
cat_threshold=3064 120 11
shrinkage=0.1

Here is a Go implementation: https://github.com/dmitryikh/leaves/blob/43f8378db1939e1cb688a743281558df0abfc66c/lgtree.go#L49-L65
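
If I read the format right, cat_boundaries gives the word range for each categorical split, and each bit of cat_threshold marks one category. A minimal decoding sketch (in Java to match the snippets above; names are mine):

public class CatBitset {
    // Returns true if category c is present in the 32-bit words
    // bits[start..end), LightGBM-style: bit (c % 32) of word (c / 32).
    static boolean findInBitset(int[] bits, int start, int end, int c) {
        int word = start + c / 32;
        if (c < 0 || word >= end) {
            return false;
        }
        return (bits[word] >>> (c % 32) & 1) != 0;
    }

    public static void main(String[] args) {
        // Example: a single word 3064 = 0b101111111000 would encode the
        // category set {3, 4, 5, 6, 7, 8, 9, 11}.
        int[] catThreshold = {3064};
        System.out.println(findInBitset(catThreshold, 0, 1, 5));  // true
        System.out.println(findInBitset(catThreshold, 0, 1, 10)); // false
    }
}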

StrikerRUS avatar Nov 28 '19 00:11 StrikerRUS

Any progress on this? With the introduction of categorical variables in pandas, there's more need than ever for the ability to transpile LightGBM models trained with categorical features. It really does hurt performance to have to "downgrade" categorical features in a LightGBM model to either one-hot encoded or integer-mapped features. Even if it's not super efficient, having some basic functionality around this would be very helpful.

Thanks for all the great work y'all are doing on this package! Incredibly useful.

scottfleming avatar Jul 28 '20 14:07 scottfleming