
Redundant rules in linear model

Open lukauskas opened this issue 5 years ago • 3 comments

The current implementation of rulefit can sometimes produce redundant features that are then fed into the lasso. This comes from the stochastic nature of the tree generation and the lack of rule pruning.

To illustrate this, let's have a look at the Boston example in the README:

import numpy as np
import pandas as pd
np.random.seed(42)

from rulefit import RuleFit

boston_data = pd.read_csv("boston.csv", index_col=0)

y = boston_data.medv.values
X = boston_data.drop("medv", axis=1)
features = X.columns
X = X.values  # .as_matrix() is deprecated in recent pandas versions

rf = RuleFit()
rf.fit(X, y, feature_names=features)

The inner workings of RuleFit rely on an L1-penalised linear model that acts on the rule-transformed input variables, i.e.:

transformed_x = pd.DataFrame(rf.transform(X), 
                             columns=list(rf.rule_ensemble.rules))

Now what happens (even with the Boston dataset) is that this transformed_x contains columns that are, for all practical purposes, duplicates given the training set. This comes from different decision trees having slightly different splits that end up having the same effect on X. To see this, we can build a hash table of rules keyed by the transformed_x values:

equivalence_classes = {}


for rule, col in transformed_x.T.iterrows():
    
    # Tuple is hashable
    key_pos = tuple(col)
    
    try:
        equivalence_classes[key_pos].append(rule)
    except KeyError:
        equivalence_classes[key_pos] = [rule]
        
redundant_rules = {k: v for k, v in equivalence_classes.items() if len(v) > 1}
for v in redundant_rules.values():
    print('Redundant rules')
    for vv in v:
        print('{}: support={:.3f}, prediction={:.3f}'.format(vv, vv.support, vv.prediction_value))

Of the groups in the output, some are quite obviously redundant, such as:

lstat <= 4.630000114440918: support=0.107, prediction=7.044
lstat <= 4.664999961853027: support=0.107, prediction=7.654
lstat <= 4.644999980926514: support=0.107, prediction=4.838
lstat <= 4.650000095367432: support=0.073, prediction=12.706
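
For the lstat group this is easy to verify directly: those thresholds all fall in the same gap of the training data, so the rules select exactly the same rows. A quick check, reusing boston_data from the example above:

lstat = boston_data.lstat.values

# Expect 0: no training observation lies strictly between the outermost
# thresholds, so all four rules pick out exactly the same rows
print(((lstat > 4.630000114440918) & (lstat <= 4.664999961853027)).sum())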

Others are perhaps less intuitive:

nox <= 0.6569999754428864 & dis <= 1.3727499842643738 & rm <= 7.648000001907349: support=0.009, prediction=13.002
nox <= 0.6569999754428864 & dis <= 1.3980499505996704: support=0.013, prediction=14.541
crim <= 17.186299800872803 & rm > 7.797999858856201: support=0.030, prediction=4.344
rm > 6.850500106811523 & rm > 7.7834999561309814: support=0.056, prediction=13.445

I guess the difference in support comes from random subsampling during gradient boosting.

This makes the linear model poorly specified and will cause model instability in the best case, as the L1 penalty will keep picking one rule from each equivalent group at random.

To solve this, one would probably need to merge these rules. I'm happy to submit a PR that does this, but I think we first need to discuss the best way to approach it.

The logical option is probably an OR operator. This would handle cases where the rules, while equivalent on the training set, might not be equivalent on other datasets.

The question, of course, is what to do with the support and prediction values during the merge. If I understand the code correctly, prediction_value is not used anywhere. Can the support be recalculated at the linear-model step?
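
For what it's worth, a minimal sketch of what such a merge could look like, reusing equivalence_classes and transformed_x from above (merged_columns and label are names I'm making up just for illustration):

import numpy as np

merged_columns = {}
for key, rules in equivalence_classes.items():
    if len(rules) == 1:
        merged_columns[str(rules[0])] = np.asarray(key, dtype=float)
        continue
    # On the training set the member columns are identical, so their
    # elementwise OR (np.maximum over the columns) is just the shared column
    merged = np.asarray(key, dtype=float)
    label = ' OR '.join(str(rule) for rule in rules)
    merged_columns[label] = merged
    # Support of the merged rule, recomputed as the fraction of samples covered
    print('{}: support={:.3f}'.format(label, merged.mean()))

On unseen data the merged rule would be evaluated as the elementwise OR (np.maximum) of its member rules, which is where it can genuinely differ from any single member.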

lukauskas avatar Feb 12 '19 17:02 lukauskas

As an aside, if my intuition is correct, the following formulations are also equivalent, where a and b are two rule-transformed columns and the w's are the weights:

y = w0 + w1 a + w2 b

and

y = w0 + w1 a + w2 (1 - b) = w0 + w1 a + w2 - w2 b = (w0 + w2) + w1 a - w2 b

So in the end, rules A and B are also equivalent in terms of the linear model if transform(A) = 1 - transform(B), i.e. a rule is interchangeable with its own negation.

This only holds when the linear model contains an intercept (which is the default in RuleFit). If no intercept is used, merging such rules may not be desirable, depending on the research question.
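
As a quick sanity check of the algebra (a toy example with sklearn on random data, unrelated to rulefit internals):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
a = rng.randint(0, 2, size=100)
b = rng.randint(0, 2, size=100)
y_toy = 1.0 + 2.0 * a - 3.0 * b + rng.normal(scale=0.1, size=100)

# Fit on (a, b) and on (a, 1 - b); both models have an intercept
m1 = LinearRegression().fit(np.column_stack([a, b]), y_toy)
m2 = LinearRegression().fit(np.column_stack([a, 1 - b]), y_toy)

# Identical predictions: the flipped column gets absorbed into the intercept
print(np.allclose(m1.predict(np.column_stack([a, b])),
                  m2.predict(np.column_stack([a, 1 - b]))))  # True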

Such positive-negative redundancies can be found in a similar way:

equivalence_classes_with_negative = {}

for rule, col in transformed_x.T.iterrows():
        
    key_pos = tuple(col)
    key_neg = tuple(1 - col)
    
    try:
        equivalence_classes_with_negative[key_pos].append(('+', rule))
    except KeyError:
        equivalence_classes_with_negative[key_pos] = [('+', rule)]
        
    try:
        equivalence_classes_with_negative[key_neg].append(('-', rule))
    except KeyError:
        equivalence_classes_with_negative[key_neg] = [('-', rule)]

redundant_rules_with_neg = {k: v for k, v in equivalence_classes_with_negative.items() if len(v) > 1}
for v in redundant_rules_with_neg.values():
    print('Redundant rules')
    for vv in v:
        print('{}: support={:.3f}, prediction={:.3f}'.format(vv, vv[1].support, vv[1].prediction_value))

Among the output, we get rule groups like this:

('+', tax <= 278.0): support=0.256, prediction=2.626
('-', tax > 278.0): support=0.744, prediction=-0.623

Arguably we would need only one of these features in the linear model, as the other can be absorbed into the intercept. I'm pretty sure this is a special case of the problem of constructing linearly independent categorical encodings that patsy tries to solve.
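
If it helps, a rough sketch of what dropping the redundant twins could look like, reusing redundant_rules_with_neg and transformed_x from above (the rule objects already serve as column labels, so they can sit in a set; all names here are just for illustration):

# Keep a single representative per positive/negative equivalence class
rules_to_drop = set()
seen_keys = set()

for key, members in redundant_rules_with_neg.items():
    complement_key = tuple(1 - value for value in key)
    if complement_key in seen_keys:
        # The complementary class contains the same rules with flipped signs
        continue
    seen_keys.add(key)
    # Keep the first rule of the class, drop the rest
    for sign, rule in members[1:]:
        rules_to_drop.add(rule)

reduced_x = transformed_x[[c for c in transformed_x.columns if c not in rules_to_drop]]

reduced_x would then be what goes into the lasso, with the intercept absorbing the dropped complements.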

I wonder if something like this should be implemented here as well.

lukauskas avatar Feb 12 '19 18:02 lukauskas

Very nice observation. Lasso does tend to randomly pick a feature (rule) in the case of collinearity!

For the OR option, do you mean merging all equivalent rules, say R1, R2, R3, into a composite R* = (R1 or R2 or R3)?

Would a simpler solution be to whip out Occam's razor and simply pick the simplest rule (lowest number of conditions), or pick at random if they are equally simple? Or do you think it is important to let the user see the options for equivalent rules...
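
For reference, a rough sketch of that option, reusing the equivalence_classes dictionary from the first comment (counting '&' in the printed rule, as in the output above, is just a crude proxy for the number of conditions):

import random

def n_conditions(rule):
    # Crude proxy: count conditions from the printed rule, e.g. "a <= 1 & b > 2"
    return str(rule).count('&') + 1

representatives = []
for rules in equivalence_classes.values():
    simplest = min(n_conditions(r) for r in rules)
    # Pick at random among the equally simple candidates
    representatives.append(random.choice([r for r in rules if n_conditions(r) == simplest]))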

chriswbartley avatar Feb 13 '19 07:02 chriswbartley

My concern is that the simplest rule might overgeneralise from the training dataset. In theory, there is a small chance that the additional rules, while redundant on the training set, capture some extra variation in test data. An OR could preserve this.

For instance, for something like the following dataset of (x,y) pairs:

[(1, 2), (3, 4), (5,6)]

The following rules are redundant:

R1: x > 1
R2: x <= 5 and y > 3

But for a hypothetical test observation (0, 4) this is no longer the case. The simplest rule, R1, evaluates to zero for this observation, whereas R1 or R2 correctly captures it as 1.

Ideally I think something like R* = simplify(R1 or R2 or R3) would work best for such rules. In this case it is clear that

R* = (x > 1 or x <= 5) and (x > 1 or y > 3) = True and (x > 1 or y > 3) = x > 1 or y > 3
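
A quick empirical check of that simplification on random points (nothing rulefit-specific here):

import numpy as np

rng = np.random.RandomState(0)
xs = rng.uniform(-10, 10, size=10000)
ys = rng.uniform(-10, 10, size=10000)

r1 = xs > 1
r2 = (xs <= 5) & (ys > 3)
simplified = (xs > 1) | (ys > 3)

# The OR of the original rules agrees with the simplified form everywhere
print(np.array_equal(r1 | r2, simplified))  # True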

Not entirely sure how to implement this though. Maybe sympy?

Related to my second comment in this issue, I have been thinking about that and either I don't understand something about the decision rule generation as described in Friedman and Popescu, or it could be improved. In my opinion, only one side of each split (condition == True or condition == False) needs to be incorporated at any given level of the tree, as including both adds no extra predictive power to the linear model downstream. Let me see if I can draft a proof for this.

lukauskas avatar Feb 14 '19 10:02 lukauskas