model.rules_ returns None for 1 rule and 1 committee

Open Paulnkk opened this issue 1 year ago • 7 comments

Hey,

If I train cubist with the following lines of code:

from cubist import Cubist

def train_cubist(X_train, y_train, number_of_committees, number_of_rules):
    # Fit a Cubist model and return its rules table, coefficient table, and the model itself
    model = Cubist(verbose=False, n_committees=number_of_committees, n_rules=number_of_rules)
    model.fit(X_train, y_train)

    return model.rules_, model.coeff_, model

and a single linear regression model is returned (for instance, when the sample dataset is very small), model.rules_ is None. I don't think returning None is the best behavior here. Is there something more informative we could return instead?

I am happy to work on a PR if we find a better way to do so

Thank you and best regards,

Paul

Edit:

I am not sure whether Cubist is theoretically meant to produce a model with 1 committee and 1 rule. One suggestion from my side would be to introduce global bounds derived from the dataset: rule1: x <= max(X) and rule2: x > min(X), where max(X) is the maximum value in the dataset and min(X) is the minimum value. This would be consistent with the structure of the data in the case where the number of rules is greater than 1.

Paulnkk avatar Jul 18 '24 12:07 Paulnkk

Been looking at this. It looks like for datasets with 5 or fewer samples, Cubist just returns the average of the target variable. It does appear that for models with a single linear regressor, it's returning nothing for the rules_ attribute but is correct for the coeffs_ attribute. Not sure if this is also present in the R code but I'll keep looking into this and work on a fix unless you beat me to it

pjaselin avatar Jul 26 '24 18:07 pjaselin

Notes:

  • Learned that categorical variables (strings) do work though.
  • Q: Can Cubist do predictions on categorical only data? A: No, it just returns the average of the output
  • Q: What does the model report for multiple committees vs one? A:

pjaselin avatar Jul 26 '24 19:07 pjaselin

@pjaselin thanks for the response! I currently use the following function to work around this issue:

import pandas as pd

def generate_rules_data(coeff_data, X):
    # Only generate rules when coeff_data holds a single rule, numbered 1
    if coeff_data['rule'].nunique() == 1 and coeff_data['rule'].unique()[0] == 1:
        # Duplicate the coefficients so that rule 2 uses the same linear model as rule 1
        duplicated_coeff_data = coeff_data.copy()
        duplicated_coeff_data['rule'] = 2
        coeff_data = pd.concat([coeff_data, duplicated_coeff_data], ignore_index=True)

        # Collect one row per (committee, rule, variable)
        rows = []
        for c in coeff_data['committee'].unique():
            # Use the min and max of each column in X as artificial bounds
            for col in X.columns:
                rows.append({'committee': c, 'rule': 1, 'variable': col, 'dir': '>', 'value': X[col].min()})
                rows.append({'committee': c, 'rule': 2, 'variable': col, 'dir': '<=', 'value': X[col].max()})

        # Build the rules DataFrame in one step instead of concatenating row by row
        rules_data = pd.DataFrame(rows, columns=['committee', 'rule', 'variable', 'dir', 'value'])
        return rules_data, coeff_data

    # More than one rule: nothing to generate, so return the coefficients unchanged
    return None, coeff_data

when calling:

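def train_cubist(X_train, y_train, number_of_committees, number_of_rules):
    model = Cubist(verbose=False, n_committees=number_of_committees, n_rules=number_of_rules)
    model.fit(X_train, y_train)

    # Single rule: replace the missing rules_ with the generated bounds
    if model.coeff_['rule'].nunique() == 1 and model.coeff_['rule'].unique()[0] == 1:
        rules_data, coeff_data = generate_rules_data(model.coeff_, X_train)
        return rules_data, coeff_data, model

    return model.rules_, model.coeff_, model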

Here I 'artificially' create rules based on the max and min values in the dataset (the single linear regression model is defined on the whole feature space). Do you think this is the correct procedure? If you agree, I can start a PR and try to integrate this change.

Paulnkk avatar Jul 30 '24 05:07 Paulnkk

It looks like the actual solution would be to rewrite the _parse_model function. This is what the model output actually looks like for a single regressor:

id="Cubist 2.07 GPL Edition 2024-08-03"
prec="0" globalmean="9.5" extrap="0.05" insts="0" ceiling="19.95" floor="0"
att="outcome" mean="9.5" sd="5.916081" min="0" max="19"
att="x" mode="A"
entries="1"
rules="1"
conds="0" cover="20" mean="9.5" loval="0" hival="19" esterr="5.5"
coeff="10"

This actually looks different from when there are multiple models/committees, so I want to change how this works. I'll likely start by characterizing whether it's a single model or multiple and go from there. I can probably also reduce the logic currently used; it doesn't add much overhead, but it does add extra steps. Hope this makes sense.

pjaselin avatar Aug 03 '24 12:08 pjaselin

The code currently relies on a type parameter to parse out the rules, but that's not always present, so we have to work around that.
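
For illustration, here is a minimal sketch of parsing that does not depend on a type field. It is not the library's _parse_model: parse_line and summarize_rules are hypothetical names, and the model text is the single-regressor output quoted in this thread. The sketch assumes that every rule starts with a line whose first key is conds, and it ignores committees entirely.

```python
import re

# Model text copied from the single-regressor output quoted in this thread.
MODEL_TEXT = '''id="Cubist 2.07 GPL Edition 2024-08-03"
prec="0" globalmean="9.5" extrap="0.05" insts="0" ceiling="19.95" floor="0"
att="outcome" mean="9.5" sd="5.916081" min="0" max="19"
att="x" mode="A"
entries="1"
rules="1"
conds="0" cover="20" mean="9.5" loval="0" hival="19" esterr="5.5"
coeff="10"
'''

# Matches one key="value" pair; values are assumed not to contain double quotes.
PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_line(line):
    """Return the key="value" pairs of one line as a list of (key, value) tuples."""
    return PAIR.findall(line)


def summarize_rules(model_text):
    """Hypothetical helper: one summary dict per rule header line.

    Assumption: a rule header is any line whose first key is conds.
    Committees are not tracked, so rules are numbered in order of appearance.
    """
    rules = []
    for line in model_text.splitlines():
        pairs = parse_line(line)
        if pairs and pairs[0][0] == "conds":
            fields = dict(pairs)
            rules.append({
                "rule": len(rules) + 1,
                "n_conditions": int(fields["conds"]),
                "cover": int(fields["cover"]),
            })
    return rules


if __name__ == "__main__":
    print(summarize_rules(MODEL_TEXT))
```

With the output above, this yields a single entry with zero conditions, which is one way a caller could tell "one rule covering everything" apart from "no rules parsed".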

pjaselin avatar Aug 03 '24 12:08 pjaselin

@pjaselin Thanks a lot for fixing this issue. What should be returned in the R output? How can I understand what output should be stored in model.rules_? What are the rules actually defined in this case? If we return the average of the label values for fewer than 5 samples and we try to create rules for this situation, shouldn't the rules be defined so that they cover the whole dataset, e.g. using the min and max values of the feature space in each dimension?

Paulnkk avatar Aug 04 '24 15:08 Paulnkk

It's actually the C output, and that's the string returned. The problem is that when multiple rules/committees are involved, the string looks different. I'd have to check if the original R code has the same problem (parsing only multiple rules and not handling a single one). You can see from the linked PR that I've started working on it, but I need to walk back and expand on the original logic. I'll take a read through the R code again.

pjaselin avatar Aug 05 '24 11:08 pjaselin

@Paulnkk Can you take a look at the output of the code from the branch connected to this PR and tell me what you think? I'm basically returning a very small table indicating that there is only one rule

pjaselin avatar Sep 01 '24 02:09 pjaselin

I'm also parsing out the rest of the model features as attributes to make that more useful

pjaselin avatar Sep 01 '24 22:09 pjaselin

Planning on closing this and releasing tomorrow if I have the chance and you can look back after

pjaselin avatar Sep 01 '24 23:09 pjaselin