model.rules_ returns None for 1 rule and 1 committee
Hey,
If I train cubist with the following lines of code:
from cubist import Cubist

def train_cubist(X_train, y_train, number_of_committees, number_of_rules):
    model = Cubist(verbose=False, n_committees=number_of_committees, n_rules=number_of_rules)
    model.fit(X_train, y_train)
    return model.rules_, model.coeff_, model
and a single linear regression is returned (for instance, when the dataset is very small), model.rules_ is None. I don't think returning None is the best behavior here. Could it return something more informative?
I am happy to work on a PR if we find a better way to handle this.
Thank you and best regards,
Paul
Edit:
I am not sure whether, in theory, Cubist can produce a model with 1 committee and 1 rule. One suggestion from my side would be to introduce global bounds derived from the dataset, with rule1: x <= max(X) and rule2: x > min(X), where max(X) and min(X) are the maximum and minimum values in the dataset. This would be consistent with the structure of the output when the number of rules is greater than 1.
I've been looking at this. It looks like for datasets with 5 or fewer samples, Cubist just returns the average of the target variable. It does appear that for models with a single linear regressor, it returns nothing for the rules_ attribute but the coeff_ attribute is correct. I'm not sure if this is also present in the R code, but I'll keep looking into it and work on a fix unless you beat me to it.
Notes:
- Learned that categorical variables (strings) do work though.
- Q: Can Cubist do predictions on categorical only data? A: No, it just returns the average of the output
- Q: What does the model report for multiple committees vs one? A:
@pjaselin thanks for the response! I mainly utilize the following function to fix this issue:
import pandas as pd

def generate_rules_data(coeff_data, X):
    # If coeff_data holds a single rule (rule == 1), duplicate it as rule 2
    if coeff_data['rule'].nunique() == 1 and coeff_data['rule'].unique()[0] == 1:
        duplicated_coeff_data = coeff_data.copy()
        duplicated_coeff_data['rule'] = 2
        coeff_data = pd.concat([coeff_data, duplicated_coeff_data], ignore_index=True)
    # Collect one row per (committee, rule, column) and build the DataFrame once
    rows = []
    # Loop over the 'committee' column values in coeff_data
    for c in coeff_data['committee'].unique():
        # Extract min and max values from the dataset X for each column
        for col in X.columns:
            min_val = X[col].min()
            max_val = X[col].max()
            # Rule 1 is the lower bound, rule 2 the upper bound
            rows.append({'committee': c, 'rule': 1, 'variable': col, 'dir': '>', 'value': min_val})
            rows.append({'committee': c, 'rule': 2, 'variable': col, 'dir': '<=', 'value': max_val})
    rules_data = pd.DataFrame(rows, columns=['committee', 'rule', 'variable', 'dir', 'value'])
    return rules_data, coeff_data
when calling:
# Body of train_cubist
model = Cubist(verbose=False, n_committees=number_of_committees, n_rules=number_of_rules)
model.fit(X_train, y_train)
# Default to the fitted attributes; replace them only in the single-rule case
rules_data, coeff_data = model.rules_, model.coeff_
if coeff_data['rule'].nunique() == 1 and coeff_data['rule'].unique()[0] == 1:
    rules_data, coeff_data = generate_rules_data(coeff_data, X_train)
return rules_data, coeff_data, model
Here I 'artificially' create rules based on the max and min values in the dataset, since the single linear regression model is defined on the whole feature space. Do you think this is the correct procedure? If you agree, I can start a PR and try to integrate this change.
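One thing that may be worth checking before settling on these bounds: with a strict `>` on the minimum, the sample that holds the minimum value is not covered by rule 1. The sketch below illustrates this with plain Python lists. The data and the `covered` helper are hypothetical stand-ins for illustration and are not part of Cubist.

```python
import operator

# Hypothetical stand-in data: one feature column with four samples.
x = [0.0, 1.5, 3.0, 4.5]

# Map the 'dir' strings used in the rules table to comparison functions.
OPS = {">": operator.gt, "<=": operator.le}

def covered(values, direction, threshold):
    """Return the values that satisfy `value <direction> threshold`."""
    return [v for v in values if OPS[direction](v, threshold)]

rule1 = covered(x, ">", min(x))   # strict lower bound
rule2 = covered(x, "<=", max(x))  # inclusive upper bound

print(rule1)  # the minimum itself is missing
print(rule2)  # every sample is covered
```

If the goal is for the artificial rules to cover the whole dataset, an inclusive lower bound (or a threshold slightly below the minimum) might be closer to the intent.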
It looks like the actual solution would be to rewrite the _parse_model function. This is what the model output actually looks like for a single regressor:
id="Cubist 2.07 GPL Edition 2024-08-03"
prec="0" globalmean="9.5" extrap="0.05" insts="0" ceiling="19.95" floor="0"
att="outcome" mean="9.5" sd="5.916081" min="0" max="19"
att="x" mode="A"
entries="1"
rules="1"
conds="0" cover="20" mean="9.5" loval="0" hival="19" esterr="5.5"
coeff="10"
This actually looks different from when there are multiple models/committees so I want to change how this works. Likely start by characterizing whether it's a single model or multiple and then go from there. I can probably reduce the logic currently used. Not that it has too much overhead but just adds more steps. Hope this makes sense.
The code currently relies on a type parameter to parse out the rules but that's not always present so we have to work around that.
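As a rough illustration of the "characterize single vs. multiple first" idea, here is a minimal sketch that reads the key="value" pairs from the string above and checks for the one-rule, no-conditions case. This is not the package's `_parse_model`; the function names are made up for the example, and it assumes each committee contributes exactly one line starting with `rules=`.

```python
import re

# Model string copied from the single-regressor output quoted above.
MODEL_TEXT = '''id="Cubist 2.07 GPL Edition 2024-08-03"
prec="0" globalmean="9.5" extrap="0.05" insts="0" ceiling="19.95" floor="0"
att="outcome" mean="9.5" sd="5.916081" min="0" max="19"
att="x" mode="A"
entries="1"
rules="1"
conds="0" cover="20" mean="9.5" loval="0" hival="19" esterr="5.5"
coeff="10"
'''

# Matches one key="value" pair; values may contain spaces but no quotes.
PAIR = re.compile(r'(\w+)="([^"]*)"')

def parse_lines(text):
    """Turn each line holding key="value" pairs into a list of (key, value) tuples."""
    return [PAIR.findall(line) for line in text.splitlines() if PAIR.search(line)]

def is_single_unconditional_rule(text):
    """True when there is exactly one `rules` line with value 1 and one `conds` line with value 0."""
    parsed = parse_lines(text)
    rule_counts = [pairs[0][1] for pairs in parsed if pairs[0][0] == "rules"]
    cond_counts = [pairs[0][1] for pairs in parsed if pairs[0][0] == "conds"]
    return rule_counts == ["1"] and cond_counts == ["0"]

print(is_single_unconditional_rule(MODEL_TEXT))
```

Branching on a check like this up front would avoid depending on a `type` field that is absent when a rule has no conditions.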
@pjaselin Thanks a lot for fixing this issue. What should be returned in the R output? How can I tell what should be stored in model.rules_? What rules are actually defined in this case? If we return the average of the label values for fewer than 5 samples and try to create rules for this situation, shouldn't the rules be defined so that they cover the whole dataset, for example the min and max values of the feature space in each dimension?
It's actually the C output and that's the string returned. The problem is that when multiple rules/committees are involved, the string looks different. I'd have to check if the original R code has the same problem (it only parses multiple rules and does not handle a single one). You can see from the linked PR that I've started working on it, but I need to walk back and expand on the original logic. I'll take a read through the R code again.
@Paulnkk Can you take a look at the output of the code from the branch connected to this PR and tell me what you think? I'm basically returning a very small table indicating that there is only one rule
I'm also parsing out the rest of the model features as attributes to make that more useful
Planning on closing this and releasing tomorrow if I have the chance and you can look back after