
[RFC] Varying the number of outputs considered for splitting in Multi Output Decision Trees

Open CompRhys opened this issue 1 year ago • 9 comments

Describe the workflow you want to enable

One strength of RFRs (random forest regressors) is that they are incredibly robust and therefore provide a strong baseline for many tasks without needing to consider normalization or scaling of either the inputs or the outputs. For multi-output RFRs this robustness towards the output space goes away: impurities are summed across the different output dimensions, so the output labels need to be standardized to ensure that undue attention isn't given to outputs on larger scales. Currently the documentation doesn't readily inform the user of this artifact. In the spirit of the Random Forest, one way to avoid the problem would be to randomly sample which output(s) to consider when determining the split. If the number of outputs considered were set to 1, normalization of the output space would once again not matter.
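A minimal numpy sketch of the effect (not scikit-learn internals): the squared-error criterion aggregates the per-output impurities of a node, so an output on a much larger scale dominates which split looks best.

import numpy as np

rng = np.random.default_rng(0)
y_node = np.column_stack([
    rng.normal(size=200),        # output on unit scale
    100 * rng.normal(size=200),  # comparable signal, 100x the scale
])

# per-output squared-error impurity (variance of the targets in the node)
per_output = y_node.var(axis=0)
# share of the summed impurity contributed by each output: the second output
# accounts for essentially all of it
print(per_output / per_output.sum())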

Describe your proposed solution

The introduction of a new kwarg max_outputs (by analogy with max_features) would allow users to control how many outputs are considered when selecting the optimal split. If set to 1.0, all outputs would be used, as is done currently; if set to 1, a single output would be used, as described above. This seems like a relevant and natural hyper-parameter for the multi-output RFR.
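For reference, here is a rough but runnable approximation of the max_outputs=1 behaviour using only today's public API. It picks the output per tree rather than per split, so it is a simplification of the actual proposal; the function name and the leaf-value recomputation are purely illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def single_output_forest_predict(X_train, y_train, X_test, n_estimators=100, seed=0):
    """Bagged trees whose split structure is learned on one randomly drawn output each."""
    rng = np.random.default_rng(seed)
    n_samples, n_outputs = y_train.shape
    preds = np.zeros((n_estimators, X_test.shape[0], n_outputs))
    for i in range(n_estimators):
        boot = rng.integers(0, n_samples, size=n_samples)  # bootstrap sample
        j = rng.integers(0, n_outputs)                      # output used for splitting
        tree = DecisionTreeRegressor(random_state=i)
        tree.fit(X_train[boot], y_train[boot, j])
        # recompute each leaf's value as the mean of *all* outputs of the
        # bootstrap samples that fall into that leaf
        train_leaves = tree.apply(X_train[boot])
        test_leaves = tree.apply(X_test)
        for leaf in np.unique(test_leaves):
            leaf_mean = y_train[boot][train_leaves == leaf].mean(axis=0)
            preds[i, test_leaves == leaf] = leaf_mean
    return preds.mean(axis=0)

# demo on random data with wildly different output scales
X_tr, X_te = np.random.rand(80, 5), np.random.rand(20, 5)
y_tr = np.column_stack([X_tr[:, 0], 100 * X_tr[:, 1]])
print(single_output_forest_predict(X_tr, y_tr, X_te, n_estimators=10).shape)  # (20, 2)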

Describe alternatives you've considered, if relevant

No response

Additional context

I have not been able to find literature that explores this slight adjustment to the current algorithm. This is an RFC to see whether the team would accept such a PR in principle, without the usual 200+ citation requirement, provided sufficient empirical evidence is given.

CompRhys avatar Dec 01 '23 00:12 CompRhys

Could you please post a minimal reproducible example showing this issue?

adrinjalali avatar Dec 01 '23 16:12 adrinjalali

That's an interesting suggestion. @glouppe are you aware of any reference in the literature that studies this potential source of bias?

ogrisel avatar Dec 01 '23 16:12 ogrisel

I was trying to reproduce the issue; is the code below what you had in mind? I tried to artificially increase the MSE of y1 but could not consistently raise it.

@CompRhys, is this what you had in mind for reproducible code, where we try to get a significantly different value for mse_output1? Should I try more features? Thank you.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Set random seed for reproducibility
np.random.seed(42)

# Generate a synthetic dataset with two features and six output dimensions
# The last five output dimensions have a larger scale than the first
X = np.random.rand(100, 2)
y1 = 2 * X[:, 0] + 3 * X[:, 1] + np.random.normal(scale=0.1, size=100)
y2 = 20 * X[:, 0] + 30 * X[:, 1] + np.random.normal(scale=1, size=100)
y3 = 20 * X[:, 0] + 30 * X[:, 1] + np.random.normal(scale=1, size=100)
y4 = 20 * X[:, 0] + 30 * X[:, 1] + np.random.normal(scale=1, size=100)
y5 = 20 * X[:, 0] + 30 * X[:, 1] + np.random.normal(scale=1, size=100)
y6 = 20 * X[:, 0] + 30 * X[:, 1] + np.random.normal(scale=1, size=100)

y = np.column_stack((y1, y2, y3, y4, y5, y6))

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a multi-output RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model
rf_regressor.fit(X_train, y_train)

# Make predictions
y_pred = rf_regressor.predict(X_test)

# Evaluate the model performance on each output dimension
mse_output1 = mean_squared_error(y_test[:, 0], y_pred[:, 0])
mse_output2 = mean_squared_error(y_test[:, 1], y_pred[:, 1])
mse_output3 = mean_squared_error(y_test[:, 2], y_pred[:, 2])
print("Mean Squared Error (Output 1):", mse_output1)
print("Mean Squared Error (Output 2):", mse_output2)
print("Mean Squared Error (Output 3):", mse_output3)

Also, while I could not find anything specifically on selectively choosing which outputs contribute to the MSE at each partition, I did find this paper that mentions some optimizations that could be applied to Random Forests.

https://arxiv.org/pdf/2309.01460.pdf

Higgs32584 avatar Dec 28 '23 18:12 Higgs32584

Here's a slightly more involved setup to play with. I think your issue might have been that your functions are all strongly rank-correlated, so the split locations won't vary as you vary which output dominates the impurity.

# %%
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from scipy.stats import spearmanr

from sklearn.datasets import make_friedman1
from sklearn.utils.validation import check_random_state

from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import MinMaxScaler

# %%
# Generate a synthetic dataset based on the Friedman functions, with three
# output dimensions whose scales can be varied via the knobs below
PRNG_SEED = 42
n_estimators = 100
n_samples = 100
base_noise = 0.1
f2_scale = 1
f2_noise_scale = 1
f3_scale = 10
f3_noise_scale = 10

# Generate the dataset
X, y_friedman1 = make_friedman1(n_samples=n_samples, noise=base_noise, random_state=PRNG_SEED)

generator = check_random_state(PRNG_SEED)

y_friedman2 = f2_scale * ((
        X[:, 0] ** 2 + (X[:, 1] * X[:, 2] - 1 / (X[:, 1] * X[:, 3])) ** 2
    ) ** 0.5 + f2_noise_scale * generator.standard_normal(size=(n_samples)))

y_friedman3 = f3_scale * (np.arctan(
        (X[:, 1] * X[:, 2] - 1 / (X[:, 1] * X[:, 3])) / X[:, 0]
    ) + f3_noise_scale * generator.standard_normal(size=(n_samples)))

# Check the correlation between the output dimensions
print(f"Correlation between output 1 and output 2: {spearmanr(y_friedman1, y_friedman2)}")
print(f"Correlation between output 1 and output 3: {spearmanr(y_friedman1, y_friedman3)}")
print(f"Correlation between output 2 and output 3: {spearmanr(y_friedman2, y_friedman3)}")

y = np.column_stack((y_friedman1, y_friedman2, y_friedman3))

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# %% 
rf_regressor = RandomForestRegressor(n_estimators=n_estimators, random_state=PRNG_SEED)

# %%
# Create a single-output RandomForestRegressors
# Fit the model for Output 0
rf_regressor.fit(X_train, y_train[:, 0])
y_pred = rf_regressor.predict(X_test)
mse_output1 = mean_squared_error(y_test[:, 0], y_pred)
print("Mean Squared Error (Output 0):", mse_output1)

# Fit the model for Output 1
rf_regressor.fit(X_train, y_train[:, 1])
y_pred = rf_regressor.predict(X_test)
mse_output2 = mean_squared_error(y_test[:, 1], y_pred)
print("Mean Squared Error (Output 1):", mse_output2)

# Fit the model for Output 2
rf_regressor.fit(X_train, y_train[:, 2])
y_pred = rf_regressor.predict(X_test)
mse_output3 = mean_squared_error(y_test[:, 2], y_pred)
print("Mean Squared Error (Output 2):", mse_output3)

# %%
# Create a multi-output RandomForestRegressor
# Fit the model
rf_regressor.fit(X_train, y_train)

# Make predictions
y_pred = rf_regressor.predict(X_test)

# Evaluate the model performance on each output dimension
mse_output1 = mean_squared_error(y_test[:, 0], y_pred[:, 0])
mse_output2 = mean_squared_error(y_test[:, 1], y_pred[:, 1])
mse_output3 = mean_squared_error(y_test[:, 2], y_pred[:, 2])
print("Mean Squared Error (Output 0):", mse_output1)
print("Mean Squared Error (Output 1):", mse_output2)
print("Mean Squared Error (Output 2):", mse_output3)

# %%
# Create a multi-output RandomForestRegressor with a scaling on the output
tt_rfr = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=n_estimators, random_state=PRNG_SEED),
    transformer=MinMaxScaler(),
)

# Fit the model
tt_rfr.fit(X_train, y_train)

# Make predictions
y_pred = tt_rfr.predict(X_test)

# Evaluate the model performance on each output dimension
mse_output1 = mean_squared_error(y_test[:, 0], y_pred[:, 0])
mse_output2 = mean_squared_error(y_test[:, 1], y_pred[:, 1])
mse_output3 = mean_squared_error(y_test[:, 2], y_pred[:, 2])
print("Mean Squared Error (Output 0):", mse_output1)
print("Mean Squared Error (Output 1):", mse_output2)
print("Mean Squared Error (Output 2):", mse_output3)

CompRhys avatar Dec 28 '23 19:12 CompRhys

@CompRhys

Ok, I guess this would be your reproducible code; I will play around with it a little as well.

For max_outputs, how would you go about choosing which outputs are considered when selecting the optimal split?

Higgs32584 avatar Dec 28 '23 19:12 Higgs32584

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/tree/_tree.pyx
https://github.com/scikit-learn/scikit-learn/blob/678e3999eeaadcaaef523d1d0d2f52a25986d460/sklearn/tree/_splitter.pxd
https://github.com/scikit-learn/scikit-learn/blob/678e3999eeaadcaaef523d1d0d2f52a25986d460/sklearn/tree/_criterion.pxd

So you will need to familiarise yourself with these files. I haven't gone through them in depth myself (no time), but somewhere in there is the logic that selects which features to consider when determining the split. You'll want to add analogous logic that samples which outputs to consider when calculating the impurity. There's a node_impurity() method that is probably the thing you'll want to edit.

The proposal here is to add a layer of randomization that draws a random subset of the output dimensions for calculating the impurity; currently the impurity is simply summed over all output dimensions. This would be controlled by the max_outputs argument to tree-based models described in the top post, as sketched below.
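To make the intent concrete, here is a Python-level sketch of the criterion change. This is not the Cython code in _criterion.pyx; the helper name and signature are made up for illustration. The node impurity is computed over a randomly drawn subset of the outputs instead of summing over all of them.

import numpy as np

def node_impurity_subsampled(y_node, max_outputs, rng):
    """Squared-error impurity of a node over a random subset of the outputs.

    y_node: (n_node_samples, n_outputs) array of the targets in the node.
    max_outputs: number of outputs drawn without replacement.
    """
    n_outputs = y_node.shape[1]
    chosen = rng.choice(n_outputs, size=min(max_outputs, n_outputs), replace=False)
    # the current behaviour corresponds to using all outputs,
    # i.e. chosen = np.arange(n_outputs)
    return y_node[:, chosen].var(axis=0).sum()

rng = np.random.default_rng(0)
y_node = np.column_stack([rng.normal(size=30), 100 * rng.normal(size=30)])
print(node_impurity_subsampled(y_node, max_outputs=1, rng=rng))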

CompRhys avatar Dec 28 '23 19:12 CompRhys

@CompRhys ok I'll look into it. Thank you

Higgs32584 avatar Dec 28 '23 19:12 Higgs32584

Hey, update: I started to look deeper into the implementation. I believe the logic you are looking for is located here: scikit-learn/sklearn/tree/_splitter.pyx#L1562 and sklearn/tree/_splitter.pyx#L1598

I will be looking into this further. If anyone gets to it before I do, feel free to open your own pull request. Once I have a reasonable draft I will open a pull request. It might take a while; the logic is tricky.

Higgs32584 avatar Jan 16 '24 19:01 Higgs32584

@ogrisel should I proceed with this issue? Would you like to see it implemented, or do we need a better example? I want to wait until the reproducible code is signed off so I know you'll want this.

Higgs32584 avatar Jan 19 '24 16:01 Higgs32584