Handle Oversampler for partial dependence fast mode
In the initial implementation of partial dependence fast mode in #3753, the Oversampler was causing a few pipelines to produce different partial dependence results in fast mode. After understanding why this was happening, I decided to block the Oversampler from the initial implementation. In this issue, we should determine whether the Oversampler should continue to be blocked, and if we decide to allow it, make sure to handle it properly.
Here is what was happening that caused me to block the Oversampler:
- The Oversampler is dependent on multiple columns. If you call `Oversampler.fit_transform` on a single column and then call it again on a dataframe containing that column and another one, the oversampled values will be different between the two transformations. The difference in values, however, doesn't seem to have a huge impact on partial dependence results: in the vast majority of pipelines where the Oversampler is present, the PD results aren't different at all.
- When the Oversampler's differing output is paired with the `StandardScaler`, the difference in oversampled values is magnified, because it changes the way values are scaled, meaning there is a small impact on every single value in X. The partial dependence values then end up being different in fast mode, though usually only by 0.0001.
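
To see why the scaler magnifies the difference, here is a minimal sketch using scikit-learn's `StandardScaler` directly (a simplified stand-in for the pipeline's scaler, not evalml's component): changing a single synthesized row shifts the fitted mean and standard deviation, so every scaled value moves slightly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))

# Simulate the Oversampler synthesizing one slightly different row
X_perturbed = X.copy()
X_perturbed[-1, 0] += 0.5

scaled = StandardScaler().fit_transform(X)
scaled_perturbed = StandardScaler().fit_transform(X_perturbed)

# The perturbed row shifts the fitted mean/std, so every scaled value
# moves a little, not just the row that changed
assert not np.allclose(scaled[:-1], scaled_perturbed[:-1])
```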
The fact that the different oversampled values produce the same PD results most of the time, and in the cases where they differ are only off by 0.0001, tempted me to say that the Oversampler's dependence on multiple columns has a negligible impact on PD results and we should keep it. Given how many pipelines will get excluded by blocking the Oversampler, I still lean that way. But an important tenet of "fast mode" is that we can't use components that rely on multiple columns, so I think including the Oversampler should wait until we see a need for it; then we can determine whether the differing results are close enough in value to keep, or what we can do to allow its use.
If we do decide to include the Oversampler, extra handling will be needed in several places:
- The Oversampler needs the correct `y_training`: we cannot use a mocked-out y for fitting cloned pipelines. It's likely that we've started including this anyway in fast mode, so no action may be needed here.
- Instantiated Oversamplers sometimes have the `categorical_features` parameter set after fitting, and I ran into an error when cloning and refitting on the single column, which will not have the same set of `categorical_features`. To handle that, the `categorical_features` parameter will need to be removed from the pipeline parameters in a `_handle_partial_dependence_fast_mode` method on the Oversampler (see the first sketch after this list).
- The Oversampler determines which sampler is best to use by looking at all of the columns in `Oversampler._get_best_oversampler` during fit. It's possible that the whole X has some categorical columns, causing `SMOTEN` to be used for the whole X, but if we're calculating partial dependence for a numeric feature, `SMOTE` would be selected when refitting on the single column (see the second sketch after this list). We either need to decide that this is okay or find a way to handle the mismatch.
- Remove `_can_be_used_for_fast_partial_dependence = False` from the component and update the test in `test_components.py`.
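
For the `categorical_features` handling, here is a rough sketch of what the Oversampler override could look like. The method name `_handle_partial_dependence_fast_mode` comes from this issue, but the signature and the shape of the pipeline parameters dict shown here are assumptions, not confirmed evalml API:

```python
# Hypothetical sketch only; the real hook's signature may differ.
def _handle_partial_dependence_fast_mode(self, pipeline_parameters, X=None, target=None):
    """Drop categorical_features so the cloned single-column pipeline
    re-infers it rather than inheriting a stale column list."""
    pipeline_parameters.get(self.name, {}).pop("categorical_features", None)
    return pipeline_parameters
```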
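
And to illustrate the sampler-selection concern, here is a toy stand-in for the column-type check (a simplification for illustration, not evalml's actual `_get_best_oversampler`). The choice flips when we refit on only the numeric column being plotted:

```python
import pandas as pd

# Toy stand-in for the column-type check; not evalml's implementation
def choose_sampler(X: pd.DataFrame) -> str:
    has_categorical = X.select_dtypes(include=["object", "category"]).shape[1] > 0
    return "SMOTEN" if has_categorical else "SMOTE"

X = pd.DataFrame({"Age": [20, 30], "Cabin": ["C85", "E46"]})
assert choose_sampler(X) == "SMOTEN"          # full X has a categorical column
assert choose_sampler(X[["Age"]]) == "SMOTE"  # single numeric PD column flips it
```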
Code to reproduce different results with different columns:
```python
from evalml.pipelines.components import Oversampler

# X_y_categorical_classification is evalml's Titanic-based test fixture
X, y = X_y_categorical_classification
X = X.fillna({"Cabin": "C85", "Embarked": "S", "Age": 20})

# Oversample using the full dataframe
oversampler = Oversampler(sampling_ratio=1, random_seed=0)
X_t, y_t = oversampler.fit_transform(X, y)

# Oversample again with the same seed, using only one of the columns
oversampler_single = Oversampler(sampling_ratio=1, random_seed=0)
X_single = X[["Pclass"]]
X_t_single, y_t_single = oversampler_single.fit_transform(X_single, y)

# Same number of synthetic rows, but the synthesized "Pclass" values differ
assert len(X_t) == len(X_t_single)
assert X_t.loc[len(X_t) - 1, "Pclass"] != X_t_single.loc[len(X_t) - 1, "Pclass"]
```