Handle Oversampler for partial dependence fast mode
In the initial implementation of partial dependence fast mode in #3753, the Oversampler was causing a few pipelines to produce different partial dependence results in fast mode. After understanding why this was happening, I decided to block the Oversampler from the initial implementation. In this issue, we should determine whether the Oversampler should continue to be blocked, and if we decide to allow it, make sure to handle it properly.
Here is what was happening that caused me to block the Oversampler:
- The Oversampler is dependent on multiple columns. If you call `Oversampler.fit_transform` on a single column and then call it again on a dataframe containing that column and another one, the oversampled values will be different between the two transformations. The difference in values, however, doesn't seem to have a huge impact on partial dependence results: in the vast majority of pipelines where the Oversampler is present, the PD results aren't different at all.
- When the Oversampler's differing output is paired with the `StandardScaler`, the difference in oversampled values is magnified, because it changes the way values are scaled, meaning there is a small impact on every single value in X. The partial dependence values then end up being different in fast mode, though usually only by 0.0001.
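
To see why the scaler magnifies the difference, here is a minimal sketch using scikit-learn's `StandardScaler` directly (a simplified stand-in for the pipeline's scaler, not evalml's component): changing a single synthesized row shifts the fitted mean and standard deviation, so every scaled value moves slightly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))

# Simulate the Oversampler synthesizing one slightly different row
X_perturbed = X.copy()
X_perturbed[-1, 0] += 0.5

scaled = StandardScaler().fit_transform(X)
scaled_perturbed = StandardScaler().fit_transform(X_perturbed)

# The perturbed row shifts the fitted mean/std, so every scaled value
# moves a little, not just the row that changed
assert not np.allclose(scaled[:-1], scaled_perturbed[:-1])
```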
The fact that the different oversampled values produce the same PD results most of the time, and in the cases where they differ are only off by 0.0001, tempted me to say that the Oversampler's dependence on multiple columns has a negligible impact on PD results and we should keep it. Given how many pipelines will get excluded by blocking the Oversampler, I still lean that way. But an important tenet of "fast mode" is that we can't use components that rely on multiple columns, so I think including the Oversampler should wait until we see a need for it; then we can determine whether the differing results are close enough in value to keep, or what we can do to allow its use.
If we do decide to include the Oversampler, extra handling will be needed in several places:
- The Oversampler needs the correct `y_training`: we cannot use a mocked-out y for fitting cloned pipelines. It's likely that we've started including this anyway in fast mode, so no action may be needed here.
- Instantiated Oversamplers sometimes have the `categorical_features` parameter set after fitting, and I ran into an error when cloning and refitting on the single column, which will not have the same set of `categorical_features`. To handle that, the `categorical_features` parameter will need to be removed from the pipeline parameters in a `_handle_partial_dependence_fast_mode` method on the Oversampler (see the first sketch after this list).
- The Oversampler determines which sampler is best to use by looking at all of the columns in `Oversampler._get_best_oversampler` during fit. It's possible that the whole X has some categorical columns, causing `SMOTEN` to be used for the whole X, but if we're calculating partial dependence for a numeric feature, `SMOTE` would be selected when refitting on the single column (see the second sketch after this list). We either need to decide that this is okay or find a way to handle the mismatch.
- Remove `_can_be_used_for_fast_partial_dependence = False` from the component and update the test in `test_components.py`.
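
For the `categorical_features` handling, here is a rough sketch of what the Oversampler override could look like. The method name `_handle_partial_dependence_fast_mode` comes from this issue, but the signature and the shape of the pipeline parameters dict shown here are assumptions, not confirmed evalml API:

```python
# Hypothetical sketch only; the real hook's signature may differ.
def _handle_partial_dependence_fast_mode(self, pipeline_parameters, X=None, target=None):
    """Drop categorical_features so the cloned single-column pipeline
    re-infers it rather than inheriting a stale column list."""
    pipeline_parameters.get(self.name, {}).pop("categorical_features", None)
    return pipeline_parameters
```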
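
And to illustrate the sampler-selection concern, here is a toy stand-in for the column-type check (a simplification for illustration, not evalml's actual `_get_best_oversampler`). The choice flips when we refit on only the numeric column being plotted:

```python
import pandas as pd

# Toy stand-in for the column-type check; not evalml's implementation
def choose_sampler(X: pd.DataFrame) -> str:
    has_categorical = X.select_dtypes(include=["object", "category"]).shape[1] > 0
    return "SMOTEN" if has_categorical else "SMOTE"

X = pd.DataFrame({"Age": [20, 30], "Cabin": ["C85", "E46"]})
assert choose_sampler(X) == "SMOTEN"          # full X has a categorical column
assert choose_sampler(X[["Age"]]) == "SMOTE"  # single numeric PD column flips it
```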
Code to reproduce different results with different columns:
```python
from evalml.pipelines.components import Oversampler

# X_y_categorical_classification is evalml's Titanic-based test fixture
X, y = X_y_categorical_classification
X = X.fillna({"Cabin": "C85", "Embarked": "S", "Age": 20})

# Oversample using the full dataframe
oversampler = Oversampler(sampling_ratio=1, random_seed=0)
X_t, y_t = oversampler.fit_transform(X, y)

# Oversample again with the same seed, using only one of the columns
oversampler_single = Oversampler(sampling_ratio=1, random_seed=0)
X_single = X[["Pclass"]]
X_t_single, y_t_single = oversampler_single.fit_transform(X_single, y)

# Same number of synthetic rows, but the synthesized "Pclass" values differ
assert len(X_t) == len(X_t_single)
assert X_t.loc[len(X_t) - 1, "Pclass"] != X_t_single.loc[len(X_t) - 1, "Pclass"]
```