yellowbrick plotting different imputing strategy

Open greatsharma opened this issue 4 years ago • 11 comments

Note: The idea is inspired by a lecture by Andreas Muller.

Describe the solution you'd like
The idea is to visualize how a particular imputer fills in missing values for given feature columns.

Is your feature request related to a problem? Please describe.
It gives a quick visual summary of how different imputation strategies behave on the given feature columns of the data.

Examples
In the image below I took the iris data and set values to NaN across various rows. I then wrote a function that plots how various imputation strategies impute two given columns, col1 and col2 (for iris I used petal length and petal width). For iris I used three different imputation strategies, which are named in the image.

[image: plot_imputation]

The code I used for this visualization is below (note: for now this code is only for demonstration purposes and can be improved):

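import warnings

import matplotlib.pyplot as plt
import numpy as np
from sklearn.exceptions import ConvergenceWarning


def get_full_and_nan_rows(X, col1, col2):
    """
    returns 2 index arrays,
    full_rows, which contains the indices of non-nan rows along given 2 columns.
    nan_rows, which contains the indices of nan rows along given 2 columns.
    """
    # A row counts as a nan row if either of the two columns is NaN.
    has_nan = np.isnan(np.asarray(X, dtype=float)[:, [col1, col2]]).any(axis=1)
    return np.flatnonzero(~has_nan), np.flatnonzero(has_nan)


def plot_2D_imputation(X, y, col1, col2, imputer, xlabel='', ylabel='', title='', figsize=(5,5), alpha=0.6, s=80):
    # y is expected to hold integer class labels (used as color indices below).
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    full_rows, nan_rows = get_full_and_nan_rows(X, col1, col2)

    # Iterative imputers can emit ConvergenceWarning; silence it while fitting.
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', category=ConvergenceWarning)
        X_imp = imputer.fit_transform(X)

    fig, ax = plt.subplots(figsize=figsize)
    # Circles are observed points, squares are imputed points; color is the class.
    ax.scatter(X_imp[full_rows, col1], X_imp[full_rows, col2], c=plt.cm.tab10(
        y[full_rows]), alpha=alpha, s=s, marker='o')
    ax.scatter(X_imp[nan_rows, col1], X_imp[nan_rows, col2], c=plt.cm.tab10(
        y[nan_rows]), alpha=alpha, s=s, marker='s')
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    return ax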
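
For anyone who wants to reproduce the setup described above (iris with NaN values added), here is a minimal sketch of the data preparation. The 20% missing rate, the random seed, and the three imputers chosen here (mean, median, KNN) are assumptions for illustration; the image in the original post may have used different ones. add_nans is a hypothetical helper, not part of Yellowbrick or scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import KNNImputer, SimpleImputer


def add_nans(X, frac=0.2, seed=0):
    """Return a float copy of X with roughly `frac` of its entries set to NaN.

    Hypothetical helper, for illustration only.
    """
    rng = np.random.default_rng(seed)
    X_missing = np.array(X, dtype=float)  # copy, so the input is untouched
    X_missing[rng.random(X_missing.shape) < frac] = np.nan
    return X_missing


X, y = load_iris(return_X_y=True)
X_missing = add_nans(X)

# Assumed choice of strategies; any scikit-learn imputer would fit here.
imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}

# Petal length and petal width are columns 2 and 3 of the iris matrix.
cols = [2, 3]
masked = np.isnan(X_missing[:, cols])

imputed = {name: imp.fit_transform(X_missing) for name, imp in imputers.items()}
for name, X_imp in imputed.items():
    # Mean absolute error on the masked entries only (lower means closer).
    err = np.abs(X_imp[:, cols][masked] - X[:, cols][masked]).mean()
    print(f"{name}: {err:.3f}")
```

Because the true values are known here, the printed error gives a rough numeric companion to the scatter plots.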

Jan 25 '20 10:01 greatsharma

@bbengfort any comment on this?

Jan 27 '20 11:01 greatsharma

Hello @greatsharma and thanks for checking out Yellowbrick! @bbengfort and I are both currently traveling, so it may take us a week or more to respond. We appreciate your patience and your feature suggestion!

Jan 27 '20 12:01 rebeccabilbro

@greatsharma this seems like an interesting idea and we'd love to see a prototype of this. Are you interested in working on this visualizer?

Feb 26 '20 14:02 bbengfort

@bbengfort I am good to go with this idea. Can you elaborate on what should be included in a prototype? Do you mean pseudocode, or something like a complete API?

Feb 27 '20 04:02 greatsharma

Right now I'm thinking of a more thorough proposal with pseudo code and workflows. For example, one of my concerns about this proposal is that it requires multiple axes similar to the splom plot -- if a user has dozens or hundreds of features, this is unfeasible. So does the user get to choose which features are plotted against? Does the user specify the imputer, does the visualizer use one imputer and multiple visualizers are used for each imputer? What other metrics can we use and show in the figure, etc? Does that make sense?

Feb 27 '20 16:02 bbengfort

Note: my first pass at the feature is as follows. I am not an expert in plotting, so any suggestions or improvements are highly appreciated.

workflow
The user will be exposed to plot_2D_imputation (or any other name you suggest; it says 2D because the idea could be extended to 3D plots, but I don't recommend that because 3D imputation plots are hard to interpret). The function will take the following params:

X (data matrix, using this we will infer the nan and full rows)
y (needed for classification problem only, for coloring classes differently)
col1 & col2 (columns needed to visualize)
imputer (strategy used to impute)
other optional params for visualization purpose

Internally I use the get_full_and_nan_rows function, which returns full_rows and nan_rows; these are later used to plot imputed and non-imputed data points with different markers. E.g. in the above example I used marker='o' for full_rows and marker='s' for nan_rows.

Also maybe we can create separate functions for classification and regression.

pseudo code
The code I used for the visualization is given above. It focuses on classification problems (because I color points by class label) but can easily be extended to regression.

So does the user get to choose which features are plotted against?
Yes. As in the above code, plot_2D_imputation takes col1 and col2 as arguments, which are the indices of the columns the user wants to visualize. If we went the splom way then, as you said, it would become infeasible as the number of features grows.

Does the user specify the imputer?
Yes, the user passes the imputer as an argument to plot_2D_imputation.

Does the visualizer use one imputer?
The above code allows only one imputer at a time, but it could easily be extended to several by passing a list of imputers to the imputer param of plot_2D_imputation.
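
A rough sketch of the list-of-imputers idea, with one subplot per imputer. plot_imputers is a hypothetical name and layout, not an existing Yellowbrick function, and the random data is only there to make the example self-contained.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.impute import SimpleImputer


def plot_imputers(X, y, col1, col2, imputers, figsize=(4, 4)):
    """Draw one scatter plot per imputer, side by side (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Rows with a NaN in either plotted column are the ones that get imputed.
    nan_rows = np.isnan(X[:, [col1, col2]]).any(axis=1)

    fig, axes = plt.subplots(
        1, len(imputers),
        figsize=(figsize[0] * len(imputers), figsize[1]),
        squeeze=False,
    )
    for ax, imputer in zip(axes[0], imputers):
        X_imp = imputer.fit_transform(X)
        # Circles for observed rows, squares for imputed rows.
        for rows, marker in ((~nan_rows, "o"), (nan_rows, "s")):
            ax.scatter(X_imp[rows, col1], X_imp[rows, col2],
                       c=plt.cm.tab10(y[rows]), marker=marker, alpha=0.6)
        ax.set_title(type(imputer).__name__)
    return fig, axes[0]


# Synthetic data with about 20% of entries missing (assumed values).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
X[rng.random(X.shape) < 0.2] = np.nan
y = rng.integers(0, 3, size=60)

fig, axes = plot_imputers(
    X, y, 0, 1,
    [SimpleImputer(strategy="mean"), SimpleImputer(strategy="median")],
)
```

The number of axes grows with the number of imputers, which is the multiple-axes concern raised earlier in the thread.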

Multiple visualizers are used for each imputer?
The question is not entirely clear to me, but in the above code I used one scatter plot per imputer.

What other metrics can we use and show in the figure?
For SimpleImputer we can show the statistic values used for imputation, e.g. if mean imputation is used we can show the column means that were filled in, and likewise for median and mode. Suggestions on this are welcome.
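
As a small illustration of where those values come from: a fitted scikit-learn SimpleImputer exposes the per-column fill values in its statistics_ attribute. The toy matrix below is made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: one missing value in each column.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 60.0],
])

stats = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy).fit(X)
    # statistics_ holds one fill value per column.
    stats[strategy] = imputer.statistics_
    print(strategy, imputer.statistics_)
```

For this matrix the column means are 3 and 30 and the column medians are 3 and 20, so those are the values that would be annotated on the plot.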

Mar 03 '20 07:03 greatsharma

@pdamodaran would you mind taking a look at the suggestions by @greatsharma and seeing what you think?

Apr 09 '20 11:04 bbengfort

@bbengfort - sure! I will review this over the weekend.

Apr 10 '20 01:04 pdamodaran

This is quite an interesting idea for a visualizer @greatsharma and I think you are on the right track for building this visualizer.

I am also on the same page as you and @bbengfort in that we should limit the number of features and I think we should stick with two features.

In order to avoid the multiple axes problem, I think we should stick to one imputer that the user can specify, with the visualizer falling back to the SimpleImputer with the "mean" strategy if no imputer is passed.

I like the idea of using different markers to differentiate between missing and non-missing values and using the target values to differentiate the data points for classification problems. I also think that if the SimpleImputer is used, then the statistic values could be overlaid on the chart as dashed lines along the x and y axes.
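
A minimal sketch of the default imputer plus the dashed overlay, under the assumption that the lines sit at the fitted statistics_ values of the two plotted columns. draw and overlay_statistics are hypothetical names.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.impute import SimpleImputer


def overlay_statistics(ax, imputer, col1, col2):
    """Draw dashed lines at the fill values of a fitted SimpleImputer."""
    ax.axvline(imputer.statistics_[col1], linestyle="--", color="gray")
    ax.axhline(imputer.statistics_[col2], linestyle="--", color="gray")
    return ax


def draw(X, col1, col2, imputer=None):
    # Fall back to mean imputation when no imputer is passed.
    if imputer is None:
        imputer = SimpleImputer(strategy="mean")
    X_imp = imputer.fit_transform(np.asarray(X, dtype=float))

    fig, ax = plt.subplots()
    ax.scatter(X_imp[:, col1], X_imp[:, col2])
    # Only SimpleImputer has a single fill value per column to show.
    if isinstance(imputer, SimpleImputer):
        overlay_statistics(ax, imputer, col1, col2)
    return ax, imputer


X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]])
ax, imputer = draw(X, 0, 1)
```

With mean imputation every imputed point lands on one of the dashed lines, which makes the effect of the strategy easy to see.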

Other things to consider:
- this would only work for numeric values, there will need to be functionality to output an error message if non-numeric values are passed in.
- the utility function to get the "full" and "nan" rows assumes that "nan" is the only way to indicate missing values. This function would also need to account for the other methods such as blank values.
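
One way these two checks could look, as a sketch only. to_numeric_with_nan and is_missing are hypothetical helpers, and treating None and blank strings as the extra missing markers is an assumption.

```python
import numpy as np


def is_missing(value, markers):
    """True if `value` is NaN or one of the extra missing-value markers."""
    if isinstance(value, str):
        return value.strip() in markers
    if value is None:
        return None in markers
    return isinstance(value, (float, np.floating)) and np.isnan(value)


def to_numeric_with_nan(X, markers=(None, "")):
    """Convert X to a float array, mapping every missing marker to NaN.

    Raises ValueError for any non-numeric entry that is not a marker.
    """
    X = np.asarray(X, dtype=object)
    out = np.empty(X.shape, dtype=float)
    for idx, value in np.ndenumerate(X):
        if is_missing(value, markers):
            out[idx] = np.nan
        elif isinstance(value, (int, float, np.number)) and not isinstance(value, bool):
            out[idx] = value
        else:
            raise ValueError(f"non-numeric value {value!r} at position {idx}")
    return out


clean = to_numeric_with_nan([[1.0, None], ["", 4], [float("nan"), 2.5]])
print(np.isnan(clean))
```

After this conversion the NaN-based row split used earlier in the thread works unchanged.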

Regarding the idea of passing in multiple imputers, I think we could possibly create a different visualizer whose purpose is to compare the performance of the imputing techniques by passing in an estimator and using cross_val_score to determine which imputer technique worked the best. Refer to the following link for an example: https://scikit-learn.org/stable/auto_examples/impute/plot_missing_values.html#sphx-glr-auto-examples-impute-plot-missing-values-py
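
A sketch of that comparison, along the lines of the linked scikit-learn example. The dataset, the 20% missing rate, the estimator, and the set of imputers are all assumptions chosen to keep it short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Knock out about 20% of the entries at random (assumed missing rate).
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.2] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}

scores = {}
for name, imputer in imputers.items():
    # Putting the imputer in the pipeline refits it on each training fold.
    model = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    scores[name] = cross_val_score(model, X_missing, y, cv=5).mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

A visualizer built on this could draw the mean scores as a bar chart, one bar per imputer.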

@bbengfort - please chime in if there is anything I missed or if you had other thoughts. @greatsharma - let me know if you have any questions and thanks again for checking out Yellowbrick and presenting your idea.

Apr 12 '20 19:04 pdamodaran

@pdamodaran I am with you on this and good to go once we get a thumbs up from @bbengfort

May 04 '20 10:05 greatsharma

Hi @greatsharma, sorry it's taken me so long to respond; my GitHub emails got pretty buried. In principle, I'm fine with the approach that you mentioned. My only comment is to remove the plot_2d from the function name. So far we've chosen to pass 2d or 3d as a parameter to visualizers that do 2d or 3d visualization (see the PCA visualizer for an example). And if the 2d is removed, then plot becomes redundant.
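
To make the naming suggestion concrete, here is one possible shape for the function. The name imputation, the features argument, and the projection parameter are assumptions for illustration, not a settled Yellowbrick API.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers "3d" on older matplotlib
from sklearn.impute import SimpleImputer


def imputation(X, features, imputer=None, projection=2):
    """Scatter the imputed columns in 2D or 3D (hypothetical signature)."""
    if projection not in (2, 3):
        raise ValueError("projection must be 2 or 3")
    if len(features) != projection:
        raise ValueError("pass exactly one column index per plotted dimension")
    if imputer is None:
        imputer = SimpleImputer(strategy="mean")
    X_imp = imputer.fit_transform(np.asarray(X, dtype=float))

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d" if projection == 3 else None)
    # One positional array per axis: (x, y) in 2D, (x, y, z) in 3D.
    ax.scatter(*(X_imp[:, f] for f in features))
    return ax


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
X[rng.random(X.shape) < 0.2] = np.nan

ax2 = imputation(X, features=[0, 1])
ax3 = imputation(X, features=[0, 1, 2], projection=3)
```

The dimensionality lives in one argument, so the function name no longer needs a 2D or 3D suffix.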

We would be interested in seeing some prototypes of this suggestion as a next step!

Jun 10 '20 13:06 bbengfort
