woodwork
Handle zero-sparse columns in box_plot_dict
- As a user, I wish Woodwork's box plot method had a way of calculating outliers for columns with many zeros (zero-sparse) that did not flag every non-zero value as an outlier.
One simple implementation of this is to calculate series.ww.box_plot_dict on a series with all zeros removed. The logic for deciding when to do this may not need to live in Woodwork, but it could be based on how the data's variance differs with and without the zeros present.
Note: This applies when zero is a true value in its own right, as opposed to a marker for null values (a separate data quality issue that should not be accounted for in the box plot calculations). For example, a precipitation column for a place with very little rain will be mostly zeros, but that is not a data quality problem. In that case, the box plot with the zeros present is useful in its own way, since rainy days are indeed outliers. But it may also be useful to isolate the non-zero data (the rainy days) to get a sense of what a rainy day usually looks like. Once the zeros are pulled out, we can inspect the rainy days for outliers, which should be far fewer and more useful for some data inspection purposes (checking for recording errors that need to be removed, for example).
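To make the precipitation case concrete, here is a minimal sketch that uses only the Python standard library. It does not call Woodwork: `iqr_bounds` and `outliers` are hypothetical helpers, the data is made up, and the 1.5 * IQR rule is assumed here as the usual box plot convention rather than taken from Woodwork's implementation.

```python
from statistics import quantiles


def iqr_bounds(values):
    """Return (low_bound, high_bound) using the 1.5 * IQR rule (assumed convention)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(values, exclude_zeros=False):
    """List the values outside the IQR bounds, optionally dropping zeros first."""
    kept = [v for v in values if v != 0] if exclude_zeros else list(values)
    low, high = iqr_bounds(kept)
    return [v for v in kept if v < low or v > high]


# Made-up daily precipitation: 30 dry days, 6 ordinary rainy days, 1 suspicious reading.
precip = [0.0] * 30 + [4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 60.0]

with_zeros = outliers(precip)                         # every rainy day is flagged
without_zeros = outliers(precip, exclude_zeros=True)  # only the 60.0 reading is flagged
```

With the zeros present, both quartiles are zero, so the bounds collapse to zero and all seven rainy days are reported. With the zeros removed, only the one reading that is unusual among rainy days remains.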
Ideally, we can add this ability in a way that avoids doubling the computation and that lends itself to other, similar measures that could be added in the future. If this can't be done, it may be better for the Woodwork user to handle removing zeros themselves and run box_plot_dict again on that series.
A potential idea for the API, where users decide when they want the zeros excluded (this should be explored further, though):
box_plot, no_zero_box_plot = series.ww.box_plot_dict(exclude_zeros)
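One way the two-result shape could avoid repeating work is to sort once and reuse that ordering for the non-zero subset. The sketch below is plain Python, not Woodwork code: `box_plot_pair`, `_quantile`, `_summary`, and the dictionary keys are all hypothetical names chosen for illustration, and the 1.5 * IQR rule is an assumption.

```python
def _quantile(ordered, q):
    """Linear-interpolation quantile of an already sorted, non-empty list."""
    pos = (len(ordered) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def _summary(ordered):
    """Box plot bounds and outliers for an already sorted, non-empty list."""
    q1, q3 = _quantile(ordered, 0.25), _quantile(ordered, 0.75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "low_bound": low,
        "high_bound": high,
        "outliers": [v for v in ordered if v < low or v > high],
    }


def box_plot_pair(values, exclude_zeros=False):
    """Return (full_summary, non_zero_summary); the second is None if not requested."""
    ordered = sorted(values)  # the only sort; both summaries reuse it
    full = _summary(ordered)
    if not exclude_zeros:
        return full, None
    non_zero = [v for v in ordered if v != 0]  # filtering keeps the sorted order
    no_zero_summary = _summary(non_zero) if non_zero else None
    return full, no_zero_summary


# Made-up data: mostly zeros, a few typical values, one extreme value.
data = [0.0] * 30 + [4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 60.0]
box_plot, no_zero_box_plot = box_plot_pair(data, exclude_zeros=True)
```

Returning `None` for the second element when zeros are not excluded keeps the return shape fixed. Whether that is preferable to a single return value with an optional nested entry is one of the open design questions for this idea.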