
Add `extract_plot_data()` and fill value to `autoplot()` for type = "mosaic"

Open joeycouse opened this issue 3 years ago • 4 comments

Resolves part of #240

Just putting this out there to get your thoughts on the interface. I've implemented an extract_plot_data() function for the confusion matrix class which returns a list with the relevant plot data.

This would need to be extended to the other autoplot() use cases (e.g. roc_curve). If you think this is worth merging, I can put together the other methods. I just wanted to get y'all's take before putting more effort into this. Thanks!

[Image: new autoplot(type = 'mosaic') output, with the correct-prediction boxes filled in light blue]

library(tidyverse)
library(yardstick)
#> For binary classification, the first factor level is assumed to be the event.
#> Use the argument `event_level = "second"` to alter this as needed.
#> 
#> Attaching package: 'yardstick'
#> The following object is masked from 'package:readr':
#> 
#>     spec

hpc_cv %>%
  conf_mat(obs, pred) %>%
  extract_plot_data(type = 'mosaic')
#> $data
#>    pred_type        ymin        ymax     xmin     xmax
#> 1    correct  0.00000000 -0.91577162    0.000 1769.000
#> 2  incorrect -0.92577162 -1.00547767    0.000 1769.000
#> 3  incorrect -1.01547767 -1.01886942    0.000 1769.000
#> 4  incorrect -1.02886942 -1.03000000    0.000 1769.000
#> 5  incorrect  0.00000000 -0.34415584 1786.335 2864.335
#> 6    correct -0.35415584 -0.95434137 1786.335 2864.335
#> 7  incorrect -0.96434137 -0.98660482 1786.335 2864.335
#> 8  incorrect -0.99660482 -1.03000000 1786.335 2864.335
#> 9  incorrect  0.00000000 -0.15533981 2881.670 3293.670
#> 10 incorrect -0.16533981 -0.69689320 2881.670 3293.670
#> 11   correct -0.70689320 -0.89864078 2881.670 3293.670
#> 12 incorrect -0.90864078 -1.03000000 2881.670 3293.670
#> 13 incorrect  0.00000000 -0.04326923 3311.005 3519.005
#> 14 incorrect -0.05326923 -0.34173077 3311.005 3519.005
#> 15 incorrect -0.35173077 -0.48634615 3311.005 3519.005
#> 16   correct -0.49634615 -1.03000000 3311.005 3519.005
#> 
#> $x_breaks
#>       VF        F        M        L 
#>  884.500 2325.335 3087.670 3415.005 
#> 
#> $y_breaks
#> [1] -0.4578858 -0.9656246 -1.0171735 -1.0294347
#> 
#> $tick_labels
#> [1] "VF" "F"  "M"  "L" 
#> 
#> $axis_labels
#> $axis_labels$y
#> [1] "Prediction"
#> 
#> $axis_labels$x
#> [1] "Truth"

Created on 2021-12-10 by the reprex package (v2.0.1)

joeycouse, Dec 10 '21 19:12

Is there an existing generic we should use for this, rather than making a new one? fortify() comes to mind, although its docs say not to use it.

juliasilge, Dec 14 '21 17:12

The only function I'm aware of that does something similar is ggplot2::ggplot_build(), which accepts a plot object and returns the plot data as data frames, one per layer. However, I don't think the returned data frame is in a format that would adequately address #248: it carries rendered aesthetics (fill colours, group ids) rather than labels such as the prediction type.

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

data("two_class_example")

two_class_example %>%
  conf_mat(truth, predicted) %>%
  autoplot() %>%
  ggplot_build() %>%
  pluck(1)
#> [[1]]
#>      fill  xmin  xmax       ymin       ymax PANEL group colour size linetype
#> 1 #4f58bd   0.0 258.0  0.0000000 -0.8798450     1     1     NA  0.5        1
#> 2  grey70   0.0 258.0 -0.8898450 -1.0100000     1     2     NA  0.5        1
#> 3  grey70 260.5 502.5  0.0000000 -0.2066116     1     2     NA  0.5        1
#> 4 #4f58bd 260.5 502.5 -0.2166116 -1.0100000     1     1     NA  0.5        1
#>   alpha
#> 1   0.9
#> 2   0.9
#> 3   0.9
#> 4   0.9

Created on 2021-12-17 by the reprex package (v2.0.1)
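To illustrate the gap, here is a rough sketch of the post-processing a user would need before the ggplot_build() output resembles the extract_plot_data() result. The rows are copied from the output above; the mapping from fill colour to pred_type is my assumption (the highlighted fill sits on the diagonal cells), not something ggplot_build() provides.

```r
# Layer data copied from the ggplot_build() output above (two_class_example).
built <- data.frame(
  fill = c("#4f58bd", "grey70", "grey70", "#4f58bd"),
  xmin = c(0, 0, 260.5, 260.5),
  xmax = c(258, 258, 502.5, 502.5),
  ymin = c(0, -0.8898450, 0, -0.2166116),
  ymax = c(-0.8798450, -1.01, -0.2066116, -1.01),
  stringsAsFactors = FALSE
)

# Assumption: the highlighted fill marks the diagonal (correct) cells.
built$pred_type <- ifelse(built$fill == "#4f58bd", "correct", "incorrect")

# Keep only the columns a custom mosaic plot would need.
plot_data <- built[, c("pred_type", "xmin", "xmax", "ymin", "ymax")]
plot_data
```

This works only because the fill colours happen to be known; the axis breaks and tick labels would still have to be dug out of other parts of the built object.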

joeycouse, Dec 17 '21 19:12

I don't think that it is a good idea to make a new generic. The tidy() method should translate the object to a tabular data structure that can be used as the substrate for the autoplot() method. The tidy() method for the confusion matrix is maybe not the best (I think that I wrote it) and can be improved to make some of your (and Julia's) code more concise.
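As a rough illustration of the kind of tabular substrate meant here, the sketch below turns a confusion matrix into one row per cell using only base R. The truth/prediction vectors are made-up toy data, and the column layout is just one possibility, not the current or planned tidy() output. I am assuming that for a yardstick conf_mat object `cm`, `cm$table` holds the equivalent table.

```r
# Toy truth/prediction vectors; hypothetical data for illustration only.
truth <- factor(c("a", "a", "a", "b", "b", "b", "b", "a"), levels = c("a", "b"))
pred  <- factor(c("a", "a", "b", "b", "b", "a", "b", "a"), levels = c("a", "b"))

# Base R cross-tabulation, using the Prediction/Truth dimension names.
cm_table <- table(Prediction = pred, Truth = truth)

# One row per cell, with columns Prediction, Truth, and Freq.
cells <- as.data.frame(cm_table, stringsAsFactors = FALSE)

# Flag the diagonal cells so a plot can treat correct predictions differently.
cells$correct <- cells$Prediction == cells$Truth
cells
```

A data frame shaped like this maps directly onto a tile geometry: Truth on x, Prediction on y, and Freq as the fill.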

That would be enough to get data for the heatmap but the mosaic plot would need some additional, non-tabular data. So, I propose:

  1. I'll update the PR to improve the tidy() method.

  2. @joeycouse can take their work on cm_mosaic_data() and turn it into a function that we can export to facilitate custom mosaic plots.

I don't think that we need cm_heat_data() nor do we need to export get_axis_labels().
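For reference, a minimal sketch of what an exported mosaic-data helper might compute. The name mosaic_rects() is hypothetical and this is not the PR's cm_mosaic_data(); the gap constants (0.005 of the total count between columns, 0.01 between stacked boxes) are inferred from the printed outputs earlier in this thread, and the cell counts are back-calculated from the two_class_example output above.

```r
# Hypothetical helper: a sketch of the idea, not the PR's implementation.
mosaic_rects <- function(tab, x_gap = 0.005, y_gap = 0.01) {
  n <- sum(tab)
  col_totals <- unname(colSums(tab))   # one column of boxes per Truth level
  k <- nrow(tab)
  out <- vector("list", ncol(tab))
  x_start <- 0
  for (j in seq_len(ncol(tab))) {
    props <- unname(tab[, j]) / col_totals[j]   # within-column proportions
    # Boxes stack downward from 0, separated by y_gap.
    ymin <- -(cumsum(c(0, props[-k])) + y_gap * (seq_len(k) - 1))
    ymax <- ymin - props
    out[[j]] <- data.frame(
      Prediction = rownames(tab),
      Truth = colnames(tab)[j],
      xmin = x_start,
      xmax = x_start + col_totals[j],
      ymin = ymin,
      ymax = ymax,
      stringsAsFactors = FALSE
    )
    # Column width is the Truth count; the gap scales with the total count.
    x_start <- x_start + col_totals[j] + x_gap * n
  }
  rects <- do.call(rbind, out)
  rects$pred_type <- ifelse(rects$Prediction == rects$Truth, "correct", "incorrect")
  rects
}

# Counts consistent with the two_class_example output shown above.
tab <- matrix(
  c(227, 31, 50, 192),
  nrow = 2,
  dimnames = list(
    Prediction = c("Class1", "Class2"),
    Truth = c("Class1", "Class2")
  )
)
rects <- mosaic_rects(tab)
rects
```

The non-tabular pieces (axis breaks, tick labels, axis labels) could be returned alongside this data frame in a list, which is why tidy() alone does not cover the mosaic case.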

@DavisVaughan and @juliasilge how does that sound?

topepo, Dec 21 '21 01:12

I think this sounds like a good way to go 👍

juliasilge, Dec 21 '21 15:12