pROC

case weights

Open topepo opened this issue 4 years ago • 4 comments

It would be great to have the calculations for the curve take into account case weights (i.e. a non-negative numeric vector the same length as the other data objects).

topepo, Jul 27 '21 16:07

I agree this would be cool. Do you have a reference on how this is implemented in the context of ROC curves?

xrobin, Aug 02 '21 14:08

The curve would be based on the weighted versions of sensitivity and specificity.

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

data(pathology)
str(pathology)
#> 'data.frame':    344 obs. of  2 variables:
#>  $ pathology: Factor w/ 2 levels "abnorm","norm": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ scan     : Factor w/ 2 levels "abnorm","norm": 1 1 1 1 1 1 1 1 1 1 ...

set.seed(1)
pathology$weights <- runif(nrow(pathology))

event <- "abnorm"

unweighted <- 
  sum(pathology$pathology == event & pathology$scan == event) /
  sum(pathology$pathology == event)
unweighted
#> [1] 0.8953488

# via yardstick:
sensitivity(pathology, pathology, scan)
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 sens    binary         0.895

weighted <- 
  sum( pathology$weights * (pathology$pathology == event & pathology$scan == event) ) /
  sum( pathology$weights * (pathology$pathology == event) )

weighted
#> [1] 0.9013333

Created on 2021-09-13 by the reprex package (v2.0.0)
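
The same weighting carries over to specificity. As a minimal base R sketch of both weighted metrics (the helper names `weighted_sens` and `weighted_spec` and the toy data are made up for illustration and are not part of yardstick or pROC):

```r
# Hypothetical helpers, base R only. `w` is a non-negative numeric vector
# with the same length as `truth` and `estimate`.
weighted_sens <- function(truth, estimate, w, event) {
  sum(w * (truth == event & estimate == event)) / sum(w * (truth == event))
}

weighted_spec <- function(truth, estimate, w, event) {
  sum(w * (truth != event & estimate != event)) / sum(w * (truth != event))
}

# Toy data (invented for this example)
truth    <- c("abnorm", "abnorm", "abnorm", "norm", "norm")
estimate <- c("abnorm", "norm",   "abnorm", "norm", "abnorm")
w        <- c(2, 1, 1, 3, 1)

weighted_sens(truth, estimate, w, "abnorm")  # (2 + 1) / (2 + 1 + 1) = 0.75
weighted_spec(truth, estimate, w, "abnorm")  # 3 / (3 + 1) = 0.75
```

With all weights equal to 1, both helpers reduce to the usual unweighted sensitivity and specificity.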

@davisvaughan has the start of changes that we will be making to yardstick here

topepo, Sep 13 '21 19:09

I think I see. The easiest approach would be to update roc.utils.perfs.all.fast directly so that it calculates TP/FP taking the weights into account (in R, * binds more tightly than ==, hence the parentheses):

  tp <- cumsum((response.sorted == 1) * weights.sorted)
  fp <- cumsum((response.sorted == 0) * weights.sorted)
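
To make the idea concrete, here is a standalone sketch of the weighted cumulative sums turned into sensitivity and specificity at each threshold. It is not pROC's internal code: the function name `weighted_roc_points` and the toy data are hypothetical, and it assumes that higher predictor values indicate cases (response == 1) and that there are no tied predictor values.

```r
# Hypothetical standalone function, not pROC's roc.utils.perfs.all.fast.
# Assumes higher predictor values indicate cases and no tied predictors.
weighted_roc_points <- function(predictor, response, weights) {
  ord <- order(predictor, decreasing = TRUE)
  response.sorted <- response[ord]
  weights.sorted  <- weights[ord]

  # Weighted true/false positive counts when everything at or above
  # each sorted predictor value is called positive.
  tp <- cumsum((response.sorted == 1) * weights.sorted)
  fp <- cumsum((response.sorted == 0) * weights.sorted)

  # Weighted totals replace the integer counts and may be fractional.
  ncases    <- sum(weights[response == 1])
  ncontrols <- sum(weights[response == 0])

  data.frame(
    threshold   = predictor[ord],
    sensitivity = tp / ncases,
    specificity = (ncontrols - fp) / ncontrols
  )
}

# Toy data (invented for this example)
predictor <- c(0.9, 0.8, 0.6, 0.4, 0.2)
response  <- c(1,   0,   1,   1,   0)
weights   <- c(2,   1,   1,   1,   3)

pts <- weighted_roc_points(predictor, response, weights)
pts
```

Passing a vector of ones as the weights gives the ordinary unweighted curve, which is a convenient regression check for any real implementation.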

A few thoughts on the implementation:

  • The number of cases and controls might become fractional because of this change. I'm not sure what side-effects this could have.
  • There's a C++ algorithm that will need to be updated too. It's a loop so it should be quite straightforward. Alternatively it could be a good time to get rid of alternative algorithms and simplify the code.
  • It will be necessary to modify the roc objects and store the weights there, so that bootstrap functions re-use the weights appropriately.
  • At this point I'm not sure how many changes will be required in those bootstrapping functions. They've needed major refactoring for a long time but I never found the time to do so.
  • Issue #70 will get in the way. There's quite a lot of redundancy as pROC has several functions that build ROC curves under the hood (e.g. auc, ci, etc.), which will have to be updated.
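
As a rough illustration of the point about storing the weights, here is a sketch in which the weights live in the object and travel with each resampled observation. Everything in it is an assumption made for illustration: `make_weighted_roc`, `weighted_auc` and `boot_replicate` are hypothetical names, the object is a plain list rather than a real pROC roc object, and the weighted AUC is defined here as the weighted share of correctly ranked (case, control) pairs, with ties counting one half.

```r
# Hypothetical structure; real pROC roc objects and bootstrap code differ.
make_weighted_roc <- function(predictor, response, weights) {
  stopifnot(length(predictor) == length(response),
            length(weights) == length(response),
            all(weights >= 0))
  list(predictor = predictor, response = response, weights = weights)
}

# Weighted AUC (assumed definition): each (case, control) pair is weighted
# by the product of the two case weights.
weighted_auc <- function(obj) {
  is.case  <- obj$response == 1
  cases    <- obj$predictor[is.case]
  controls <- obj$predictor[!is.case]
  pair.w   <- outer(obj$weights[is.case], obj$weights[!is.case])
  score    <- outer(cases, controls, ">") + 0.5 * outer(cases, controls, "==")
  sum(pair.w * score) / sum(pair.w)
}

# One stratified bootstrap replicate: resample indices within cases and
# within controls, so each weight stays attached to its observation.
boot_replicate <- function(obj) {
  idx.cases    <- which(obj$response == 1)
  idx.controls <- which(obj$response == 0)
  idx <- c(idx.cases[sample.int(length(idx.cases), replace = TRUE)],
           idx.controls[sample.int(length(idx.controls), replace = TRUE)])
  make_weighted_roc(obj$predictor[idx], obj$response[idx], obj$weights[idx])
}

# Toy data (invented for this example)
obj <- make_weighted_roc(
  predictor = c(0.9, 0.8, 0.6, 0.4, 0.2),
  response  = c(1,   0,   1,   1,   0),
  weights   = c(2,   1,   1,   1,   3)
)

weighted_auc(obj)  # 14 / 16 = 0.875

set.seed(1)
boot.aucs <- replicate(200, weighted_auc(boot_replicate(obj)))
```

Resampling indices keeps the number of observations per class fixed while the weighted totals of cases and controls vary between replicates, which is one place where the fractional counts mentioned above would surface.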

xrobin, Sep 15 '21 08:09