ci.cvAUC needs 0.5 for ties

Open sgruber65 opened this issue 3 years ago • 5 comments

Hi Erin, The ROCR package's calculation of the AUC assigns 0.5 points for a tie. I was looking at your code for calculating the CIs, and saw that it ignores that possibility. Although people argue over strategies for dealing with ties, since the code is estimating the variance of the cv-AUC, as calculated by the ROCR package, it ought to respect the underlying calculation of the AUC.

    DT[, `:=`(icVal, ifelse(label == pos,
                            w1 * (fracNegLabelsWithSmallerPreds - auc),
                            w0 * (fracPosLabelsWithLargerPreds - auc)))]

For some positive observation, i, this line will assign w1 * 1 to each negLabel earlier in the ordering, when for some subset of those it should possibly be w1 * 0.5. Also, there may be one or more negLabel observations immediately after i in the ordering that should be counted as 0.5, instead of 0. (Of course, similar logic applies to the negative label calculations.)

--Susan Gruber

sgruber65 avatar Sep 09 '20 01:09 sgruber65
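
To make the tie convention above concrete, here is a minimal base-R sketch that scores every (positive, negative) pair and gives a tie 0.5 points. The helper `auc_pairwise` and its `tie_credit` argument are hypothetical names used only for illustration; they are not part of cvAUC or ROCR, and the data are made up.

```r
# Hypothetical helper (not from cvAUC or ROCR): AUC as the mean pairwise score
# over all (positive, negative) pairs, with configurable credit for ties.
auc_pairwise <- function(preds, labels, pos = 1, tie_credit = 0.5) {
  p <- preds[labels == pos]   # predictions for positive observations
  n <- preds[labels != pos]   # predictions for negative observations
  wins <- outer(p, n, ">")    # TRUE where the positive outranks the negative
  ties <- outer(p, n, "==")   # TRUE where the two predictions are equal
  mean(wins + tie_credit * ties)
}

# Toy data with one tied (positive, negative) pair at 0.5.
preds  <- c(0.2, 0.5, 0.5, 0.8)
labels <- c(0,   0,   1,   1)

auc_half <- auc_pairwise(preds, labels)                  # tie scored 0.5
auc_zero <- auc_pairwise(preds, labels, tie_credit = 0)  # tie scored 0
```

With a single tied pair out of four, the two conventions differ by 0.5 / 4 = 0.125, which is the kind of gap the influence-curve calculation needs to stay consistent with.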

Thanks, @sgruber65, for pointing this out. Did you have a specific code fix in mind to resolve this?

Is there an easy way to identify which rows, i, should be w1 * 0.5 instead of w1 * 1.0? If so, then perhaps we can add a line of code right after the one above, which corrects the weights. It's been a long time since I wrote this code, so it would take me a while to get familiar with it again, in order to dig in deeper.

ledell avatar Jan 18 '21 06:01 ledell

Hi Erin, The AUC calculation returned by the call to ROCR is correct — the only problem is the IC. It captures the formula in the 2015 paper, but that isn’t correct.

Here’s the IC function inside of the cvAUC function (v1.1.0 of the cvAUC package from CRAN)

    .IC <- function(fold_preds, fold_labels, pos, neg, w1, w0) {
      n_rows <- length(fold_labels)
      n_pos <- sum(fold_labels == pos)
      n_neg <- n_rows - n_pos
      auc <- AUC(fold_preds, fold_labels)
      DT <- data.table(pred = fold_preds, label = fold_labels)
      DT <- DT[order(pred, -xtfrm(label))]
      DT[, `:=`(fracNegLabelsWithSmallerPreds, cumsum(label == neg)/n_neg)]
      DT <- DT[order(-pred, label)]
      DT[, `:=`(fracPosLabelsWithLargerPreds, cumsum(label == pos)/n_pos)]
      DT[, `:=`(icVal, ifelse(label == pos,
                              w1 * (fracNegLabelsWithSmallerPreds - auc),
                              w0 * (fracPosLabelsWithLargerPreds - auc)))]
      return(mean(DT$icVal^2))
    }

We want to add 0.5 points for ties. Also notice that when there are ties, ordering the observations and using cumsum won’t work, since some negative observations with the same predicted value might be ranked both before and after positive observations with that value.
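
As a small illustration of the ordering problem, the base-R sketch below reproduces the first sort from `.IC` on made-up data and compares the cumsum-based fraction for a tied positive with a direct tie-aware count. The toy values and variable names are assumptions for illustration only.

```r
# Toy data (made up): one positive and one negative share the prediction 0.5.
preds  <- c(0.2, 0.5, 0.5, 0.8)
labels <- c(0,   0,   1,   1)
pos <- 1
neg <- 0

# Same ordering as the first sort in .IC: ascending prediction,
# with positives placed ahead of negatives inside a tie.
ord <- order(preds, -xtfrm(labels))
frac_neg_smaller <- cumsum(labels[ord] == neg) / sum(labels == neg)

# Value the cumsum approach assigns to the tied positive (original index 3).
cumsum_value <- frac_neg_smaller[which(ord == 3)]

# Direct count for the same positive, giving the tied negative 0.5 credit.
neg_preds <- preds[labels == neg]
tie_aware_value <- mean((preds[3] > neg_preds) + 0.5 * (preds[3] == neg_preds))
```

Here the cumsum-based value is 0.5 (the tied negative contributes nothing), while the tie-aware value is 0.75.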

Here’s a version that works. Nothing else has to change.

    .ICv2 <- function(fold_preds, fold_labels, pos, neg, w1, w0) {
      n_rows <- length(fold_labels)
      n_pos <- sum(fold_labels == pos)
      n_neg <- n_rows - n_pos
      pos_rows <- fold_labels == pos
      neg_rows <- fold_labels == neg
      auc <- AUC(fold_preds, fold_labels)
      DT <- data.table(pred = fold_preds, label = fold_labels)
      # Positives: 1 point per negative with a smaller prediction, 0.5 per tie.
      # The comparison is parenthesized because `+` binds tighter than `>` in R.
      DT[pos_rows, `:=`(icVal, apply(DT[pos_rows,], 1, function(x){
        sum((x["pred"] > DT[neg_rows, pred]) + .5*(x["pred"] == DT[neg_rows, pred]))})/n_neg * w1 - auc*w1)]
      # Negatives: 1 point per positive with a larger prediction, 0.5 per tie.
      DT[neg_rows, `:=`(icVal, apply(DT[neg_rows,], 1, function(x){
        sum((x["pred"] < DT[pos_rows, pred]) + .5*(x["pred"] == DT[pos_rows, pred]))})/n_pos * w0 - auc*w0)]
      return(mean(DT$icVal^2))
    }

—Susan

sgruber65 avatar Jan 19 '21 20:01 sgruber65
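
For readers who want to check the tie-aware calculation without data.table or the package's `AUC` function, the following is a minimal base-R sketch of the same idea using a pairwise score matrix. The name `ic_ties` is hypothetical, the AUC is computed pairwise rather than through ROCR, and the weights in the example are illustrative inverse class proportions, which may not match what cvAUC actually passes in.

```r
# Hypothetical base-R sketch of the tie-aware influence-curve term;
# ic_ties() is an illustrative name, not a cvAUC function.
ic_ties <- function(preds, labels, pos, w1, w0) {
  is_pos <- labels == pos
  p <- preds[is_pos]    # predictions for positives
  n <- preds[!is_pos]   # predictions for negatives
  # score[i, j] is 1 if positive i outranks negative j, 0.5 if tied, else 0.
  score <- outer(p, n, ">") + 0.5 * outer(p, n, "==")
  auc <- mean(score)    # pairwise AUC with 0.5 credit for ties
  ic <- numeric(length(preds))
  ic[is_pos]  <- w1 * (rowMeans(score) - auc)  # one value per positive
  ic[!is_pos] <- w0 * (colMeans(score) - auc)  # one value per negative
  mean(ic^2)
}

# Toy data (made up) with one tied pair; weights assumed to be n / n_class.
preds  <- c(0.2, 0.5, 0.5, 0.8)
labels <- c(0,   0,   1,   1)
ic_value <- ic_ties(preds, labels, pos = 1, w1 = 4 / 2, w0 = 4 / 2)
```

Building the full score matrix costs memory proportional to n_pos times n_neg, so this form is meant for checking small folds rather than as a drop-in replacement.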

Hi Erin, when you get a chance, can you upload a new version to CRAN that uses the .IC function I defined earlier in this thread?

Thanks, Susan

sgruber65 avatar Mar 04 '21 17:03 sgruber65

Hi @sgruber65, I am sorry for the delay on this -- I was locked out of my berkeley.edu email, so I had to sort that out before being able to update the package (since this package uses my old email and you can't update a package w/o access).

Thank you for providing the code! I think I can use the same code for the pooled version, as well.

I have opened a PR here with some remaining tasks noted: https://github.com/ledell/cvAUC/pull/11

ledell avatar Apr 27 '21 22:04 ledell

Thanks, Erin. And I agree, this should be the same for the pooled version.

sgruber65 avatar May 06 '21 18:05 sgruber65