dask-ml Area under the receiving operating characteristic curve (AUROC) calculation.

A common metric for classification model performance is the area under the receiving operating characteristic curve (AUROC, AUROCC, or often simply AUC) which shows the area of the curve created by plotting TPR and FPR against one another. Wikipedia.

This metric is available in sklearn, but is currently missing from Dask-ml. One issue with computing this metric in Dask is that the standard implementation of AUROC requires searching over a sorted array and the sorting operations are often expensive in distributed computing.

I would be interested in a discussion of whether a naive version of this could be implemented using sorting and whether there are any known methods of computation that avoid the sort.

Sep 05 '23 21:09 stephenpardy

I’m not aware of a distributed-friendly way to calculate that metric, but happy to take a PR adding one if that exists. On Sep 5, 2023, at 4:58 PM, Stephen Pardy @.***> wrote:[dask/dask-ml] Area under the receiving operating characteristic curve (AUROC) calculation. (Issue #977)A common metric for classification model performance is the area under the receiving operating characteristic curve (AUROC, AUROCC, or often simply AUC) which shows the area of the curve created by plotting TPR and FPR against one another.Wikipedia.This metric is available in sklearn, but is currently missing from Dask-ml. One issue with computing this metric in Dask is that the standard implementation of AUROC requires searching over a sorted array and the sorting operations are often expensive in distributed computing.I would be interested in a discussion of whether a naive version of this could be implemented using sorting and whether there are any known methods of computation that avoid the sort.—Reply to this email directly, view it on GitHub or unsubscribe.You are receiving this email because you are subscribed to this thread.Triage notifications on the go with GitHub Mobile for iOS or Android.

Sep 24 '23 15:09 TomAugspurger

I don't know enough about this specific ML metric, but maybe if we can translate the problem to something more general then I can be of use.

For example, If someone had a distribution of values spread in an unsorted way across many machines and they wanted to plot an approximate sorted distribution of those points then there are a couple things they could do:

The simplest would be to just sample. I suspect that it would be easy to get very good accuracy by sampling.
But if for some reason they wanted to have all points influence the result then my recommendation would be to calculate a lot of approximate quantiles, for which there are good distributed algorithms. I suspect once one had, say, 100 percentiles (0%, 1%, 2%, ..., 99%, 100%) that one could take those values and reconstruct a distribution curve that was pretty close to full accuracy.

Depending on the exact curve necessary one might need to invert the delta between neighboring percentiles or something similar. This might get mildly complex, but is the sort of thing one can probably do with ten minutes of thought and, importantly, without distributed systems experience.

I may be entirely on the wrong track through. @stephenpardy can you help by reducing your problem a little bit to general numbers / computational terms? I suspect that it'll be easier for folks to help in that case.

Oct 18 '23 20:10 mrocklin

dask-ml dask-ml copied to clipboard

Area under the receiving operating characteristic curve (AUROC) calculation.

dask-ml
dask-ml copied to clipboard