
Evaluation Script util_out.py discrepancy

Open · lijuncheng16 opened this issue 2 years ago · 5 comments

https://github.com/lijuncheng16/AudioTaggingDoneRight/blob/b8ec7f509b2a4909777e7572445a0f6233392ce1/src/utilities/stats.py

The AUC calculation returns a different result than the sklearn metrics package. Very interesting, but I haven't figured out yet which step is different.

lijuncheng16 · May 13 '22 19:05

Do you have an example test case where the two scripts yield different results?

MaigoAkisame · May 13 '22 21:05

Thank you Yun for your quick reply! Attached are two pickle files, pred and truth, both of shape (20123, 527).

lijuncheng16 · May 13 '22 21:05

Your eval script's AUC and d-prime come out at about 50% of what the sklearn-based script below gives:

import numpy as np
from scipy import stats
from sklearn import metrics
import torch

def d_prime(auc):
    standard_normal = stats.norm()
    d_prime = standard_normal.ppf(auc) * np.sqrt(2.0)
    return d_prime

def calculate_stats(output, target):
    """Calculate statistics including mAP, AUC, etc.
    Args:
      output: 2d array, (samples_num, classes_num)
      target: 2d array, (samples_num, classes_num)
    Returns:
      stats: list of statistic of each class.
    """

    classes_num = target.shape[-1]
    stats = []

    # Accuracy, only meaningful for single-label classification (e.g. ESC-50),
    # not for multi-label datasets such as AudioSet
    acc = metrics.accuracy_score(np.argmax(target, 1), np.argmax(output, 1))

    # Class-wise statistics
    for k in range(classes_num):

        # Average precision
        avg_precision = metrics.average_precision_score(
            target[:, k], output[:, k], average=None)

        # AUC
        auc = metrics.roc_auc_score(target[:, k], output[:, k], average=None)

        # Precisions, recalls
        (precisions, recalls, thresholds) = metrics.precision_recall_curve(
            target[:, k], output[:, k])

        # FPR, TPR
        (fpr, tpr, thresholds) = metrics.roc_curve(target[:, k], output[:, k])

        save_every_steps = 1000     # Sample statistics to reduce size
        stats_dict = {'precisions': precisions[0::save_every_steps],
                      'recalls': recalls[0::save_every_steps],
                      'AP': avg_precision,
                      'fpr': fpr[0::save_every_steps],
                      'fnr': 1. - tpr[0::save_every_steps],
                      'auc': auc,
                      # note: acc is not class-wise; it is repeated here only to
                      # keep the format consistent with the other metrics
                      'acc': acc
                      }
        stats.append(stats_dict)

    return stats
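
For reference, this is roughly how I summarize the class-wise stats into the single AUC and d-prime numbers I'm comparing (the pickle filenames below are just placeholders for the attached files):

import pickle
import numpy as np

# Placeholder filenames for the two attached arrays, both of shape (20123, 527)
with open('pred.pkl', 'rb') as f:
    pred = pickle.load(f)
with open('truth.pkl', 'rb') as f:
    truth = pickle.load(f)

stats = calculate_stats(pred, truth)

# Average the class-wise metrics, as is usually done for AudioSet
mean_ap = np.mean([s['AP'] for s in stats])
mean_auc = np.mean([s['auc'] for s in stats])
print('mAP:', mean_ap)
print('mean AUC:', mean_auc, 'd-prime:', d_prime(mean_auc))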

lijuncheng16 · May 13 '22 21:05

I've found the cause: the pred array you passed in has the float16 dtype, and my roc function isn't robust against that. Below is what happens in detail:

def roc(pred, truth):  # here, pred is float16 and truth is bool
    data = numpy.array(sorted(zip(pred, truth), reverse = True))  # here, data is float16
    pred, truth = data[:,0], data[:,1]  # here, both pred and truth become float16

    # Now we're computing the cumsum on a float16 array.
    # float16 doesn't have enough precision to represent 2049:
    # when 1 is added to 2048, the result is still 2048.
    # Therefore the cumsum is capped at 2048, which messes up everything from here on.
    TP = truth.cumsum()
    FP = (1 - truth).cumsum()

    mask = numpy.concatenate([numpy.diff(pred) < 0, numpy.array([True])])
    TP = numpy.concatenate([numpy.array([0]), TP[mask]])
    FP = numpy.concatenate([numpy.array([0]), FP[mask]])
    return TP, FP
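
You can see this float16 behavior directly (just an illustration, not part of the script):

import numpy as np

print(np.float16(2048) + np.float16(1))              # 2048.0: float16 can't represent 2049
print(np.ones(3000, dtype=np.float16).cumsum()[-1])  # 2048.0: the cumsum saturates at 2048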

I've modified the second line of the function to convert truth to bool, and this fixes the problem.
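
Concretely, the fixed function now looks roughly like this (a sketch; the astype(bool) cast on the second line is the change):

def roc(pred, truth):
    data = numpy.array(sorted(zip(pred, truth), reverse = True))
    # cast truth back to bool so the cumsums below run in integer precision,
    # regardless of the dtype of pred
    pred, truth = data[:,0], data[:,1].astype(bool)

    TP = truth.cumsum()
    FP = (1 - truth).cumsum()

    mask = numpy.concatenate([numpy.diff(pred) < 0, numpy.array([True])])
    TP = numpy.concatenate([numpy.array([0]), TP[mask]])
    FP = numpy.concatenate([numpy.array([0]), FP[mask]])
    return TP, FP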

MaigoAkisame · May 14 '22 00:05