
Inconsistency in calculation of precision and recall

Open · superqd opened this issue 3 years ago · 1 comment

Describe the bug
In a binary classification setup, with labels 0 and 1, I run eval_model after training and get back the results, which include true positives, false negatives, and false positives. If I use these to calculate precision and recall I get one set of numbers, but if I then calculate the same values with sklearn I get a different result. What is the result of eval_model based on, and how do I indicate which label is the positive class?

To Reproduce

    import numpy
    from sklearn.metrics import precision_score, recall_score
    from simpletransformers.classification import ClassificationModel

    # best_model_folder, USE_SLIDING_WINDOW_PREDICTION, and validation
    # come from the surrounding training setup
    best_model = ClassificationModel(
        'bert',
        best_model_folder,
        use_cuda=True,
        num_labels=2,  # binary by default, just setting for clarity
        args={
            "use_multiprocessing": True,
            "do_lower_case": True,
            "sliding_window": USE_SLIDING_WINDOW_PREDICTION,
        },
    )

    result, model_outputs, wrong_predictions = best_model.eval_model(validation)

    # argmax over the raw model outputs gives the predicted label per example
    predictions = [numpy.argmax(x) for x in model_outputs]

    # per-class scores: index 0 is for label 0, index 1 is for label 1
    p_score = precision_score(validation['labels'], predictions, average=None)
    r_score = recall_score(validation['labels'], predictions, average=None)
Expected behavior
I would expect precision and recall computed from tp/fp/fn to match the output of precision_score and recall_score, but they never do. I know that recall is tp / (tp + fn), and so on, but calculating that gives a result different from recall_score. Perhaps the two use different positive class labels? How do I tell simpletransformers which label in the binary classifier is the positive class? When looking at the value of tp, I can't seem to reconcile it with anything in the data...
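For reference, here is the minimal sanity check I have in mind, assuming (as sklearn does by default) that label 1 is the positive class; result and predictions are from the snippet above:

    # precision/recall derived directly from the counts in `result`,
    # assuming label 1 is the positive class
    tp, fp, fn = result['tp'], result['fp'], result['fn']
    precision_from_counts = tp / (tp + fp)
    recall_from_counts = tp / (tp + fn)

    # sklearn equivalents with the positive class made explicit
    from sklearn.metrics import precision_score, recall_score
    p1 = precision_score(validation['labels'], predictions, pos_label=1)
    r1 = recall_score(validation['labels'], predictions, pos_label=1)
    print(precision_from_counts, p1, recall_from_counts, r1)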

Screenshots
result: {'mcc': 0.9314820172420215, 'tp': 2918, 'tn': 62, 'fp': 2, 'fn': 7, 'eval_loss': 0.028934799013644066}
p_score: [0.89285714 1.        ]
r_score: [0.78125    0.86290598]

If you calculate recall from the counts in result, you get tp / (tp + fn) = 2918 / (2918 + 7) ≈ 0.9976, which doesn't match either of the values in r_score. If you calculate precision, you get tp / (tp + fp) = 2918 / (2918 + 2) ≈ 0.9993, which likewise doesn't appear in p_score. I don't know which label is being used to count the positive classes here.
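One way to check which orientation eval_model is using would be to compare its counts against sklearn's confusion matrix, whose ravel() order for labels=[0, 1] is documented as tn, fp, fn, tp:

    from sklearn.metrics import confusion_matrix

    # with labels=[0, 1], ravel() returns tn, fp, fn, tp in that order
    tn, fp, fn, tp = confusion_matrix(
        validation['labels'], predictions, labels=[0, 1]
    ).ravel()
    print(tn, fp, fn, tp)  # compare against result['tn'], result['fp'], ...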

I can start looking through the code to find it myself, but if anyone could shed light on this, that would be great.
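If I'm reading the docs right, eval_model also accepts extra metric functions as keyword arguments, each called with the true labels first and the predictions second, so the sklearn scores could be computed inside the same evaluation call; a sketch:

    from sklearn.metrics import precision_score, recall_score

    # extra keyword arguments to eval_model are treated as metric
    # functions f(true_labels, predictions) and added to `result`
    result, model_outputs, wrong_predictions = best_model.eval_model(
        validation,
        precision=lambda y_true, y_pred: precision_score(y_true, y_pred, pos_label=1),
        recall=lambda y_true, y_pred: recall_score(y_true, y_pred, pos_label=1),
    )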

superqd avatar Feb 19 '22 02:02 superqd

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 27 '22 13:04 stale[bot]