Accuracy computation
Thank you for sharing the source code for your interesting paper!
In your paper, you report accuracy results for skin lesion classification and intracranial hemorrhage (ICH) diagnosis. I am using your code to run experiments on skin lesion diagnosis.
While doing so, I looked at your accuracy computation, and I believe there is an issue with it.
This is what I found:
In utils/metrics.py you use the class names from the skin lesion task, so I assume the code is meant for multi-class classification (exactly one correct class per example). The accuracy is computed with the compute_metrics_test function in utils/metrics.py, and I don't think that computation is suitable for multi-class classification.
An example:
- If you have 10 classes and all of your test examples belong to Class 1, while your model always predicts Class 2, your accuracy computation still reports 80%: the per-class accuracy is 0% for Class 1 and 0% for Class 2, but 100% for each of Classes 3–10, and your implementation averages these values (800/1000 = 80%). See the sketch below.
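To make the counting concrete, here is a minimal sketch of that averaging scheme (my own reconstruction with made-up shapes, not your exact compute_metrics_test code):

```python
import numpy as np

# Reconstruction of the averaged one-vs-rest accuracy described above
# (not the exact compute_metrics_test code; 100 test samples is an arbitrary choice).
n_samples, n_classes = 100, 10

y_true = np.zeros((n_samples, n_classes), dtype=int)
y_true[:, 0] = 1                                  # every test example belongs to Class 1

y_pred = np.zeros((n_samples, n_classes), dtype=int)
y_pred[:, 1] = 1                                  # the model always predicts Class 2

# Per-class binary accuracy, then averaged over the classes:
per_class_acc = (y_true == y_pred).mean(axis=0)   # [0, 0, 1, 1, ..., 1]
print(per_class_acc.mean())                       # 0.8 -> reported as "80% accuracy"

# Standard multi-class accuracy for the very same predictions:
print((y_pred.argmax(axis=1) == y_true.argmax(axis=1)).mean())   # 0.0
```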
Additionally, in validation.py you apply the Softmax function to each prediction and threshold it at 0.4: every element (i.e. class) of the prediction above 0.4 is counted as positive, otherwise as negative. This can lead to two classes being predicted for a single test example, which corresponds to a multi-label setting.
However, even if the accuracy computation is meant for multi-label classification, I think there is still a problem: because the Softmax outputs sum to 1, at most two classes can exceed the 0.4 threshold, so the model can never predict more than 2 classes per example (yet the datasets you describe have more than 2 classes). See the illustration below.
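For illustration, a small sketch of that thresholding behaviour (the logits here are made up, and the real validation.py details may differ):

```python
import torch
import torch.nn.functional as F

# Made-up logits for one test example with 5 classes, just to illustrate
# the Softmax + 0.4-threshold behaviour described above.
logits = torch.tensor([[2.0, 1.9, 0.1, -1.0, -1.0]])
probs = F.softmax(logits, dim=1)   # approx. [[0.46, 0.42, 0.07, 0.02, 0.02]]
preds = (probs > 0.4).int()
print(preds)                       # tensor([[1, 1, 0, 0, 0]]) -> two positive classes

# Because the Softmax outputs sum to 1, at most two entries can ever exceed 0.4
# (three values above 0.4 would already sum to more than 1).
```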
After replacing your accuracy computation with the standard multi-class accuracy, I get around 80–85% accuracy on the skin lesion classification task, which is comparable to the results other researchers report for skin lesion classification (e.g. https://www.sciencedirect.com/science/article/pii/S1361841521003509). It therefore seems that your accuracy implementation has influenced your reported 95% accuracy on skin lesion classification, and probably your ICH results as well.
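For reference, this is roughly the standard computation I switched to (a minimal sketch; the function name and signature are my own):

```python
import torch

def standard_accuracy(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Standard multi-class accuracy: argmax over the class scores,
    compared against the integer ground-truth labels.

    logits:  (N, C) raw model outputs (or probabilities)
    targets: (N,)   integer class labels
    """
    preds = logits.argmax(dim=1)
    return (preds == targets).float().mean().item()
```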
Could you take another look at your accuracy computation and let me know whether you think it is correct?
Thank you!