BIMCV-COVID-19

Dataset "usability" for AI

stbnps opened this issue 4 years ago · 4 comments

I performed the following experiment:

  • Downloaded datasets [1], [2] and [3]
  • Extracted PA views for control and pneumonia patients (for [2], all "pneumonia" images were used regardless of type, bacterial or viral; for [3], only "normal" and "lung opacity" patients were used)
  • Trained a convolutional network using oversampling to balance both labels and datasets (control and pneumonia images were each sampled with 50% probability, and each dataset was sampled with 1/3 probability). This prevents the network from prioritizing any one dataset or label.
  • Selected the epoch with the best "balanced" validation accuracy (the "balanced" accuracy was computed by oversampling the validation datasets following the same strategy used for the training sets)
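The sampling scheme described above can be sketched roughly as follows; the dataset names and sizes here are placeholders, not the real counts:

```python
import random

# Hypothetical (dataset, label) pool sizes -- placeholders only.
DATASETS = {
    "padchest-covid": {"control": 2000, "pneumonia": 500},
    "kaggle-pneumonia": {"control": 1300, "pneumonia": 3900},
    "rsna": {"control": 8000, "pneumonia": 6000},
}

def sample_one(rng):
    """Draw one training example: a dataset with probability 1/3, a label
    with probability 1/2, then a uniform image index within that pool."""
    dataset = rng.choice(sorted(DATASETS))
    label = rng.choice(["control", "pneumonia"])
    idx = rng.randrange(DATASETS[dataset][label])
    return dataset, label, idx
```

Over many draws, each dataset is seen about a third of the time and each label about half the time, regardless of the raw pool sizes.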

This achieved the following results:

Specificity:

  • Dataset [1]: 0.8746355685131195
  • Dataset [2]: 0.8632478632478633
  • Dataset [3]: 0.9661399548532731

Sensitivity:

  • Dataset [1]: 0.7647058823529411
  • Dataset [2]: 0.9794871794871794
  • Dataset [3]: 0.9581589958158996
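For reference, the two metrics above follow the standard confusion-matrix definitions; a minimal sketch:

```python
def specificity(tn, fp):
    """True-negative rate: fraction of actual negatives (controls)
    that the network classified as negative."""
    return tn / (tn + fp)

def sensitivity(tp, fn):
    """True-positive rate (recall): fraction of actual positives
    (pneumonia cases) that the network classified as positive."""
    return tp / (tp + fn)
```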

The issue

The network seems to perform very well on dataset [3], where each image was manually reviewed by radiologists [4]. However, it performs significantly worse on dataset [1], where most labels were extracted using NLP and the images were not manually reviewed (the set even includes completely white or completely black images [5]).

Do you think the quality of the images and annotations may be a limiting factor for the performance of the network?

References

[1] http://ceib.bioinfo.cipf.es/covid19/resized_padchest_neumo.tar.gz
[2] https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
[3] https://www.kaggle.com/c/rsna-pneumonia-detection-challenge
[4] https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/overview/acknowledgements
[5] https://github.com/BIMCV-CSUSP/BIMCV-COVID-19/tree/master/padchest-covid#iti---proposal-for-datasets

stbnps avatar Apr 21 '20 19:04 stbnps

Images that seem to be white or black still have data in them. Just normalize them to [0, 1], multiply by 255, and plot or save the result.
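A minimal sketch of that min-max rescaling with NumPy (assuming a single-channel raw array; the function name is illustrative):

```python
import numpy as np

def rescale_to_uint8(img):
    """Min-max normalize a raw (e.g. 16-bit) image to [0, 1],
    then scale to [0, 255] for display or saving."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:
        # Constant image: nothing to stretch, return all zeros.
        return np.zeros(img.shape, dtype=np.uint8)
    norm = (img - lo) / (hi - lo)
    return (norm * 255).round().astype(np.uint8)
```

Raw DICOM-derived pixel values often span a 12- or 16-bit range, so without this stretch an 8-bit viewer renders them as nearly all white or all black.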

rahools avatar Jun 11 '20 20:06 rahools

> Images that seem to be white or black have data in them. Just normalize[0 - 1], multiply it by 255, and plot it or save it.

This comment is the answer to Q1 in: BIMCV-COVID19+/FAQ.md

samils7 avatar Jun 16 '20 22:06 samils7

@rahools That's not true. Take a look at image 216840111366964013590140476722013038132133659_02-059-019.png (image attached):

You can see a white line. That white line means the image is already scaled.

@samils7 That FAQ is for BIMCV-COVID19+, not for padchest-covid

stbnps avatar Jun 17 '20 11:06 stbnps

My bad; I successfully applied normalization on BIMCV-COVID19+ and assumed the same would work for the padchest dataset too. Thanks for the insight, @stbnps.

rahools avatar Jun 17 '20 11:06 rahools