BIMCV-COVID-19
Dataset "usability" for AI
I performed the following experiment:
- Downloaded datasets [1], [2], and [3]
- Extracted PA views for control and pneumonia patients (for [2], all "pneumonia" images were used regardless of type, bacterial or viral; for [3], only "normal" or "lung opacity" patients were used)
- Trained a convolutional network using oversampling to balance both labels and datasets: control and pneumonia images were each sampled with 50% probability, and each dataset was sampled with 1/3 probability, to keep the network from prioritizing any one dataset or label (see the sketch after this list)
- Selected the epoch with the best "balanced" validation accuracy (the "balanced" accuracy was computed by oversampling the validation sets with the same strategy used for the training sets)
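The post doesn't include code, but a minimal sketch of that oversampling scheme, assuming PyTorch and three ImageFolder-style directories (the folder names, transform, and batch size below are illustrative, not the author's actual setup), could look like this:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler
from torchvision import transforms
from torchvision.datasets import ImageFolder

# Hypothetical folders, one per source [1], [2], [3]; each is assumed to
# contain "control" and "pneumonia" subfolders, so label 0 = control, 1 = pneumonia.
tfm = transforms.Compose([transforms.Grayscale(), transforms.Resize((224, 224)), transforms.ToTensor()])
ds1 = ImageFolder("padchest_pa", transform=tfm)        # dataset [1]
ds2 = ImageFolder("kaggle_pneumonia_pa", transform=tfm)  # dataset [2]
ds3 = ImageFolder("rsna_pa", transform=tfm)            # dataset [3]
combined = ConcatDataset([ds1, ds2, ds3])

dataset_ids = torch.tensor([0] * len(ds1) + [1] * len(ds2) + [2] * len(ds3))
labels = torch.cat([torch.tensor(ds.targets) for ds in (ds1, ds2, ds3)])

# Each draw should pick a dataset with probability 1/3 and a label with
# probability 1/2, so a sample's weight is inversely proportional to the
# size of its (dataset, label) cell (cells are assumed non-empty).
weights = torch.zeros(len(combined))
for d in range(3):
    for y in range(2):
        cell = (dataset_ids == d) & (labels == y)
        weights[cell] = (1 / 3) * (1 / 2) / cell.sum()

sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
loader = DataLoader(combined, batch_size=32, sampler=sampler)
```

Applying the same weight construction to the validation splits gives the "balanced" validation accuracy used for epoch selection.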
This achieved the following results:
Specificity:
- Dataset [1]: 0.8746355685131195
- Dataset [2]: 0.8632478632478633
- Dataset [3]: 0.9661399548532731
Sensitivity:
- Dataset [1]: 0.7647058823529411
- Dataset [2]: 0.9794871794871794
- Dataset [3]: 0.9581589958158996
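For reference, a small sketch of how specificity and sensitivity are derived from binary predictions (the `y_true`/`y_pred` arrays below are toy data, not the actual evaluation); with the 50/50 label oversampling described above, the balanced accuracy is simply their mean:

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP). Label 1 = pneumonia."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    return tp / (tp + fn), tn / (tn + fp)

# Toy example only.
y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1])
sens, spec = sensitivity_specificity(y_true, y_pred)
balanced_acc = (sens + spec) / 2  # equals accuracy under 50/50 label oversampling
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}, balanced accuracy={balanced_acc:.3f}")
```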
The issue
The network seems to perform very well on dataset [3], where each image was manually reviewed by radiologists [4]. However, it performs significantly worse on dataset [1], where most labels were extracted using NLP and the images were not reviewed (even leading to the inclusion of completely white or completely black images [5]).
Do you think the quality of the images and annotations may be a limiting factor for the performance of the network?
References
[1] http://ceib.bioinfo.cipf.es/covid19/resized_padchest_neumo.tar.gz
[2] https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
[3] https://www.kaggle.com/c/rsna-pneumonia-detection-challenge
[4] https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/overview/acknowledgements
[5] https://github.com/BIMCV-CSUSP/BIMCV-COVID-19/tree/master/padchest-covid#iti---proposal-for-datasets
Images that seem to be white or black still have data in them. Just normalize them to [0, 1], multiply by 255, and plot or save the result.
This comment is the answer to Q1 in: BIMCV-COVID19+/FAQ.md
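A minimal sketch of that normalization, assuming a 16-bit grayscale PNG and using Pillow/NumPy (the file names are placeholders):

```python
import numpy as np
from PIL import Image

# "example.png" stands in for one of the 16-bit grayscale PNGs; viewed
# directly as 8-bit data, such files look almost completely white or black.
img = np.array(Image.open("example.png")).astype(np.float64)

# Normalize to [0, 1], then rescale to [0, 255] so standard viewers show it.
img = (img - img.min()) / (img.max() - img.min())
img_8bit = (img * 255).astype(np.uint8)

Image.fromarray(img_8bit).save("example_normalized.png")
```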
@rahools That's not true. Take a look at image 216840111366964013590140476722013038132133659_02-059-019.png:
You can see a white line. That white line means that the image is already scaled.
@samils7 That FAQ is for BIMCV-COVID19+, not for padchest-covid
My bad, I successfully applied normalization on BIMCV-COVID19+, so I thought that would translate to the padchest dataset too. Thanks for the insight @stbnps