BIMCV-COVID-19

Dataset "usability" for AI

stbnps opened this issue 4 years ago · 4 comments

I performed the following experiment:

  • Downloaded datasets [1], [2] and [3]
  • Extracted PA views for control and pneumonia patients (for [2], all "pneumonia" images were used regardless of type, bacterial or viral; for [3], only "normal" and "lung opacity" patients were used)
  • Trained a convolutional network using oversampling to balance both labels and datasets (control and pneumonia images were each sampled with 50% probability, and each dataset was sampled with 1/3 probability). This prevents the network from prioritizing any one dataset or label.
  • Selected the epoch with the best "balanced" validation accuracy (the "balanced" accuracy was computed by oversampling the validation datasets following the same strategy used for the training sets)
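The sampling scheme described above can be sketched roughly as follows; the dataset names and sizes here are placeholders, not the real counts:

```python
import random

# Hypothetical (dataset, label) pool sizes -- placeholders only.
DATASETS = {
    "padchest-covid": {"control": 2000, "pneumonia": 500},
    "kaggle-pneumonia": {"control": 1300, "pneumonia": 3900},
    "rsna": {"control": 8000, "pneumonia": 6000},
}

def sample_one(rng):
    """Draw one training example: a dataset with probability 1/3, a label
    with probability 1/2, then a uniform image index within that pool."""
    dataset = rng.choice(sorted(DATASETS))
    label = rng.choice(["control", "pneumonia"])
    idx = rng.randrange(DATASETS[dataset][label])
    return dataset, label, idx
```

Over many draws, each dataset is seen about a third of the time and each label about half the time, regardless of the raw pool sizes.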

This achieved the following results:

Specificity:

  • Dataset [1]: 0.8746355685131195
  • Dataset [2]: 0.8632478632478633
  • Dataset [3]: 0.9661399548532731

Sensitivity:

  • Dataset [1]: 0.7647058823529411
  • Dataset [2]: 0.9794871794871794
  • Dataset [3]: 0.9581589958158996
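For reference, the two metrics above follow the standard confusion-matrix definitions; a minimal sketch:

```python
def specificity(tn, fp):
    """True-negative rate: fraction of actual negatives (controls)
    that the network classified as negative."""
    return tn / (tn + fp)

def sensitivity(tp, fn):
    """True-positive rate (recall): fraction of actual positives
    (pneumonia cases) that the network classified as positive."""
    return tp / (tp + fn)
```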

The issue

The network seems to perform very well on dataset [3], where each image was manually reviewed by radiologists [4]. However, it performs significantly worse on dataset [1], where most labels were extracted using NLP and the images were not manually reviewed (the set even includes completely white or completely black images [5]).

Do you think the quality of the images and annotations may be a limiting factor for the performance of the network?

References

[1] http://ceib.bioinfo.cipf.es/covid19/resized_padchest_neumo.tar.gz
[2] https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
[3] https://www.kaggle.com/c/rsna-pneumonia-detection-challenge
[4] https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/overview/acknowledgements
[5] https://github.com/BIMCV-CSUSP/BIMCV-COVID-19/tree/master/padchest-covid#iti---proposal-for-datasets

stbnps avatar Apr 21 '20 19:04 stbnps

Images that seem to be white or black still have data in them. Just normalize them to [0, 1], multiply by 255, and plot or save the result.
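A minimal sketch of that min-max rescaling with NumPy (assuming a single-channel raw array; the function name is illustrative):

```python
import numpy as np

def rescale_to_uint8(img):
    """Min-max normalize a raw (e.g. 16-bit) image to [0, 1],
    then scale to [0, 255] for display or saving."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:
        # Constant image: nothing to stretch, return all zeros.
        return np.zeros(img.shape, dtype=np.uint8)
    norm = (img - lo) / (hi - lo)
    return (norm * 255).round().astype(np.uint8)
```

Raw DICOM-derived pixel values often span a 12- or 16-bit range, so without this stretch an 8-bit viewer renders them as nearly all white or all black.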

rahools avatar Jun 11 '20 20:06 rahools

> Images that seem to be white or black have data in them. Just normalize[0 - 1], multiply it by 255, and plot it or save it.

This comment is the answer to Q1 in: BIMCV-COVID19+/FAQ.md

samils7 avatar Jun 16 '20 22:06 samils7

@rahools That's not true. Take a look at image 216840111366964013590140476722013038132133659_02-059-019.png (image attached):

You can see a white line. That white line means the image is already scaled.

@samils7 That FAQ is for BIMCV-COVID19+, not for padchest-covid

stbnps avatar Jun 17 '20 11:06 stbnps

My bad; I successfully applied normalization on BIMCV-COVID19+ and assumed the same would work for the padchest dataset too. Thanks for the insight, @stbnps.

rahools avatar Jun 17 '20 11:06 rahools