covid-chestxray-dataset

Do the red arrows on some images create a danger for data leakage?

Open bganglia opened this issue 4 years ago • 3 comments

It just occurred to me that arrows only occur on images with a positive diagnosis, so this could cause data leakage.

That might not be as much of a problem if you are using these images for differential diagnosis and already know the patient has something, but it could be an issue if this dataset is being combined with healthy images to decide whether the patient is healthy or sick.
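One way to make this concern concrete is to tabulate arrow presence against the diagnosis label. The sketch below assumes each image has been hand-labeled with two hypothetical booleans, `has_arrow` and `is_positive`; neither field exists in the dataset as described here, and the sample records are made up for illustration.

```python
from collections import Counter


def annotation_label_table(records):
    """Count (has_arrow, is_positive) pairs.

    `records` is an iterable of (has_arrow, is_positive) boolean pairs.
    Both fields are hypothetical and would need to be labeled by hand.
    """
    return Counter((bool(arrow), bool(positive)) for arrow, positive in records)


def arrow_is_leaky(table):
    """True if arrows appear on positive images and never on negative ones."""
    return table[(True, True)] > 0 and table[(True, False)] == 0


# Made-up example: one annotated positive, one plain positive, two plain negatives.
records = [(True, True), (False, True), (False, False), (False, False)]
table = annotation_label_table(records)
print(arrow_is_leaky(table))
```

If `arrow_is_leaky` returns `True` on the combined dataset, a classifier could in principle learn "arrow means sick" instead of any radiological feature.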

bganglia, Mar 18 '20 00:03

If images from patients who might be healthy are being compared to these images, the small figure labels (e.g. "A", "B") could also lead to data leakage.

bganglia, Mar 18 '20 00:03

True. This is a challenge to overcome. However, models trained on a lot of data don't suffer from this issue. Look at this example, processed using a model trained on the 100k NIH examples: [image: Screen Shot 2020-03-17 at 9 12 34 PM]

The gradient of the prediction with respect to the input shows that the model is not using the arrows to make its prediction. So it is possible that the features from those pretrained models (in the torchxrayvision library) can ignore the arrows and focus on the right features.
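The gradient check described above can be sketched in a framework-agnostic way. The code below is only an illustration, not the torchxrayvision implementation: it estimates the input gradient by central differences for any `predict` callable and reports what fraction of the absolute gradient falls inside a hand-drawn annotation mask. The names `input_gradient`, `saliency_fraction`, the toy models, and the mask are all hypothetical. In practice an autograd framework would compute the same gradient in a single backward pass.

```python
def input_gradient(predict, image, eps=1e-3):
    """Central-difference estimate of d(predict)/d(pixel) for every pixel.

    `image` is a list of rows of floats; it is perturbed in place and restored.
    """
    grad = [[0.0] * len(row) for row in image]
    for i, row in enumerate(image):
        for j, value in enumerate(row):
            row[j] = value + eps
            up = predict(image)
            row[j] = value - eps
            down = predict(image)
            row[j] = value  # restore the original pixel
            grad[i][j] = (up - down) / (2 * eps)
    return grad


def saliency_fraction(grad, mask):
    """Fraction of total |gradient| that lies on pixels where mask is truthy."""
    total = sum(abs(g) for row in grad for g in row)
    inside = sum(
        abs(g)
        for grad_row, mask_row in zip(grad, mask)
        for g, m in zip(grad_row, mask_row)
        if m
    )
    return inside / total if total else 0.0


# Toy 2x2 "image"; the right column stands in for an arrow region.
image = [[0.2, 0.9], [0.4, 0.9]]
arrow_mask = [[0, 1], [0, 1]]


def robust_model(img):
    """Toy model that only reads the left column (ignores the arrow)."""
    return 0.5 * (img[0][0] + img[1][0])


def leaky_model(img):
    """Toy model that only reads the arrow pixel."""
    return img[0][1]


robust_share = saliency_fraction(input_gradient(robust_model, image), arrow_mask)
leaky_share = saliency_fraction(input_gradient(leaky_model, image), arrow_mask)
print(robust_share, leaky_share)
```

A share near zero on the arrow region is consistent with the model ignoring the annotation; a large share would suggest it is being used as a shortcut.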

ieee8023, Mar 18 '20 01:03

Is it possible to mark such images in the metadata.csv?
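A minimal sketch of what such a marking could look like, using only the standard library. It assumes the metadata has a `filename` column and adds a hypothetical `has_annotation` column; the column name, the `Y`/`N` values, and the sample rows are illustrative and not part of the actual metadata.csv.

```python
import csv
import io


def flag_annotated(csv_text, annotated_filenames, column="has_annotation"):
    """Return CSV text with an extra Y/N column marking annotated images.

    Assumes a `filename` column exists; `column` is a hypothetical new field.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames) + [column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = "Y" if row["filename"] in annotated_filenames else "N"
        writer.writerow(row)
    return out.getvalue()


# Made-up rows for illustration only.
sample = "filename,finding\na.jpg,COVID-19\nb.jpg,No Finding\n"
flagged = flag_annotated(sample, {"a.jpg"})
print(flagged)
```

With a flag like this, annotated images could be filtered out or cropped before training on a combined healthy/sick dataset.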

bfreskura, Apr 09 '20 09:04