covid-chestxray-dataset
Do the red arrows on some images create a danger for data leakage?
It just occurred to me that arrows only occur on images with a positive diagnosis, so this could cause data leakage.
That might not be as much of a problem if you are using these images for differential diagnosis and already know the patient has something, but it could be an issue if this dataset is combined with healthy images to decide whether a patient is healthy or sick.
If images from patients who might be healthy are being compared to these images, the small figure labels (e.g. "A", "B") could also lead to data leakage.
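One rough way to gauge how serious this is would be to compare how often annotations appear per label. The sketch below uses only toy data; the `has_annotation` column is hypothetical (nothing in this thread says metadata.csv has it), and the `filename` and `finding` column names are assumed for illustration.

```python
import csv
import io
from collections import Counter


def annotation_rate_by_label(rows, label_col="finding", flag_col="has_annotation"):
    """Return {label: fraction of images flagged as annotated}.

    Both column names are assumptions made for this sketch.
    """
    totals, flagged = Counter(), Counter()
    for row in rows:
        label = row[label_col]
        totals[label] += 1
        if row[flag_col].strip().lower() in {"1", "true", "yes", "y"}:
            flagged[label] += 1
    return {label: flagged[label] / totals[label] for label in totals}


# Toy rows, not real entries from the dataset.
SAMPLE = """filename,finding,has_annotation
a.png,COVID-19,Y
b.png,COVID-19,N
c.png,No Finding,N
d.png,No Finding,N
"""

rates = annotation_rate_by_label(csv.DictReader(io.StringIO(SAMPLE)))
print(rates)
```

If the rate is well above zero for positive labels and zero for healthy ones, a classifier could in principle learn "arrow means sick" instead of the pathology.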
True. This is a challenge to overcome. However, models trained on a lot of data don't seem to suffer from this issue. Look at this example, processed using a model trained on the 100k NIH examples:
The gradient of the prediction with respect to the input shows that the model is not using the arrows to make its prediction. So it is possible that the features from those pretrained models (in the torchxrayvision library) can ignore the arrows and focus on the right features.
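To make the gradient check concrete, here is a minimal, library-free sketch. It estimates the input gradient by central differences on a tiny toy image and measures how much of the gradient magnitude falls inside an annotation mask. `toy_predict`, the mask, and both helper names are made up for illustration; with a real network you would use an autograd framework rather than finite differences, and this is not the code behind the example above.

```python
def input_gradient(predict, image, eps=1e-4):
    """Central-difference estimate of d predict(image) / d pixel, per pixel."""
    grad = [[0.0] * len(row) for row in image]
    for i, row in enumerate(image):
        for j, value in enumerate(row):
            plus = [r[:] for r in image]
            minus = [r[:] for r in image]
            plus[i][j] = value + eps
            minus[i][j] = value - eps
            grad[i][j] = (predict(plus) - predict(minus)) / (2 * eps)
    return grad


def saliency_fraction_in_mask(grad, mask):
    """Share of total |gradient| that lies on masked (annotated) pixels."""
    total = sum(abs(g) for row in grad for g in row)
    inside = sum(
        abs(g)
        for grad_row, mask_row in zip(grad, mask)
        for g, m in zip(grad_row, mask_row)
        if m
    )
    return inside / total if total else 0.0


def toy_predict(img):
    """Hypothetical model that only reads the left column of a 2x2 image."""
    return 0.5 * img[0][0] + 0.25 * img[1][0]


image = [[0.2, 0.9], [0.4, 0.9]]
arrow_mask = [[0, 1], [0, 1]]  # pretend the arrow covers the right column

grad = input_gradient(toy_predict, image)
fraction = saliency_fraction_in_mask(grad, arrow_mask)
print(fraction)
```

A fraction near zero suggests the prediction does not depend on the annotated pixels; a large fraction would be a warning sign for leakage.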
Is it possible to mark such images in metadata.csv?
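For what it's worth, marking them could be as simple as one extra column. A minimal sketch, assuming a `filename` column and a new, hypothetical `has_annotation` column (neither name is confirmed in this thread), with the set of annotated files identified by hand:

```python
import csv
import io


def mark_annotated(csv_text, annotated_files, flag_col="has_annotation"):
    """Return the CSV text with an extra Y/N column for annotated images.

    `flag_col` and the `filename` key are assumptions made for this sketch.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames) + [flag_col]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[flag_col] = "Y" if row["filename"] in annotated_files else "N"
        writer.writerow(row)
    return out.getvalue()


# Toy rows, not real entries from the dataset.
original = "filename,finding\na.png,COVID-19\nb.png,COVID-19\n"
marked = mark_annotated(original, annotated_files={"a.png"})
print(marked)
```

Users combining the dataset with healthy images could then filter out or crop the flagged rows.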