Add support for the ISIC 2018 Bias Dataset
Authors
- Shivani Mangaleswaran ([email protected])
- Garima Maheshwari ([email protected])

Contribution as part of the final project for CS598-DL4H.
Dataset Overview
This PR adds the ISIC bias metadata dataset. The dataset was identified as a required dependency while reproducing this paper:
Jin, Q. 2025. A Study of Artifacts on Melanoma Classification under Diffusion-Based Perturbations. In Proceedings of Machine Learning Research: Conference on Health, Inference, and Learning (CHIL) 2025, volume 287, 1–14. PMLR. https://proceedings.mlr.press/v287/jin25b.html
However, we traced the dataset's origin to this paper: Alceu Bissoto, Eduardo Valle, Sandra Avila, "Debiasing Skin Lesion Datasets and Models? Not So Fast", 2020; https://doi.org/10.48550/arXiv.2004.11457. The authors of that paper published the dataset itself here: https://github.com/alceubissoto/debiasing-skin/blob/master/artefacts-annotation/isic_bias.csv and this is the dataset we are contributing as part of this PR.
PR Overview
- Added support for the ISIC Bias metadata dataset.
- The original dataset is a CSV delimited by ';' instead of ',', so we've included cleaning/handling for both formats
- Added unit tests that load the sample dataset and verify specific data in it
- Made an additional change to base_dataset.py to use pl.len instead of pl.count to avoid warnings
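The delimiter handling mentioned above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code in pyhealth/datasets/isic_bias.py; the helper names detect_delimiter and read_rows, and the column names in the sample rows, are hypothetical.

```python
import csv
import io


def detect_delimiter(header_line: str, candidates=(";", ",")) -> str:
    """Return the candidate delimiter that occurs most often in the header line.

    On a tie (e.g. a single-column file) the first candidate is returned.
    """
    return max(candidates, key=header_line.count)


def read_rows(text: str) -> list:
    """Parse CSV text delimited by either ';' or ',' into a list of dicts."""
    # Use the first line (the header) to decide which delimiter is in use.
    header = next(iter(text.splitlines()), "")
    reader = csv.DictReader(io.StringIO(text), delimiter=detect_delimiter(header))
    return list(reader)


# Hypothetical sample rows; these column names are placeholders,
# not the real schema of isic_bias.csv.
raw = "image;hair;ruler\nISIC_A;0;1\n"
cleaned = "image,hair,ruler\nISIC_A,0,1\n"

# Both variants should parse to the same records.
assert read_rows(raw) == read_rows(cleaned)
```

Choosing the delimiter from the header line keeps the check cheap and avoids being misled by stray punctuation in later rows, though it assumes the header itself contains no quoted delimiters.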
Main files to review
- pyhealth/datasets/isic_bias.py - Contains the main ISICBiasDataset, which loads the data
- tests/core/test_isic.py - Unit tests
Supporting files to review
- docs/api/datasets.rst - Documentation
- docs/api/datasets/pyhealth.datasets.ISICBiasDataset.rst - Documentation
- pyhealth/datasets/base_dataset.py - Update to use pl.len instead of pl.count to avoid warnings
- test-resources/core/isic_artifacts/isic_artifacts_cleaned/isic_bias.csv - Cleaned ',' delimited sample dataset
- test-resources/core/isic_artifacts/isic_artifacts_raw/isic_bias.csv - Raw ';' delimited sample dataset
Test Cases:
Performed on both the raw and cleaned datasets:
- Load the dataset
- Verify the patient count
- Verify the stats evaluation
- Fetch a particular sample in the dataset and validate the expected value of each event
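As a rough sketch of how one test can cover both variants, the example below runs the same assertions against a ';' delimited and a ',' delimited sample using unittest's subTest. It is not the code in tests/core/test_isic.py: the inline sample data, column names, and the load helper are hypothetical stand-ins, and it uses the standard csv module rather than ISICBiasDataset.

```python
import csv
import io
import unittest

# Hypothetical stand-ins for the raw and cleaned sample files;
# column names and values are illustrative only.
SAMPLES = {
    "raw": ("image;hair;ruler\nISIC_A;0;1\nISIC_B;1;0\n", ";"),
    "cleaned": ("image,hair,ruler\nISIC_A,0,1\nISIC_B,1,0\n", ","),
}


def load(text, delimiter):
    """Parse CSV text with the given delimiter into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


class TestBothVariants(unittest.TestCase):
    def test_load_and_sample(self):
        for name, (text, delimiter) in SAMPLES.items():
            # subTest reports a failure per variant instead of stopping at the first.
            with self.subTest(variant=name):
                rows = load(text, delimiter)
                # Record count check, analogous to verifying the patient count.
                self.assertEqual(len(rows), 2)
                # Fetch one record and validate every field.
                self.assertEqual(
                    rows[0], {"image": "ISIC_A", "hair": "0", "ruler": "1"}
                )


def run_suite():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestBothVariants)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_suite() executes the test case and returns a result object whose wasSuccessful() method reports whether both variants passed.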