Add support for the ISIC 2018 Bias Dataset

Open shivani-mangaleswaran opened this issue 1 month ago • 0 comments

Authors

Shivani Mangaleswaran and Garima Maheshwari. Contributed as part of the final project for CS598-DL4H.

Dataset Overview

This PR adds the ISIC bias metadata dataset. The dataset was identified as a required dependency while reproducing this paper:

Jin, Q. 2025. A Study of Artifacts on Melanoma Classification under Diffusion-Based Perturbations. In Proceedings of Machine Learning Research: Conference on Health, Inference, and Learning (CHIL) 2025, volume 287, 1–14. PMLR. https://proceedings.mlr.press/v287/jin25b.html

However, we traced the dataset's origin to this paper: Alceu Bissoto, Eduardo Valle, Sandra Avila, "Debiasing Skin Lesion Datasets and Models? Not So Fast", 2020; https://doi.org/10.48550/arXiv.2004.11457. The authors of that paper published the dataset itself here: https://github.com/alceubissoto/debiasing-skin/blob/master/artefacts-annotation/isic_bias.csv. This is the dataset we're contributing as part of this PR.

PR Overview

Added support for the ISIC Bias metadata dataset.
The original dataset is a CSV delimited by ';' instead of ',', so we've included cleaning/handling for both delimiters
Added unit tests to load the sample dataset and verify specific data in it
Made an additional change to base_dataset.py to use pl.len instead of pl.count to avoid warnings
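
As a rough illustration of the delimiter handling described above, here is a minimal standard-library sketch. It is not the code in pyhealth/datasets/isic_bias.py: the helper names (detect_delimiter, read_rows) and the column names (image, hair, ruler) are assumptions made up for this example.

```python
import csv
import io


def detect_delimiter(header_line: str, candidates: str = ";,") -> str:
    """Return whichever candidate delimiter occurs most often in the header line."""
    return max(candidates, key=header_line.count)


def read_rows(text: str) -> list[dict[str, str]]:
    """Parse CSV text whose delimiter is either ';' or ','."""
    if not text.strip():
        return []
    first_line = text.splitlines()[0]
    delimiter = detect_delimiter(first_line)
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


# Hypothetical column names, used only for illustration.
raw = "image;hair;ruler\nISIC_0000000;0;1\n"
cleaned = "image,hair,ruler\nISIC_0000000,0,1\n"

# Both variants should parse to the same rows.
assert read_rows(raw) == read_rows(cleaned)
```

Counting delimiters in the header line keeps the check simple and predictable. It would misfire if a header field itself contained the other delimiter, which this sketch assumes does not happen.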

Main files to review

pyhealth/datasets/isic_bias.py - Has the main ISICBiasDataset that loads data
tests/core/test_isic.py - Unit tests

Supporting files to review

docs/api/datasets.rst - Documentation
docs/api/datasets/pyhealth.datasets.ISICBiasDataset.rst - Documentation
pyhealth/datasets/base_dataset.py - Update to use pl.len instead of pl.count to avoid warnings
test-resources/core/isic_artifacts/isic_artifacts_cleaned/isic_bias.csv - Cleaned ',' delimited sample dataset
test-resources/core/isic_artifacts/isic_artifacts_raw/isic_bias.csv - Raw ';' delimited sample dataset

Test Cases:

Performed on both the raw and cleaned datasets:

Loading the dataset
Verifying patient count
Verifying stats eval
Fetching a particular sample in the dataset and validating the expected value of each event
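
One way to run the same checks against both the raw and cleaned files is unittest's subTest, sketched below. This is only an illustration of the pattern, not the contents of tests/core/test_isic.py: load_rows is a stand-in loader rather than the PyHealth API, and the sample rows and column names are hypothetical.

```python
import csv
import tempfile
import unittest
from pathlib import Path

# Hypothetical sample content; the real columns in isic_bias.csv may differ.
SAMPLE_ROWS = [
    {"image": "ISIC_0000000", "hair": "0", "ruler": "1"},
    {"image": "ISIC_0000001", "hair": "1", "ruler": "0"},
]


def write_sample(path: Path, delimiter: str) -> None:
    """Write SAMPLE_ROWS to path using the given delimiter."""
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=list(SAMPLE_ROWS[0]), delimiter=delimiter
        )
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)


def load_rows(path: Path) -> list[dict[str, str]]:
    """Stand-in loader (not the PyHealth API): picks ';' or ',' from the header."""
    with path.open(newline="") as handle:
        header = handle.readline()
        handle.seek(0)
        delimiter = ";" if header.count(";") > header.count(",") else ","
        return [dict(row) for row in csv.DictReader(handle, delimiter=delimiter)]


class TestBothVariants(unittest.TestCase):
    def test_raw_and_cleaned(self):
        with tempfile.TemporaryDirectory() as tmp:
            for name, delimiter in (("raw", ";"), ("cleaned", ",")):
                # subTest reports each variant separately on failure.
                with self.subTest(variant=name):
                    path = Path(tmp) / f"{name}.csv"
                    write_sample(path, delimiter)
                    rows = load_rows(path)
                    # Row count check, then a specific-sample check.
                    self.assertEqual(len(rows), len(SAMPLE_ROWS))
                    self.assertEqual(rows[1], SAMPLE_ROWS[1])
```

Saved as a module, this can be run with python -m unittest. A failure in one variant is reported with its variant label, and the other variant still runs.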

Dec 06 '25 21:12 shivani-mangaleswaran