Add probabilistic extrapolation model for classification accuracy

Open minghao-sun-sc opened this issue 1 year ago • 0 comments

Pull Request: Accuracy Extrapolation Module for PyHealth

Contributor Information

  • Contributors: Minghao Sun
  • UIUC NetID: msun60
  • Paper title: A Probabilistic Method to Predict Classifier Accuracy on Larger Datasets given Small Pilot Data
  • Paper link: https://arxiv.org/abs/2311.18025

Contribution Type

Dataset Performance Extrapolation Module

Description

This pull request adds a new module to PyHealth that predicts how a model's performance (accuracy, AUROC, etc.) would change if it were trained on a larger dataset, using results from a small pilot dataset. The implementation builds on the APEx-GP approach from "A Probabilistic Method to Predict Classifier Accuracy on Larger Datasets given Small Pilot Data" and makes two changes to it:

  1. Matern kernels: Model learning curves more realistically than the standard RBF kernel, reducing MSE by up to 13.1%
  2. Beta priors: Better suited to accuracy metrics such as AUROC that are bounded in [0, 1]
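To make the two changes concrete, here is a minimal NumPy sketch of the ingredients involved. It is not the code in `pyhealth/metrics/extrapolation.py`: the function names, the choice of smoothness nu = 5/2, and the default hyperparameters are all assumptions made for illustration. It shows the closed-form Matern 5/2 kernel next to the RBF kernel, and the log-density of a Beta distribution, which has support only on (0, 1).

```python
import math

import numpy as np


def _pairwise_distance(x1, x2):
    """Absolute distance between every pair of 1-D inputs."""
    a = np.asarray(x1, dtype=float)[:, None]
    b = np.asarray(x2, dtype=float)[None, :]
    return np.abs(a - b)


def rbf_kernel(x1, x2, lengthscale=1.0, outputscale=1.0):
    """Squared-exponential (RBF) kernel; sample paths are infinitely smooth."""
    r = _pairwise_distance(x1, x2)
    return outputscale * np.exp(-0.5 * (r / lengthscale) ** 2)


def matern52_kernel(x1, x2, lengthscale=1.0, outputscale=1.0):
    """Matern kernel with nu = 5/2 (assumed here); rougher paths than RBF."""
    r = _pairwise_distance(x1, x2)
    s = math.sqrt(5.0) * r / lengthscale
    return outputscale * (1.0 + s + s**2 / 3.0) * np.exp(-s)


def beta_log_pdf(p, a, b):
    """Log-density of Beta(a, b) at p; -inf outside the open interval (0, 1)."""
    if not 0.0 < p < 1.0:
        return -math.inf
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(p) + (b - 1.0) * math.log(1.0 - p)


# Hypothetical pilot dataset sizes, compared on a log scale.
log_sizes = np.log([100, 200, 400, 800])
K_matern = matern52_kernel(log_sizes, log_sizes, lengthscale=1.0)
K_rbf = rbf_kernel(log_sizes, log_sizes, lengthscale=1.0)
```

The Beta log-density returns negative infinity for any value outside (0, 1), which is the property that makes it a natural prior for a bounded quantity such as an asymptotic AUROC.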

The module is particularly valuable for healthcare ML applications, where data collection is expensive and time-consuming: it helps researchers decide whether collecting more data is likely to improve model performance significantly.
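One way such a decision could be framed, assuming the extrapolation yields a Gaussian predictive distribution (a mean and a standard deviation) for the score at the larger dataset size, is to compute the probability that the gain over the pilot score exceeds a chosen threshold. The function name and its arguments are hypothetical and are not part of the module's API.

```python
import math


def prob_gain_exceeds(pilot_score, pred_mean, pred_std, min_gain):
    """P(score at the larger size >= pilot_score + min_gain).

    Assumes the predicted score follows Normal(pred_mean, pred_std**2).
    """
    if pred_std <= 0:
        raise ValueError("pred_std must be positive")
    z = (pilot_score + min_gain - pred_mean) / pred_std
    # Upper-tail probability of a standard normal at z.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical numbers: pilot AUROC 0.80, predicted 0.85 +/- 0.02 at the larger size.
p_small_gain = prob_gain_exceeds(0.80, 0.85, 0.02, min_gain=0.01)
p_large_gain = prob_gain_exceeds(0.80, 0.85, 0.02, min_gain=0.05)
```

A high probability for a gain that matters clinically supports collecting more data; a low one suggests the pilot model is already close to what the larger dataset would deliver.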

Files Overview

  • Core Implementation:

    • pyhealth/metrics/extrapolation.py: Main module implementing GP-based performance extrapolation
    • pyhealth/metrics/__init__.py: Updated to include the new module exports
    • pyhealth/utils.py: Added tensor_to_numpy helper function
  • Examples & Documentation:

    • pyhealth/metrics/README_EXTRAPOLATION.md: Detailed module documentation
    • PyHealth/examples/accuracy_extrapolation_example.py: Example usage script
  • Tests:

    • pyhealth/unittests/test_extrapolation.py: Unit tests for the module
  • Dependencies:

    • Added gpytorch and matplotlib to requirements.txt
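The description does not show the signature of the `tensor_to_numpy` helper added to `pyhealth/utils.py`. As a rough sketch of what such a helper typically does, assuming it accepts either a PyTorch tensor or any array-like, the version below detaches the tensor from the autograd graph, moves it to the CPU, and converts it. It uses duck typing rather than importing torch, which is a simplification for this example and may differ from the actual implementation.

```python
import numpy as np


def tensor_to_numpy(x):
    """Convert a tensor-like or array-like object to a NumPy array.

    Hypothetical sketch: objects exposing detach()/cpu()/numpy(), as
    PyTorch tensors do, are converted through those methods; anything
    else is passed to np.asarray.
    """
    if hasattr(x, "detach"):
        x = x.detach()  # drop the autograd graph
    if hasattr(x, "cpu"):
        x = x.cpu()  # move off the accelerator if needed
    if hasattr(x, "numpy"):
        return x.numpy()
    return np.asarray(x)
```

Plain lists and existing NumPy arrays pass through `np.asarray`, so callers do not need to check the input type first.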

minghao-sun-sc, May 08 '25 03:05