Add probabilistic extrapolation model for classification accuracy

Open minghao-sun-sc opened this issue 1 year ago • 0 comments

Pull Request: Accuracy Extrapolation Module for PyHealth

Contributor Information

Contributors: Minghao Sun
UIUC NetID: msun60
Paper title: A Probabilistic Method to Predict Classifier Accuracy on Larger Datasets given Small Pilot Data
Paper link: https://arxiv.org/abs/2311.18025

Contribution Type

Dataset Performance Extrapolation Module

Description

This pull request adds a new module to PyHealth that lets users predict how a model's performance (accuracy, AUROC, etc.) would change if it were trained on a larger dataset, using only results from a smaller pilot dataset. The implementation builds on the APEx-GP approach from "A Probabilistic Method to Predict Classifier Accuracy on Larger Datasets given Small Pilot Data" with two significant improvements:

- Matérn Kernels: Model learning curves more realistically than standard RBF kernels, reducing MSE by up to 13.1%
- Beta Priors: Better handle bounded metrics (like AUROC) that are constrained to [0,1]
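
To make the two ingredients concrete, here is a minimal, dependency-free sketch of a Matérn kernel next to an RBF kernel, plus a Beta log-density for a metric bounded in [0,1]. The function names (`rbf_kernel`, `matern52_kernel`, `beta_logpdf`) and the smoothness choice ν = 5/2 are assumptions made for illustration; they are not taken from `pyhealth/metrics/extrapolation.py`, which builds on gpytorch.

```python
import math


def rbf_kernel(r: float, lengthscale: float = 1.0) -> float:
    """Squared-exponential (RBF) kernel as a function of distance r >= 0."""
    return math.exp(-0.5 * (r / lengthscale) ** 2)


def matern52_kernel(r: float, lengthscale: float = 1.0) -> float:
    """Matern kernel with smoothness nu = 5/2 (an assumed choice).

    Sample paths are only twice differentiable, so the kernel is less
    smooth than RBF and decays more slowly at large distances.
    """
    s = math.sqrt(5.0) * r / lengthscale
    return (1.0 + s + s * s / 3.0) * math.exp(-s)


def beta_logpdf(x: float, a: float, b: float) -> float:
    """Log-density of Beta(a, b); -inf outside the open interval (0, 1)."""
    if not 0.0 < x < 1.0:
        return float("-inf")
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x) - log_norm


if __name__ == "__main__":
    # Compare how quickly correlation decays with distance under each kernel.
    for r in (0.0, 1.0, 3.0):
        print(f"r={r}: rbf={rbf_kernel(r):.4f}, matern52={matern52_kernel(r):.4f}")
    # A Beta prior assigns zero density to impossible metric values such as 1.2.
    print(beta_logpdf(0.85, a=8.0, b=2.0), beta_logpdf(1.2, a=8.0, b=2.0))
```

Both kernels equal 1 at distance 0, but the Matérn kernel keeps more correlation at long range, and the Beta density rules out values outside [0,1] that a Gaussian prior would still allow.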

The module is particularly valuable for healthcare ML applications, where data collection is expensive and time-consuming: it helps researchers decide whether collecting more data is likely to improve model performance enough to justify the cost.

Files Overview

Core Implementation:
- pyhealth/metrics/extrapolation.py: Main module implementing GP-based performance extrapolation
- pyhealth/metrics/__init__.py: Updated to include the new module exports
- pyhealth/utils.py: Added tensor_to_numpy helper function
Examples & Documentation:
- pyhealth/metrics/README_EXTRAPOLATION.md: Detailed module documentation
- PyHealth/examples/accuracy_extrapolation_example.py: Example usage script
Tests:
- pyhealth/unittests/test_extrapolation.py: Unit tests for the module
Dependencies:
- Added gpytorch and matplotlib to requirements.txt

May 08 '25 03:05 minghao-sun-sc