PyHealth

MedLink Bounty

Open Rian354 opened this issue 1 month ago • 0 comments

PR for MedLink bounty

Tests: To run the MedLink unit tests, from the project root run:

pytest tests/core/test_medlink.py

Local result: 3 passed, 1 warning.

Model Implementation:

  • Implemented the MedLink retrieval model on top of the current BaseModel / dataset API.
  • Added unit tests with small synthetic data for MedLink.
  • Added a Jupyter notebook that trains and evaluates MedLink on the MIMIC-III demo dataset.

Additions to "pyhealth/models/medlink/model.py":

  • BaseModel-compatible "MedLink" class that takes a task-generated dataset (e.g., "SampleDataset" from "set_task") and "feature_keys".
  • Vocabulary construction from the underlying task dataset using "dataset.get_all_tokens(...)" for queries and documents.
  • Query and corpus encoders ("encode_queries", "encode_corpus") that produce sparse multi-hot representations.
  • BM25-style scoring in "compute_scores", compatible with the IR-format data produced by the MedLink utilities.
  • Combined retrieval and prediction loss in forward / get_loss, returning a scalar loss for training.
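
The vocabulary, multi-hot encoding, and BM25-style scoring steps listed above can be illustrated with a minimal standard-library sketch. This is not the code from this PR: the function names (build_vocab, multi_hot, bm25_scores), the k1 and b defaults, and the smoothed IDF formula are assumptions chosen for illustration, and the actual MedLink class may differ in all of them.

```python
import math
from collections import Counter


def build_vocab(token_lists):
    """Map each unique token to a column index (sorted for determinism)."""
    tokens = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i for i, tok in enumerate(tokens)}


def multi_hot(tokens, vocab):
    """Encode a token list as a 0/1 vector over the vocabulary."""
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec


def bm25_scores(queries, corpus, k1=1.5, b=0.75):
    """Return a len(queries) x len(corpus) matrix of BM25 scores.

    Uses one common smoothed IDF variant (always positive); this is an
    assumption, not necessarily what the PR's BM25Okapi computes.
    """
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    idf = {
        tok: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for tok, df in doc_freq.items()
    }
    scores = []
    for query in queries:
        row = []
        for doc in corpus:
            tf = Counter(doc)
            # Length normalisation: longer documents are penalised.
            norm = k1 * (1 - b + b * len(doc) / avg_len)
            row.append(
                sum(
                    idf[tok] * tf[tok] * (k1 + 1) / (tf[tok] + norm)
                    for tok in set(query)
                    if tok in tf
                )
            )
        scores.append(row)
    return scores


# Synthetic diagnosis codes standing in for patient records.
corpus = [["I10", "E11", "N18"], ["J45", "J30"], ["I10", "I25"]]
queries = [["I10", "N18"], ["J45"]]

vocab = build_vocab(corpus + queries)
query_vecs = [multi_hot(q, vocab) for q in queries]
scores = bm25_scores(queries, corpus)
```

Here scores[i][j] is the score of document j for query i, so a row can be sorted to obtain a candidate ranking for that query.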

Other changes:

  • Extended SampleDataset with get_all_tokens(key: str), which collects the unique tokens across samples; MedLink uses it for vocabulary building.
  • Implemented BM25 and IR helpers in the pyhealth.models.medlink package:
    • BM25Okapi
    • convert_to_ir_format, tvt_split
    • generate_candidates, filter_by_candidates
    • get_bm25_hard_negatives, get_train_dataloader, get_eval_dataloader
  • Exported MedLink from pyhealth/models/__init__.py, so users can write: from pyhealth.models import MedLink
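
To make two of the helpers above concrete, here is a rough sketch of collecting unique tokens across samples and of a train/validation/test split. The names collect_all_tokens and split_ids, the sample layout, the one-level flattening, and the 80/10/10 ratios are all hypothetical; the real SampleDataset.get_all_tokens and tvt_split in this PR may take different arguments and behave differently.

```python
import random


def collect_all_tokens(samples, key):
    """Return the sorted unique tokens found under `key` across all samples."""
    tokens = set()
    for sample in samples:
        for item in sample.get(key, []):
            # Visit-level features may be nested lists; flatten one level.
            if isinstance(item, (list, tuple)):
                tokens.update(item)
            else:
                tokens.add(item)
    return sorted(tokens)


def split_ids(ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle ids deterministically and cut them into train/val/test."""
    ids = sorted(ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# Hypothetical samples: one with nested (per-visit) codes, one flat.
samples = [
    {"patient_id": "p1", "conditions": [["I10", "E11"], ["N18"]]},
    {"patient_id": "p2", "conditions": ["J45", "I10"]},
]
all_tokens = collect_all_tokens(samples, "conditions")

train, val, test = split_ids([f"q{i}" for i in range(10)])
```

Seeding the shuffle keeps the split reproducible across runs, which matters when retrieval metrics are compared between experiments.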

Added examples/medlink_mimic3.ipynb, a runnable notebook that:

  • Loads the MIMIC-III demo dataset via MIMIC3Dataset.
  • Defines a patient linkage task to generate query–candidate pairs.
  • Uses the MedLink helpers to build IR-format data and PyTorch dataloaders.
  • Trains and evaluates MedLink and reports ranking metrics.
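
The description does not say which ranking metrics the notebook reports. As an illustration only, the sketch below computes two common choices for record linkage, recall@k and mean reciprocal rank (MRR), from ranked candidate lists. The function names and the query/document ids are made up for the example.

```python
def recall_at_k(ranked_ids, gold_id, k):
    """1.0 if the true match appears in the top-k candidates, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids, gold_id):
    """1 / rank of the true match, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0


def evaluate(rankings, gold, k=1):
    """Average both metrics over all queries."""
    n = len(rankings)
    recall = sum(recall_at_k(r, gold[q], k) for q, r in rankings.items()) / n
    mrr = sum(reciprocal_rank(r, gold[q]) for q, r in rankings.items()) / n
    return {f"recall@{k}": recall, "mrr": mrr}


# Hypothetical output of a retrieval model: best candidate first.
rankings = {"q1": ["d1", "d2", "d3"], "q2": ["d3", "d2", "d1"]}
gold = {"q1": "d1", "q2": "d2"}

metrics_at_1 = evaluate(rankings, gold, k=1)
metrics_at_2 = evaluate(rankings, gold, k=2)
```

In this toy case q1 finds its match at rank 1 and q2 at rank 2, so recall@1 is 0.5, recall@2 is 1.0, and MRR is 0.75.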

Local verification:

  • examples/medlink_mimic3.ipynb runs end-to-end on the MIMIC-III demo dataset.
  • The notebook includes a note on how to run the MedLink unit tests from the project root.

Files to review:

  • pyhealth/datasets/sample_dataset.py – SampleDataset.get_all_tokens helper for vocabulary construction.
  • pyhealth/models/medlink/model.py – core MedLink model implementation.
  • pyhealth/models/medlink/bm25.py – BM25Okapi implementation used in the retrieval pipeline.
  • pyhealth/models/medlink/utils.py – IR-format conversion, TVT split, candidate generation, dataloaders.
  • pyhealth/models/__init__.py – export of MedLink.
  • tests/core/test_medlink.py – synthetic unit tests for MedLink (forward pass, encoders, score shapes).
  • examples/medlink_mimic3.ipynb – Jupyter notebook for training and evaluating MedLink on the MIMIC-III demo dataset.

Rian354, Dec 08 '25 20:12