MedLink Bounty
PR for MedLink bounty
Tests: To run the MedLink unit tests, run the following from the project root:
pytest tests/core/test_medlink.py (local result: 3 passed, 1 warning)
Model Implementation:
- Implemented the MedLink retrieval model on top of the current BaseModel / dataset API.
- Added unit tests with small synthetic data for MedLink.
- Added a Jupyter notebook that trains and evaluates MedLink on the MIMIC-III demo dataset.
Additions to "pyhealth/models/medlink/model.py":
- BaseModel-compatible "MedLink" class that takes a task-generated dataset (e.g., "SampleDataset" from "set_task") and "feature_keys".
- Vocabulary construction from the underlying task dataset using "dataset.get_all_tokens(...)" for queries and documents.
- Query and corpus encoders ("encode_queries", "encode_corpus") that produce sparse multi-hot representations.
- BM25-style scoring in "compute_scores", compatible with the IR-format data produced by the MedLink utilities.
- Combined retrieval and prediction loss in forward / get_loss, returning a scalar loss for training.
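
For reviewers unfamiliar with BM25, the kind of scoring referred to above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the code in "compute_scores" or "bm25.py": the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant `log((N - df + 0.5) / (df + 0.5) + 1)` are assumptions made for the example, and the PR's implementation may differ on each of them.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus` against a tokenized `query`.

    Hypothetical helper for illustration; assumes a non-empty corpus
    containing at least one non-empty document.
    """
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))

    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption, see the lead-in).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Toy corpus of tokenized "documents" (the code strings are placeholders).
corpus = [["428", "250", "401"], ["250"], ["585", "401"]]
query = ["428", "401"]
scores = bm25_scores(query, corpus)
```

In this toy example the first document matches both query tokens and receives the highest score, while the second document shares no tokens with the query and scores zero.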
Other changes:
- Extended SampleDataset with get_all_tokens(key: str), which collects the unique tokens across all samples and is used by MedLink for vocabulary building.
- Implemented BM25 and IR helpers in the pyhealth.models.medlink package:
  - BM25Okapi
  - convert_to_ir_format, tvt_split
  - generate_candidates, filter_by_candidates
  - get_bm25_hard_negatives, get_train_dataloader, get_eval_dataloader
- Exported MedLink via pyhealth/models/__init__.py, so users can do: from pyhealth.models import MedLink
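
As a rough illustration of how unique-token collection feeds vocabulary building and the multi-hot encoding used by the encoders, here is a minimal standalone sketch. It is not the actual SampleDataset.get_all_tokens implementation: the free functions `collect_all_tokens` and `multi_hot`, the sample layout (a list of dicts with nested token lists), and the choice to ignore out-of-vocabulary tokens are all assumptions made for the example.

```python
def collect_all_tokens(samples, key):
    """Return the sorted unique tokens stored under `key` across all samples.

    Hypothetical stand-in for a get_all_tokens-style helper; nested
    lists (e.g. one list of codes per visit) are flattened.
    """
    seen = set()

    def visit(value):
        if isinstance(value, (list, tuple)):
            for item in value:
                visit(item)
        else:
            seen.add(value)

    for sample in samples:
        visit(sample[key])
    return sorted(seen)


def multi_hot(tokens, vocab):
    """Encode a flat token list as a 0/1 vector indexed by `vocab`."""
    vec = [0] * len(vocab)
    for token in tokens:
        index = vocab.get(token)
        if index is not None:  # assumption: unknown tokens are skipped
            vec[index] = 1
    return vec


# Toy samples; the "conditions" key and the code strings are placeholders.
samples = [
    {"conditions": [["428", "250"], ["401"]]},
    {"conditions": [["250", "585"]]},
]
all_tokens = collect_all_tokens(samples, "conditions")
vocab = {token: i for i, token in enumerate(all_tokens)}
encoded = multi_hot(["428", "250", "428"], vocab)
```

Repeated tokens set the same position only once, which is what distinguishes a multi-hot vector from a count vector.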
Added examples/medlink_mimic3.ipynb, a runnable notebook that:
- Loads the MIMIC-III demo dataset via MIMIC3Dataset.
- Defines a patient linkage task to generate query–candidate pairs.
- Uses the MedLink helpers to build IR-format data and PyTorch dataloaders.
- Trains and evaluates MedLink and reports ranking metrics.
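
The description does not name the ranking metrics the notebook reports. As a hedged illustration of what such an evaluation typically computes for a linkage task with one true match per query, the sketch below calculates mean reciprocal rank (MRR) and hits@k. The function names, the input layout (dicts keyed by query id), and the choice of these two metrics are assumptions for the example, not a description of the notebook's code.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """Return 1 / rank of the relevant id, or 0.0 if it was not retrieved."""
    for rank, candidate_id in enumerate(ranked_ids, start=1):
        if candidate_id == relevant_id:
            return 1.0 / rank
    return 0.0


def evaluate_rankings(rankings, gold, k=1):
    """Compute MRR and hits@k.

    rankings: query id -> candidate ids ordered best first (hypothetical layout)
    gold:     query id -> the single correct candidate id
    """
    n_queries = len(gold)
    mrr = sum(reciprocal_rank(rankings[q], gold[q]) for q in gold) / n_queries
    hits = sum(1.0 for q in gold if gold[q] in rankings[q][:k]) / n_queries
    return {"mrr": mrr, f"hits@{k}": hits}


# Toy example with placeholder ids: q1 is matched at rank 1, q2 at rank 2.
rankings = {"q1": ["p1", "p2"], "q2": ["p3", "p2", "p4"]}
gold = {"q1": "p1", "q2": "p2"}
metrics = evaluate_rankings(rankings, gold, k=1)
```

Both metrics lie in [0, 1], with higher values meaning the correct patient record is ranked closer to the top.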
Locally ran:
- examples/medlink_mimic3.ipynb runs end-to-end on the MIMIC-III demo dataset.
- The notebook includes a note on how to run the MedLink unit tests from the project root.
Files to review:
- pyhealth/datasets/sample_dataset.py – SampleDataset.get_all_tokens helper for vocabulary construction.
- pyhealth/models/medlink/model.py – core MedLink model implementation.
- pyhealth/models/medlink/bm25.py – BM25Okapi implementation used in the retrieval pipeline.
- pyhealth/models/medlink/utils.py – IR-format conversion, TVT split, candidate generation, dataloaders.
- pyhealth/models/__init__.py – export of MedLink.
- tests/core/test_medlink.py – synthetic unit tests for MedLink (forward pass, encoders, score shapes).
- examples/medlink_mimic3.ipynb – Jupyter notebook for training and evaluating MedLink on the MIMIC-III demo dataset.