[CS 598DLH] Add BiLM + BiLSTM NER biomedical example to PyHealth

Open zyacub opened this issue 1 month ago • 0 comments

BiLM + BiLSTM NER Example (Biomedical NER)

This example demonstrates a simple end-to-end pipeline for biomedical named entity recognition (NER) using:

  • A bidirectional language model (BiLM) over tokens for unsupervised pretraining.
  • A BiLSTM token classifier for NER (no CRF for simplicity).
  • A small CoNLL-style biomedical NER dataset (or a built-in synthetic toy dataset).

The goal is to show how a research-style reproduction (BiLM + NER) can be packaged as a reusable PyHealth example, making AI for healthcare (AI4H) models easier to reproduce.

This example is adapted from a course project that reproduces a published biomedical NER architecture and evaluates the effect of BiLM pretraining on NER performance.
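As a rough illustration of the pretraining idea, the sketch below defines a tiny BiLM and a BiLSTM tagger in PyTorch, then copies the embedding table and the forward-direction LSTM weights from one into the other. The class names, layer sizes, and the `init_from_bilm` helper are hypothetical and are not taken from `bilm_ner.py`; the sketch also assumes unpadded batches to keep the loss simple.

```python
import torch
import torch.nn as nn


class TinyBiLM(nn.Module):
    """Hypothetical BiLM: two unidirectional LSTMs sharing one embedding table."""

    def __init__(self, vocab_size, emb_dim=16, hidden_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.fwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.bwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)
        self.loss_fn = nn.CrossEntropyLoss()

    def _lm_loss(self, lstm, tokens):
        # Read tokens[:-1] and predict tokens[1:] at every position.
        hidden, _ = lstm(self.embed(tokens[:, :-1]))
        logits = self.out(hidden)
        targets = tokens[:, 1:].reshape(-1)
        return self.loss_fn(logits.reshape(-1, logits.size(-1)), targets)

    def forward(self, tokens):
        # The forward LM predicts the next token; the backward LM does the
        # same on the reversed sequence, i.e. it predicts the previous token.
        fwd = self._lm_loss(self.fwd_lstm, tokens)
        bwd = self._lm_loss(self.bwd_lstm, tokens.flip(1))
        return fwd + bwd


class TinyTagger(nn.Module):
    """Hypothetical BiLSTM token classifier (no CRF)."""

    def __init__(self, vocab_size, num_tags, emb_dim=16, hidden_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, tokens):
        hidden, _ = self.lstm(self.embed(tokens))
        return self.proj(hidden)  # (batch, seq_len, num_tags)


def init_from_bilm(tagger, bilm):
    """Copy embeddings and forward-direction LSTM weights from the BiLM."""
    with torch.no_grad():
        tagger.embed.weight.copy_(bilm.embed.weight)
        # In nn.LSTM, names without the "_reverse" suffix are the forward direction.
        for name in ("weight_ih_l0", "weight_hh_l0", "bias_ih_l0", "bias_hh_l0"):
            getattr(tagger.lstm, name).copy_(getattr(bilm.fwd_lstm, name))


torch.manual_seed(0)
tokens = torch.randint(0, 20, (4, 7))  # 4 unpadded sentences of 7 token IDs
bilm = TinyBiLM(vocab_size=20)
loss = bilm(tokens)                    # scalar BiLM loss
tagger = TinyTagger(vocab_size=20, num_tags=5)
init_from_bilm(tagger, bilm)
logits = tagger(tokens)
```

In this sketch the baseline model would simply skip the `init_from_bilm` call, so the two NER models differ only in their starting weights.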


Files

  • bilm_ner.py
    Main script, which:

    • Loads a CoNLL-style token-level NER dataset (or a synthetic toy dataset).
    • Builds a BiLM over token IDs.
    • Trains the BiLM for a few epochs on unlabeled sentences.
    • Builds a BiLSTM-based NER model.
    • Trains:
      • a baseline NER model (no pretraining), and
      • a BiLM-pretrained NER model (word embeddings + forward LSTM initialized from BiLM).
    • Reports test F1 for both models.
  • test_bilm_ner.py
    unittest suite, which:

    • Verifies that the synthetic dataset builder works.
    • Verifies that the BiLM forward pass runs and returns a finite scalar loss.
    • Verifies that the NER model forward + backward passes work and produce gradients.
    • Verifies that decode() returns sequences whose lengths match the unpadded token lengths.

These tests are lightweight and designed to run quickly on CPU.
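The last check in the list can be sketched without any deep-learning dependency. The snippet below is only an illustration: `trim_predictions` is a hypothetical stand-in for the post-processing inside `decode()`, and it assumes that padded tag-ID predictions are trimmed using the true sentence lengths. The actual `decode()` signature in `bilm_ner.py` may differ.

```python
import unittest


def trim_predictions(padded_tag_ids, lengths):
    """Drop padding positions so each output matches its sentence length."""
    return [row[:n] for row, n in zip(padded_tag_ids, lengths)]


class TestDecodeLengths(unittest.TestCase):
    def test_lengths_match_unpadded_tokens(self):
        # Three sentences padded to length 5; true lengths are 3, 5 and 1.
        padded = [[1, 0, 2, 0, 0], [0, 0, 0, 0, 0], [2, 1, 0, 0, 0]]
        lengths = [3, 5, 1]
        decoded = trim_predictions(padded, lengths)
        self.assertEqual([len(seq) for seq in decoded], lengths)
```

Saved in a test module, a case like this can be run with `python -m unittest`.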


Dataset Format

By default, the example can run entirely on a synthetic toy dataset (no external files required).
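For illustration, a synthetic toy dataset of this kind could be generated as shown here. The mini-lexicon, the tag set, and the `build_toy_dataset` name are assumptions made for this sketch; they are not the builder shipped in `bilm_ner.py`.

```python
import random

# Hypothetical mini-lexicon: every token always carries the same tag.
LEXICON = {
    "BRAF": "B-GENE",
    "EGFR": "B-GENE",
    "melanoma": "B-DISEASE",
    "mutation": "O",
    "mutations": "O",
    "in": "O",
    "are": "O",
    "common": "O",
}


def build_toy_dataset(num_sentences=8, min_len=3, max_len=6, seed=0):
    """Return a list of (tokens, tags) pairs sampled from LEXICON."""
    rng = random.Random(seed)  # local RNG so the output is reproducible
    vocab = sorted(LEXICON)
    dataset = []
    for _ in range(num_sentences):
        length = rng.randint(min_len, max_len)
        tokens = [rng.choice(vocab) for _ in range(length)]
        tags = [LEXICON[tok] for tok in tokens]
        dataset.append((tokens, tags))
    return dataset


toy = build_toy_dataset()
```

The sentences are random word sequences rather than grammatical text, which is enough for smoke-testing shapes and training loops but not for measuring real NER quality.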

To use a real dataset, provide files in a simple CoNLL-style TSV format:

  • One token per line.
  • Columns: TOKEN<TAB>TAG
  • Sentences separated by blank lines.

Example:

BRAF    B-GENE
mutation    O
in  O
melanoma    B-DISEASE

EGFR    B-GENE
mutations   O
are O
common  O
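
A file in this format can be read with a few lines of plain Python. The sketch below is one possible reader, not the loader used by `bilm_ner.py`; the `read_conll` name is hypothetical, and it assumes the tag is the last whitespace-separated field on each line so that both tab- and space-delimited files parse.

```python
def read_conll(lines):
    """Parse TOKEN<TAB>TAG lines into a list of (tokens, tags) sentences."""
    sentences, tokens, tags = [], [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line marks a sentence boundary.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        # The tag is the last whitespace-separated field; a line with no
        # tag column raises ValueError here.
        token, tag = line.rsplit(maxsplit=1)
        tokens.append(token)
        tags.append(tag)
    if tokens:
        # Flush the final sentence when the input has no trailing blank line.
        sentences.append((tokens, tags))
    return sentences


sample = (
    "BRAF\tB-GENE\nmutation\tO\nin\tO\nmelanoma\tB-DISEASE\n"
    "\n"
    "EGFR\tB-GENE\nmutations\tO\nare\tO\ncommon\tO\n"
)
sentences = read_conll(sample.splitlines())
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `read_conll(open(path, encoding="utf-8"))` with a path of your own.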

zyacub, Dec 08 '25 01:12