ICD-BERT
ICD-BERT: Multi-label Classification of ICD-10 Codes with BERT (CLEF 2019)
MLT-DFKI at CLEF eHealth Task 1: Multi-label Classification with BERT
Code for our submission at CLEF eHealth Task 1: Multilingual Information Extraction. For details, check here.
Requirements
This code was written against the older pytorch-pretrained-bert package. If you already use the newer transformers library, it is recommended to create a virtual environment, although the two packages can co-exist without issues:
pip install pytorch-pretrained-bert
For migration to the new library, look here. For baseline experiments, install scikit-learn as well.
Data
Raw data can be found under exps-data/data/*.txt (this was provided by the task organizers).
Pre-processed data can be found under exps-data/data/{train, dev, test}_data.pkl as pickled files. English translations are also provided for reproducibility (the Google Translate API was used to obtain them).
ICD-10 metadata can be found under exps-data/codes_and_titles_{de, en}.txt, where each line is tab-delimited as [ICD Code Description] \t [ICD Code].
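As a quick way to inspect these files, the following is a minimal sketch, not part of the repository. The helper names load_split and load_icd_titles are hypothetical. The layout of the pickled objects is not documented above, so the loader returns whatever was stored; the metadata parser assumes only the tab-delimited format described above.

```python
import pickle


def load_split(path):
    """Load one pickled split (e.g. train_data.pkl) and return it unchanged.

    The internal layout is not documented in this README, so inspect the
    returned object (type, length, first element) before relying on it.
    """
    with open(path, "rb") as f:
        return pickle.load(f)


def load_icd_titles(path):
    """Parse a codes_and_titles_{de,en}.txt file into {code: description}.

    Each non-empty line is expected to hold a description, a tab, and a code.
    """
    titles = {}
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            # Split on the last tab so a stray tab inside the description survives.
            description, sep, code = line.rpartition("\t")
            if not sep:
                raise ValueError(f"line {line_no}: expected a tab-delimited pair")
            titles[code.strip()] = description.strip()
    return titles
```

Only unpickle files you trust (such as the ones shipped with this repository), since pickle can execute arbitrary code on load.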
Pre-trained Models
For static word embeddings, we used English and German vectors provided by fastText. For domain specific vectors, we used PubMed word2vec (only for English).
For contextualized word embeddings, we used BERT-base-cased and BioBERT for English, and Multilingual-BERT-base-cased for German.
Store all the models under a single directory, referred to below as $MODELS.
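To sanity-check a downloaded fastText text file before training, here is a minimal sketch, not taken from the repository. The helper name read_vec is hypothetical. It assumes the standard fastText .vec text layout: a header line with the vocabulary size and dimension, followed by one word and its values per line.

```python
def read_vec(path, limit=None):
    """Read up to `limit` vectors from a fastText-style .vec text file.

    Returns (dim, {word: [float, ...]}). Lines whose value count does not
    match the header dimension are skipped rather than guessed at.
    """
    vectors = {}
    with open(path, encoding="utf-8", errors="ignore") as f:
        header = f.readline().split()
        dim = int(header[1])  # header is "<vocab_size> <dim>"
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if len(values) != dim:
                continue  # malformed line (e.g. a token containing spaces)
            vectors[word] = [float(v) for v in values]
    return dim, vectors
```

For example, read_vec("path/to/cc.de.300.vec", limit=1000) loads only the first 1000 lines after the header, which is enough to confirm the dimension. The PubMed vectors are distributed as a binary .bin file and need a different reader.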
Running BERT Models
Set the model path, e.g. export BERT_MODEL=$MODELS/pubmed_pmc_470k for BioBERT.
Convert TF checkpoint to PyTorch model
This script is provided by the transformers library, but it may have changed in newer versions, so it is recommended to use the one installed with pytorch-pretrained-bert:
python convert_tf_checkpoint_to_pytorch.py \
--tf_checkpoint_path $BERT_MODEL/biobert_model.ckpt \
--bert_config_file $BERT_MODEL/bert_config.json \
--pytorch_dump_path $BERT_MODEL/pytorch_model.bin
Fine-tune the model
Configure the paths:
export DATA_DIR=exps-data/data
export BERT_EXPS_DIR=tmp/bert-exps-dir
export CUDA_VISIBLE_DEVICES=0,1,2,3
Run the model:
python bert_multilabel_run_classifier.py \
--data_dir $DATA_DIR \
--use_data en \
--bert_model $BERT_MODEL \
--task_name clef \
--output_dir $BERT_EXPS_DIR/output \
--cache_dir $BERT_EXPS_DIR/cache \
--max_seq_length 256 \
--num_train_epochs 20.0 \
--do_train \
--do_eval \
--train_batch_size 64
Results for the English BERT models (BioBERT, BERT-base-cased) can be reproduced with 20 epochs; multilingual BERT requires 25 epochs.
Inference
Run predictions (to switch between the test and dev sets, change the file names manually in the processor):
python bert_multilabel_run_classifier.py \
--data_dir $DATA_DIR \
--use_data en \
--bert_model $BERT_EXPS_DIR/output \
--task_name clef \
--output_dir $BERT_EXPS_DIR/output \
--cache_dir $BERT_EXPS_DIR/cache \
--max_seq_length 256 \
--do_eval
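This README does not spell out how per-code scores become a set of predicted ICD-10 codes. A common approach in multi-label classification is to apply a sigmoid to each logit independently and keep every label above a threshold. The sketch below illustrates that generic idea only; the function names and the 0.5 threshold are assumptions, not the script's actual behavior.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_predictions(logits, labels, threshold=0.5):
    """Return the labels whose independent sigmoid probability >= threshold.

    `logits` holds one raw score per label for a single document; unlike a
    softmax, several labels (or none) can be selected at once.
    """
    if len(logits) != len(labels):
        raise ValueError("need exactly one logit per label")
    return [lab for score, lab in zip(logits, labels) if sigmoid(score) >= threshold]
```

For instance, decode_predictions([2.0, -1.0], ["A00", "B01"]) keeps only "A00", because only its probability exceeds 0.5.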
Evaluate
Use the official evaluation.py script to evaluate:
python evaluation.py --ids_file=$DATA_DIR/ids_development.txt \
--anns_file=$DATA_DIR/anns_train_dev.txt \
--dev_file=$BERT_EXPS_DIR/output/preds_development.txt \
--out_file=$BERT_EXPS_DIR/output/eval_output.txt
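For a quick sanity check while iterating, the sketch below computes micro-averaged precision, recall, and F1 over sets of codes per document. It is not the official script, and both the metric choice and the function name micro_prf are assumptions; the official evaluation.py remains the authoritative scorer.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over per-document code sets.

    `gold` and `pred` map a document id to an iterable of ICD-10 codes.
    Documents missing from one side count as having no codes there.
    """
    tp = fp = fn = 0
    for doc_id in set(gold) | set(pred):
        g = set(gold.get(doc_id, ()))
        p = set(pred.get(doc_id, ()))
        tp += len(g & p)   # codes predicted and correct
        fp += len(p - g)   # codes predicted but not in gold
        fn += len(g - p)   # gold codes that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Micro-averaging pools the counts over all documents before dividing, so documents with many codes weigh more than documents with few.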
Running Other Models
Change configurations here (no CLI yet). Main parameters are:
lang: one of {en, de}.
load_pretrain_ft: whether to use fastText pre-trained embeddings; works for both languages.
load_pretrain_pubmed: whether to use PubMed embeddings; works for English only.
pretrain_file: path to the pre-trained vectors; should be one of path/to/cc.{en, de}.300.vec when load_pretrain_ft=True, and path/to/pubmed2018_w2v_400D.bin when load_pretrain_pubmed=True.
model_name: name of the model; one of {cnn, han, slstm, clstm}.
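The constraints among these parameters can be made explicit with a small validation helper. This is a hypothetical sketch, not the repository's configuration code: the class name BaselineConfig, its defaults, and the rule that only one embedding flag may be set at a time (inferred from there being a single pretrain_file) are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

LANGS = {"en", "de"}
MODEL_NAMES = {"cnn", "han", "slstm", "clstm"}


@dataclass
class BaselineConfig:
    lang: str = "en"
    load_pretrain_ft: bool = False
    load_pretrain_pubmed: bool = False
    pretrain_file: Optional[str] = None
    model_name: str = "cnn"

    def validate(self):
        """Raise ValueError if the settings contradict the rules listed above."""
        if self.lang not in LANGS:
            raise ValueError(f"lang must be one of {sorted(LANGS)}")
        if self.model_name not in MODEL_NAMES:
            raise ValueError(f"model_name must be one of {sorted(MODEL_NAMES)}")
        if self.load_pretrain_pubmed and self.lang != "en":
            raise ValueError("PubMed embeddings work for English only")
        # Assumption: a single pretrain_file implies a single embedding source.
        if self.load_pretrain_ft and self.load_pretrain_pubmed:
            raise ValueError("enable only one of the two embedding flags")
        if (self.load_pretrain_ft or self.load_pretrain_pubmed) and not self.pretrain_file:
            raise ValueError("pretrain_file is required when embeddings are loaded")
        return self
```

A German HAN run with fastText vectors would then look like BaselineConfig(lang="de", load_pretrain_ft=True, pretrain_file="path/to/cc.de.300.vec", model_name="han").validate().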
For other hyperparameters, check here.
After all the models have been tested and the results placed under one directory (the folder names have to be checked manually), use predict.py to reproduce the numbers found in Results.txt.
Citation
If you find our work useful, please consider citing:
@inproceedings{amin2019mlt,
title={MLT-DFKI at CLEF eHealth 2019: Multi-label Classification of ICD-10 Codes with BERT},
author={Amin, Saadullah and Neumann, G{\"u}nter and Dunfield, Katherine Ann and Vechkaeva, Anna and Chapman, Kathryn Annette and Wixted, Morgan Kelly},
booktitle={Proceedings of the 20th Conference and Labs of the Evaluation Forum (Working Notes)},
url = {http://ceur-ws.org/Vol-2380/paper_67.pdf},
pages = {1--15},
year = {2019}
}