PICO_Parser
A clinical BERT-based NLP tool for parsing clinical trial abstracts following the PICO framework
Parse RCT PubMed abstracts following the PICO framework to standardize PICO elements.
- Author: Tian Kang ([email protected])
- Affiliation: Department of Biomedical Informatics, Columbia University (Dr. Chunhua Weng's lab)
- Citation: "Kang, T., Zou, S. and Weng, C., 2019. Pretraining to Recognize PICO Elements from Randomized Controlled Trial Literature. Studies in health technology and informatics, 264, p.188."
UPDATE May, 2020:
1. Solved the issues with BERT-based parser.
2. Pretrained Sentence classification model for RCT abstracts is now available.
Major updates coming soon: ^_^
More modules coming soon for representing medical evidence information comprehensively from RCT abstracts.
User Guide
NEW: BlueBERT-based Parser (bugs solved, May 2020):
Adapted from NCBI-NLP BlueBERT
- Install the dependencies listed in requirements.txt
- If you want to use UMLS to standardize entities, please install 'UMLS' and 'QuickUMLS' locally
- Download the pretrained BlueBERT models for PICO element recognition (link in BERT)
- Edit parser_config.py to customize your own directories and BERT configuration
- Run the parser, specifying your input directory in --data_dir and your output directory in --output_dir. In the input directory, each abstract is stored in its own text file with its PMID as the file name. Example data is provided in the test folder.
python run_bluebert_ner_predict.py --data_dir= --output_dir=
To run the examples:
python run_bluebert_ner_predict.py --data_dir=test/txt --output_dir=test/json
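The input layout described above (one text file per abstract, named by its PMID) can be prepared with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name `write_abstracts` is hypothetical, and the `.txt` suffix is an assumption based on the test/txt folder name.

```python
import tempfile
from pathlib import Path


def write_abstracts(abstracts, data_dir):
    """Write each abstract to <data_dir>/<pmid>.txt and return the file paths.

    `abstracts` maps a PMID string to the abstract text. The ".txt" suffix
    is an assumption; check the files in test/txt for the expected naming.
    """
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for pmid, text in abstracts.items():
        path = out_dir / f"{pmid}.txt"
        # Store the text with surrounding whitespace trimmed and one final newline.
        path.write_text(text.strip() + "\n", encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Placeholder text only; substitute the abstracts you want to parse.
    demo = {"11264545": "Placeholder abstract text."}
    with tempfile.TemporaryDirectory() as tmp:
        for p in write_abstracts(demo, Path(tmp) / "txt"):
            print(p.name)
```

The resulting folder can then be passed to the parser through --data_dir.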
Example
Input test/txt
Parsing results test/json
Original: LSTM Parser:
PICO Element with attributes in JSON/XML
- Install the dependencies listed in requirements.txt
- If you want to use UMLS to standardize entities, please install 'UMLS' and 'QuickUMLS' locally
- Edit parser_config.py to customize your own directories and installation
- Run python Phase1_NER_predict.py to start parsing
Clustering parsed PICO elements to represent study design
- Download the context vector pretrained on all PubMed abstracts from 1990-2019 (download link in cluster/model/download.txt)
- Extract 3 files and put them under cluster/model
- TO BE CONTINUED
Example
JSON
Input example.txt
contains more than 70 abstracts with Methods sections
Parsing results folder example_json_out
{
"pmid": "11264545",
"sentences": {
"sent_1": {
"Section": "METHODS",
"text": "METHODS AND RESULTS : To determine the relative power of radiographic heart measurements for predicting outcome in dilated cardiomyopathy , we retrospectively studied 88 adult patients with chest radiographs obtained within 35 days of echocardiography .",
"entities": {
"entity_1": {
"text": "radiographic heart measurements",
"class": "Outcome",
"negation": 0,
"UMLS": "C0018787:heart,C1306645:radiograph,",
"index": 1,
"start": 10
},
"entity_2": {
"text": "predicting outcome",
"class": "Outcome",
"negation": 0,
"UMLS": "",
"index": 2,
"start": 14
},
"entity_3": {
"text": "dilated cardiomyopathy",
"class": "Participant",
"negation": 0,
"UMLS": "C0007193:dilated cardiomyopathy,",
"index": 3,
"start": 17
},
"entity_4": {
"text": "chest radiographs",
"class": "Participant",
"negation": 0,
"UMLS": "C1306645:radiographs,C0817096:chest,",
"index": 4,
"start": 27
},
"entity_5": {
"text": "echocardiography",
"class": "Participant",
"negation": 0,
"UMLS": "C0013516:echocardiography,",
"index": 5,
"start": 34
}
},
"relations": {}
},
"sent_2": {
"Section": "METHODS",
"text": "Standard radiographic variables were measured for each patient , and the cardiothoracic ( CT ) ratio , frontal cardiac area , and volume were calculated .",
"entities": {
"entity_6": {
"text": "Standard radiographic variables",
"class": "Outcome",
"negation": 0,
"UMLS": "C0038137:Standard,C1306645:radiograph,",
"index": 1,
"start": 0
},
"entity_7": {
"text": "cardiothoracic ( CT ) ratio",
"class": "Outcome",
"negation": 0,
"UMLS": "",
"index": 2,
"start": 11
},
"entity_8": {
"text": "frontal cardiac area",
"class": "Outcome",
"negation": 0,
"UMLS": "C0018787:cardiac,",
"index": 3,
"start": 17
},
"entity_9": {
"text": "volume",
"class": "Outcome",
"negation": 0,
"UMLS": "",
"index": 4,
"start": 22
}
},
"relations": {}
}
}
}
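The JSON output above can be read with the Python standard library. The sketch below is illustrative rather than part of the parser, and the helper names are hypothetical. It groups entity texts by PICO class, splits the "UMLS" field into (CUI, term) pairs, and checks an observation from this example: "start" appears to be the index of the entity's first whitespace-separated token within the sentence. That interpretation is inferred from the sample and is not documented behavior.

```python
import json
from collections import defaultdict

# Trimmed copy of the example output above (two of the five entities in sent_1).
SAMPLE = json.loads("""
{
  "pmid": "11264545",
  "sentences": {
    "sent_1": {
      "Section": "METHODS",
      "text": "METHODS AND RESULTS : To determine the relative power of radiographic heart measurements for predicting outcome in dilated cardiomyopathy , we retrospectively studied 88 adult patients with chest radiographs obtained within 35 days of echocardiography .",
      "entities": {
        "entity_1": {"text": "radiographic heart measurements", "class": "Outcome", "negation": 0,
                     "UMLS": "C0018787:heart,C1306645:radiograph,", "index": 1, "start": 10},
        "entity_3": {"text": "dilated cardiomyopathy", "class": "Participant", "negation": 0,
                     "UMLS": "C0007193:dilated cardiomyopathy,", "index": 3, "start": 17}
      },
      "relations": {}
    }
  }
}
""")


def entities_by_class(doc):
    """Group entity texts by PICO class across all sentences of one abstract."""
    grouped = defaultdict(list)
    for sent in doc["sentences"].values():
        for ent in sent["entities"].values():
            grouped[ent["class"]].append(ent["text"])
    return dict(grouped)


def parse_umls(field):
    """Split a 'CUI:term,CUI:term,' string into (CUI, term) pairs.

    Assumes terms contain no commas, as in the example output.
    """
    pairs = []
    for item in field.split(","):
        if item:  # skip the empty piece left by the trailing comma
            cui, _, term = item.partition(":")
            pairs.append((cui, term))
    return pairs


def span_matches(sentence_text, entity):
    """Check that the entity's tokens sit at token offset `start` in the sentence."""
    tokens = sentence_text.split()
    ent_tokens = entity["text"].split()
    start = entity["start"]
    return tokens[start:start + len(ent_tokens)] == ent_tokens


if __name__ == "__main__":
    print(entities_by_class(SAMPLE))
    print(parse_umls("C0018787:heart,C1306645:radiograph,"))
```

For a real run, replace `SAMPLE` with `json.load` on a file from the output folder.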
XML
Input test.txt
Parsing results temp.xml
A double-blind crossover comparison of pindolol , metoprolol , atenolol and labetalol in mild to moderate hypertension . 1 This study was designed to compare in a double-blind randomized crossover trial , atenolol , labetalol , metoprolol and pindolol . Considerable differences in dose ( atenolol 138 +/- 13 mg daily ; labetalol 308 +/- 34 mg daily ; metoprolol 234 +/- 22 mg daily ; and pindolol 24 +/-2 mg daily were required to produce similar antihypertensive effects .
<abstract>
<sent>
<text>A double-blind crossover comparison of pindolol , metoprolol , atenolol and labetalol in mild to moderate hypertension .</text>
<entity class='Intervention' UMLS='C0031937:pindolol' index='T1' start='5'> pindolol </entity>
<entity class='Intervention' UMLS='C0025859:metoprolol' index='T2' start='7'> metoprolol </entity>
<entity class='Intervention' UMLS='C0004147:atenolol' index='T3' start='9'> atenolol </entity>
<entity class='Intervention' UMLS='C0022860:labetalol' index='T4' start='11'> labetalol </entity>
<entity class='Participant' UMLS='C0020538:hypertension' index='T5' start='13'> mild to moderate hypertension </entity>
</sent>
<sent>
<text>1 This study was designed to compare in a double-blind randomized crossover trial , atenolol , labetalol , metoprolol and pindolol .</text>
<entity class='Intervention' UMLS='C0004147:atenolol' index='T6' start='14'> atenolol </entity>
<entity class='Intervention' UMLS='C0022860:labetalol' index='T7' start='16'> labetalol </entity>
<entity class='Intervention' UMLS='C0025859:metoprolol' index='T8' start='18'> metoprolol </entity>
<entity class='Intervention' UMLS='C0031937:pindolol' index='T9' start='20'> pindolol </entity>
</sent>
<sent>
<text>Considerable differences in dose ( atenolol 138 +/- 13 mg daily ; labetalol 308 +/- 34 mg daily ; metoprolol 234 +/- 22 mg daily ; and pindolol 24 +/-2 mg daily were required to produce similar antihypertensive effects .</text>
<attribute class='modifier' index='T10' start='1'> differences </attribute>
<entity class='Intervention' UMLS='C0004147:atenolol' index='T11' start='5'> atenolol </entity>
<attribute class='measure' index='T12' start='6'> 138 +/- 13 mg daily </attribute>
<entity class='Intervention' UMLS='C0022860:labetalol' index='T13' start='12'> labetalol </entity>
<attribute class='measure' index='T14' start='13'> 308 +/- 34 mg daily </attribute>
<entity class='Intervention' UMLS='C0025859:metoprolol' index='T15' start='19'> metoprolol </entity>
<attribute class='measure' index='T16' start='20'> 234 +/- 22 mg daily </attribute>
<entity class='Intervention' UMLS='C0031937:pindolol' index='T17' start='27'> pindolol </entity>
<attribute class='measure' index='T18' start='28'> 24 +/-2 mg daily </attribute>
<entity class='Outcome' UMLS='C0003364:antihypertensive' index='T19' start='37'> antihypertensive effects </entity>
</sent>
</abstract>
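The XML output can be flattened into one record per annotation with `xml.etree.ElementTree`. This is a minimal sketch under two assumptions: the output file has a single `<abstract>` root as shown above, and `start` is a whitespace-token offset into the sentence text (this holds for the sample but is inferred, not documented). The function name `read_annotations` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Third sentence of the example above, with three of its annotations.
SAMPLE_XML = """
<abstract>
<sent>
<text>Considerable differences in dose ( atenolol 138 +/- 13 mg daily ; labetalol 308 +/- 34 mg daily ; metoprolol 234 +/- 22 mg daily ; and pindolol 24 +/-2 mg daily were required to produce similar antihypertensive effects .</text>
<attribute class='modifier' index='T10' start='1'> differences </attribute>
<entity class='Intervention' UMLS='C0004147:atenolol' index='T11' start='5'> atenolol </entity>
<attribute class='measure' index='T12' start='6'> 138 +/- 13 mg daily </attribute>
</sent>
</abstract>
""".strip()


def read_annotations(xml_text):
    """Return one dict per <entity> or <attribute> element, in document order."""
    root = ET.fromstring(xml_text)
    rows = []
    for sent in root.iter("sent"):
        sentence = sent.findtext("text", default="")
        for node in sent:
            if node.tag not in ("entity", "attribute"):
                continue  # skip the <text> child
            rows.append({
                "tag": node.tag,
                "class": node.get("class"),
                "index": node.get("index"),
                "start": int(node.get("start")),
                "text": (node.text or "").strip(),
                "UMLS": node.get("UMLS", ""),  # attributes carry no UMLS code
                "sentence": sentence,
            })
    return rows


if __name__ == "__main__":
    for row in read_annotations(SAMPLE_XML):
        print(row["index"], row["tag"], row["class"], row["text"])
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`.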