biomedical
Tools for curating biomedical training data for large-scale language modeling
```
In [3]: dsd = load_dataset('bigbio/biodatasets/psytar/psytar.py', name='psytar_bigbio_text', data_dir='/home/galtay/data/
   ...: bigbio/psytar/PsyTAR_dataset.xlsx')
Using custom data configuration psytar_bigbio_text-7247dd615c830efa
Reusing dataset psy_tar_dataset (/home/galtay/.cache/huggingface/datasets/psy_tar_dataset/psytar_bigbio_text-7247dd615c830efa/1.0.0/149b2465b2445f8a388bc2f7af48f0d136d246f718f59743564f154ea3c2dfbf)
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00
```
What exactly are the differences between the gnormplus dataset and the BioCreative II datasets (BC2GM and BC2GN)?
* https://biocreative.bioinformatics.udel.edu/resources/corpora/biocreative-ii-corpus/
* https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/gnormplus/

Currently only gnormplus is implemented (https://github.com/bigscience-workshop/biomedical/blob/master/bigbio/biodatasets/gnormplus/gnormplus.py), but BLURB uses...
Datasets with k-fold definitions (e.g., GAD) are currently cumbersome to use. Maybe consider always enforcing train/dev/test splits, similar to what BLURB did for HoC and BIOSSES. `source` schema could preserve...
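To make the suggestion concrete, here is a minimal sketch of how k-fold definitions could be collapsed into fixed train/validation/test splits. The helper `folds_to_splits` and its rotation rule (one fold for test, the next fold for validation, the rest for train) are assumptions made for illustration. They are not part of this repository, and they are not claimed to match the procedure BLURB used for HoC or BIOSSES.

```python
from typing import Dict, List, Sequence


def folds_to_splits(
    folds: Sequence[Sequence[str]], test_fold: int = 0
) -> Dict[str, List[str]]:
    """Collapse k folds of example ids into train/validation/test.

    Hypothetical helper: the fold at `test_fold` becomes the test split,
    the following fold (wrapping around) becomes validation, and all
    remaining folds are merged into train.
    """
    k = len(folds)
    if k < 3:
        raise ValueError("need at least 3 folds to form train/validation/test")
    if not 0 <= test_fold < k:
        raise ValueError(f"test_fold must be in [0, {k - 1}]")

    dev_fold = (test_fold + 1) % k
    splits: Dict[str, List[str]] = {"train": [], "validation": [], "test": []}
    for i, fold in enumerate(folds):
        if i == test_fold:
            name = "test"
        elif i == dev_fold:
            name = "validation"
        else:
            name = "train"
        splits[name].extend(fold)
    return splits


# Toy example with made-up example ids, four folds of two ids each.
folds = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
splits = folds_to_splits(folds, test_fold=0)
```

Fixing `test_fold` once per dataset would give every user the same three splits, while the original fold assignments could still be kept alongside for anyone who wants full cross-validation.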
From http://participants-area.bioasq.org/general_information/Task9b/
From https://physionet.org/content/mimic-iii-clinical-action/1.0.0/
From http://www.geniaproject.org/genia-corpus/coreference
From https://github.com/pubmedqa/pubmedqa
From https://species.jensenlab.org
From https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/