Create dataloader for PMC-Patients Task 1: Patient Note Recognition (PNR)
Hello, I noticed that this repo has `pmc_patients` for PMC-Patients Task 2: Patient-Patient Similarity (PPS), but no dataloader for PMC-Patients Task 1: Patient Note Recognition (PNR), so I created this pull request to add one. I am not sure whether it would be better to merge this addition into the existing dataloader (`pmc_patients`), so for now I have made it a separate dataloader.
Regarding the dataloader schema: since PMC-Patients PNR does not fit any of the schemas provided here, I followed @galtay's recommendation (via @SamuelCahyawijaya; thanks for relaying the info to me) to implement the source schema only and leave `_SUPPORTED_TASKS` empty.
Please let me know if there's anything I can help with.
- Name: PMC-Patients PNR
- Description: The PMC-Patients dataset consists of 4 tasks, one of which is Patient Note Recognition (PNR). PNR is modeled as a paragraph-level sequential labeling task, similar to named entity recognition (NER). For each article, the input is a sequence of paragraphs p1, p2, ..., pn, where n is the number of paragraphs, and the output is a sequence of BIO tags t1, t2, ..., tn.
- Paper: PMC-Patients: A Large-scale Dataset of Patient Notes and Relations Extracted from Case Reports in PubMed Central
- Data: Google Drive
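
To make the tagging scheme in the description concrete, here is a minimal sketch of paragraph-level BIO tagging. It assumes the usual BIO reading (B marks the first paragraph of a patient note, I a continuation paragraph, O anything else) and represents notes as half-open `(start, end)` paragraph index ranges. The helper `spans_to_bio` and the example article are hypothetical and not taken from the dataset.

```python
from typing import List, Tuple


def spans_to_bio(num_paragraphs: int, note_spans: List[Tuple[int, int]]) -> List[str]:
    """Convert half-open (start, end) paragraph ranges into one BIO tag per paragraph."""
    tags = ["O"] * num_paragraphs
    for start, end in note_spans:
        if not 0 <= start < end <= num_paragraphs:
            raise ValueError(f"invalid span ({start}, {end}) for {num_paragraphs} paragraphs")
        tags[start] = "B"  # first paragraph of the patient note
        for i in range(start + 1, end):
            tags[i] = "I"  # continuation of the same note
    return tags


# Made-up article with n = 5 paragraphs; the note covers paragraphs 1 and 2.
paragraphs = [
    "Background on the condition.",
    "A 54-year-old man presented with chest pain.",
    "On examination, his blood pressure was elevated.",
    "Discussion of the findings.",
    "Conclusion.",
]
tags = spans_to_bio(len(paragraphs), [(1, 3)])
print(tags)  # ['O', 'B', 'I', 'O', 'O']
```

The output has exactly one tag per paragraph, matching the t1, t2, ..., tn sequence described above.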
Checkbox
- [ ] Confirm that this PR is linked to the dataset issue.
- [x] Create the dataloader script `biodatasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- [x] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_BIGBIO_VERSION` variables.
- [x] Implement `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- [ ] Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `BigBioConfig` for the source schema and one for a bigbio schema.
- [x] Confirm dataloader script works with `datasets.load_dataset` function.
- [x] Confirm that your dataloader script passes the test suite run with `python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py`.
- [ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.