biomedical
Closes #119 - Add loctext
Fixes #119
If the following information is NOT present in the issue, please populate:
- Name: LocText
- Description: https://pubannotation.org/projects/LocText
- Paper: https://doi.org/10.1186/s12859-018-2021-9
- Data: https://pubannotation.org/projects/LocText
Checkbox
- [x] Confirm that this PR is linked to the dataset issue.
- [x] Create the dataloader script `biodatasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- [x] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_BIGBIO_VERSION` variables.
- [x] Implement `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- [x] Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `BigBioConfig` for the source schema and one for a bigbio schema.
- [x] Confirm dataloader script works with `datasets.load_dataset` function.
- [x] Confirm that your dataloader script passes the test suite run with `python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py`.
- [x] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
Hi @hakunanatasha, thanks. I will finish this and send it by early next week.
Hi @hakunanatasha, I have now made the relation arguments map to entity IDs so that they can be resolved uniquely. This is similar to the format used in `ddi_corpus`.
data["train"]["entities"][0][:5]
data["train"]["relations"][0][:5]
These calls show the following entities:
[{'id': '10072396-T1',
'type': 'go',
'text': ['nuclear'],
'offsets': [[46, 53]],
'normalized': [{'db_name': 'go', 'db_id': 'GO:0005634'}]},
{'id': '10072396-T2',
'type': 'go',
'text': ['cytoplasmic'],
'offsets': [[58, 69]],
'normalized': [{'db_name': 'go', 'db_id': 'GO:0005737'}]},
{'id': '10072396-T3',
'type': 'taxonomy',
'text': ['Arabidopsis'],
'offsets': [[86, 97]],
'normalized': [{'db_name': 'taxonomy', 'db_id': '3702'}]},
{'id': '10072396-T4',
'type': 'uniprot',
'text': ['COP1'],
'offsets': [[98, 102]],
'normalized': [{'db_name': 'uniprot', 'db_id': 'P43254'}]},
{'id': '10072396-T5',
'type': 'taxonomy',
'text': ['Arabidopsis'],
'offsets': [[108, 119]],
'normalized': [{'db_name': 'taxonomy', 'db_id': '3702'}]}]
And the following relations:
[{'id': '10072396-R1',
'type': 'localizeTo',
'arg1_id': '10072396-T4',
'arg2_id': '10072396-T2',
'normalized': []},
{'id': '10072396-R10',
'type': 'localizeTo',
'arg1_id': '10072396-T29',
'arg2_id': '10072396-T28',
'normalized': []},
{'id': '10072396-R2',
'type': 'localizeTo',
'arg1_id': '10072396-T4',
'arg2_id': '10072396-T1',
'normalized': []},
{'id': '10072396-R3',
'type': 'localizeTo',
'arg1_id': '10072396-T9',
'arg2_id': '10072396-T11',
'normalized': []},
{'id': '10072396-R4',
'type': 'localizeTo',
'arg1_id': '10072396-T9',
'arg2_id': '10072396-T10',
'normalized': []}]
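As a minimal sketch of how these IDs can be resolved, the snippet below joins each relation's `arg1_id` and `arg2_id` back to the matching entity. The records are copied from the output above, trimmed to three entities and three relations. `resolve_relations` is a hypothetical helper written for illustration and is not part of the dataloader script; relations whose arguments fall outside the sliced entity list (such as `10072396-R10`) are skipped.

```python
# Sample records copied from the output above (trimmed for brevity).
entities = [
    {"id": "10072396-T1", "type": "go", "text": ["nuclear"],
     "offsets": [[46, 53]],
     "normalized": [{"db_name": "go", "db_id": "GO:0005634"}]},
    {"id": "10072396-T2", "type": "go", "text": ["cytoplasmic"],
     "offsets": [[58, 69]],
     "normalized": [{"db_name": "go", "db_id": "GO:0005737"}]},
    {"id": "10072396-T4", "type": "uniprot", "text": ["COP1"],
     "offsets": [[98, 102]],
     "normalized": [{"db_name": "uniprot", "db_id": "P43254"}]},
]
relations = [
    {"id": "10072396-R1", "type": "localizeTo",
     "arg1_id": "10072396-T4", "arg2_id": "10072396-T2", "normalized": []},
    {"id": "10072396-R10", "type": "localizeTo",
     "arg1_id": "10072396-T29", "arg2_id": "10072396-T28", "normalized": []},
    {"id": "10072396-R2", "type": "localizeTo",
     "arg1_id": "10072396-T4", "arg2_id": "10072396-T1", "normalized": []},
]


def resolve_relations(entities, relations):
    """Hypothetical helper: map relation argument IDs to entity surface text."""
    by_id = {entity["id"]: entity for entity in entities}
    resolved = []
    for relation in relations:
        arg1 = by_id.get(relation["arg1_id"])
        arg2 = by_id.get(relation["arg2_id"])
        if arg1 is None or arg2 is None:
            # The argument is not in the entity list we were given.
            continue
        resolved.append((arg1["text"][0], relation["type"], arg2["text"][0]))
    return resolved


resolved = resolve_relations(entities, relations)
for triple in resolved:
    print(triple)
```

Because every entity ID carries the document ID as a prefix, a plain dictionary lookup is enough and no per-document bookkeeping is needed.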
@hakunanatasha, can you approve the PR? I have already addressed the requested changes.