
Closes #119 - Add loctext

Open napsternxg opened this issue 2 years ago • 3 comments

Fixes #119

If the following information is NOT present in the issue, please populate:

  • Name: LocText
  • Description: https://pubannotation.org/projects/LocText
  • Paper: https://doi.org/10.1186/s12859-018-2021-9
  • Data: https://pubannotation.org/projects/LocText

Checkbox

  • [x] Confirm that this PR is linked to the dataset issue.
  • [x] Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • [x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • [x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • [x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • [x] Confirm dataloader script works with datasets.load_dataset function.
  • [x] Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
  • [x] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

napsternxg avatar Apr 24 '22 22:04 napsternxg

Hi @hakunanatasha, thanks. I will finish this and send it by early next week.

napsternxg avatar Apr 30 '22 04:04 napsternxg

Hi @hakunanatasha, I have now made the relation arguments map to entity IDs so that they can be resolved uniquely. This is similar to the format used in ddi_corpus.

data["train"]["entities"][0][:5]
data["train"]["relations"][0][:5]

This shows the following entities:

[{'id': '10072396-T1',
  'type': 'go',
  'text': ['nuclear'],
  'offsets': [[46, 53]],
  'normalized': [{'db_name': 'go', 'db_id': 'GO:0005634'}]},
 {'id': '10072396-T2',
  'type': 'go',
  'text': ['cytoplasmic'],
  'offsets': [[58, 69]],
  'normalized': [{'db_name': 'go', 'db_id': 'GO:0005737'}]},
 {'id': '10072396-T3',
  'type': 'taxonomy',
  'text': ['Arabidopsis'],
  'offsets': [[86, 97]],
  'normalized': [{'db_name': 'taxonomy', 'db_id': '3702'}]},
 {'id': '10072396-T4',
  'type': 'uniprot',
  'text': ['COP1'],
  'offsets': [[98, 102]],
  'normalized': [{'db_name': 'uniprot', 'db_id': 'P43254'}]},
 {'id': '10072396-T5',
  'type': 'taxonomy',
  'text': ['Arabidopsis'],
  'offsets': [[108, 119]],
  'normalized': [{'db_name': 'taxonomy', 'db_id': '3702'}]}]

And the following relations:

[{'id': '10072396-R1',
  'type': 'localizeTo',
  'arg1_id': '10072396-T4',
  'arg2_id': '10072396-T2',
  'normalized': []},
 {'id': '10072396-R10',
  'type': 'localizeTo',
  'arg1_id': '10072396-T29',
  'arg2_id': '10072396-T28',
  'normalized': []},
 {'id': '10072396-R2',
  'type': 'localizeTo',
  'arg1_id': '10072396-T4',
  'arg2_id': '10072396-T1',
  'normalized': []},
 {'id': '10072396-R3',
  'type': 'localizeTo',
  'arg1_id': '10072396-T9',
  'arg2_id': '10072396-T11',
  'normalized': []},
 {'id': '10072396-R4',
  'type': 'localizeTo',
  'arg1_id': '10072396-T9',
  'arg2_id': '10072396-T10',
  'normalized': []}]
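As a rough illustration of what "uniquely resolve" means here, the sketch below joins relations to entities through their IDs. The sample records are copied from the output above and trimmed to a few fields; the helper name `resolve_relations` is hypothetical and not part of the dataloader.

```python
# Sample records copied from the output above (one document, fields trimmed).
entities = [
    {"id": "10072396-T1", "type": "go", "text": ["nuclear"]},
    {"id": "10072396-T2", "type": "go", "text": ["cytoplasmic"]},
    {"id": "10072396-T4", "type": "uniprot", "text": ["COP1"]},
]
relations = [
    {"id": "10072396-R1", "type": "localizeTo",
     "arg1_id": "10072396-T4", "arg2_id": "10072396-T2"},
    {"id": "10072396-R2", "type": "localizeTo",
     "arg1_id": "10072396-T4", "arg2_id": "10072396-T1"},
    {"id": "10072396-R10", "type": "localizeTo",
     "arg1_id": "10072396-T29", "arg2_id": "10072396-T28"},
]


def resolve_relations(entities, relations):
    """Hypothetical helper: look up each relation's arguments by entity ID.

    Returns (resolved, unresolved), where resolved holds
    (relation_id, arg1_text, relation_type, arg2_text) tuples and
    unresolved holds IDs of relations whose arguments were not found.
    """
    by_id = {entity["id"]: entity for entity in entities}
    resolved, unresolved = [], []
    for rel in relations:
        arg1 = by_id.get(rel["arg1_id"])
        arg2 = by_id.get(rel["arg2_id"])
        if arg1 is None or arg2 is None:
            unresolved.append(rel["id"])
            continue
        resolved.append((rel["id"], arg1["text"][0], rel["type"], arg2["text"][0]))
    return resolved, unresolved


resolved, unresolved = resolve_relations(entities, relations)
for rel_id, head, rel_type, tail in resolved:
    print(f"{rel_id}: {head} --{rel_type}--> {tail}")
```

In this sample, R10 ends up unresolved only because the entity list is cut off before T28 and T29; run against the full document, every relation argument should find its entity.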

napsternxg avatar May 05 '22 07:05 napsternxg

@hakunanatasha, can you approve the PR? I have already addressed the requested changes.

napsternxg avatar May 14 '22 05:05 napsternxg