Closes #213
- Name: AIMed
Checkbox
- [x] Confirm that this PR is linked to the dataset issue.
- [x] Create the dataloader script `biodatasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- [x] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_BIGBIO_VERSION` variables.
- [x] Implement `_info()`, `_split_generators()` and `_generate_examples()` in the dataloader script.
- [x] Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `BigBioConfig` for the source schema and one for a bigbio schema.
- [x] Confirm the dataloader script works with the `datasets.load_dataset` function.
- [x] Confirm that your dataloader script passes the test suite run with `python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py`.
So, the script I've attached is still failing, but I wanted to check in with you all about my progress. Just so that you know I'm not ghosting on the project!
This dataset has been very challenging. There are a few reasons for this:
- I can't find any meaningful "Source" schema for the dataset, just XML of annotations. I'm not sure what I should do about this, but my current idea is simply to load the XML as text, since that keeps the data as close to the source material as possible. Moreover, I see utility for researchers in parsing the XML themselves if desired.
- There are two zipped directories full of annotated abstracts, `proteins.tar.gz` and `interactions.tar.gz`, each annotated in a slightly different format. Both tag proteins with e.g. `<prot>protein_name</prot>`, but one marks titles and abstracts with HTML tags while the other uses plaintext labels, e.g. `"TI - "` for title and `"AB - "` for abstract. Because the data exists as XML, and especially because there are multiple annotation formats in the dataset, it's been very difficult to get passage/entity offsets correct.
- I used BeautifulSoup to parse the XML, not realizing the stipulation that we need to stick with packages used by the `datasets` package. I should be able to rework it using the built-in HTML parser, or, if necessary, could probably swing an entirely regex-based approach.
It will probably be a day or two before I can look at this again, since I have some other projects I have to attend to. That said, I fully intend to finish this PR and the other dataset I'm currently assigned (see #215), so please bear with me a little longer! I'm curious what the timeline is like for getting to the 100% complete milestone, and I definitely want to help get us across the finish line with these :)
Hey @mcullan thank you very much for your contribution. First of all, as you found out, we are dealing with a very nasty dataset here, which will take some effort to get right. But we are here to help :)
I took a look at the paper and this is what I found out...
`proteins.tar.gz`
The files in `proteins.tar.gz` are indeed in XML format, and the recommended library to use in Python is `xml`, which is built in.
If you look at the file `proteins/abstract1-1` you'll see the following:
```xml
<ArticleTitle><prot><prot>p38</prot> stress-activated protein kinase</prot> inhibitor reverses <prot><prot>bradykinin B(1)</prot> receptor</prot>-mediated component of inflammatory hyperalgesia.</ArticleTitle>
<AbstractText>The effects of a <prot><prot>p38</prot> stress-activated protein kinase</prot>...
```
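A quick way to peek at one of these files with the built-in library (just a sketch; it assumes the fragments are well-formed XML):

```python
import xml.etree.ElementTree as ET

# Read one annotated abstract (path as in the example above).
with open("proteins/abstract1-1") as f:
    # Each file is an XML fragment with several top-level elements,
    # so wrap it in a dummy root before parsing.
    root = ET.fromstring(f"<doc>{f.read()}</doc>")

title = root.find("ArticleTitle")
# itertext() yields the title text with all inline <prot> tags stripped.
print("".join(title.itertext()))
```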
These are PubMed entries, so in the `bigbio` schema you will have 2 passages, one of type "title" and one of type "abstract". According to the paper the text is annotated with proteins, which means that the correct schema for these files is `Tasks.NAMED_ENTITY_RECOGNITION`; this will require populating the `entities` field. And now the twist: this dataset contains nested annotations. This means that this line

```xml
<prot><prot>p38</prot> stress-activated protein kinase</prot>
```

needs to be parsed as:
"entities": [
{
"id": <unique id>,
"type": "protein",
"text": "p38",
"offsets": [TODO],
}
{
"id": <unique id>,
"type": "protein",
"text": "p38 stress-activated protein kinase",
"offsets": [TODO],
}
],
Parsing in-text XML tags is tricky, and the only example I know of that does this is this one, but it does not have nested annotations.
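One way to deal with the nesting is a small stack-based pass over the tagged string: strip the tags while tracking where each `<prot>` opens, so that every closing tag yields an entity with offsets into the clean text. A minimal sketch (illustrative only; the real loader would also assign unique IDs and use the exact bigbio field layout):

```python
import re

def parse_nested_prot(tagged: str):
    """Strip <prot> tags and return (clean_text, entities with offsets)."""
    clean_parts = []  # pieces of the tag-stripped text
    clean_len = 0     # length of clean text emitted so far
    open_starts = []  # stack of start offsets for currently open <prot> tags
    spans = []        # collected (start, end) pairs

    pos = 0
    for match in re.finditer(r"</?prot>", tagged):
        # Text between the previous tag and this one is plain content.
        chunk = tagged[pos:match.start()]
        clean_parts.append(chunk)
        clean_len += len(chunk)
        if match.group() == "<prot>":
            open_starts.append(clean_len)
        else:
            spans.append((open_starts.pop(), clean_len))
        pos = match.end()
    clean_parts.append(tagged[pos:])
    clean_text = "".join(clean_parts)

    entities = [
        {"type": "protein", "text": clean_text[s:e], "offsets": [s, e]}
        for s, e in spans
    ]
    return clean_text, entities

text, ents = parse_nested_prot(
    "<prot><prot>p38</prot> stress-activated protein kinase</prot> inhibitor"
)
# ents[0] -> {'type': 'protein', 'text': 'p38', 'offsets': [0, 3]}
# ents[1] -> {'type': 'protein',
#             'text': 'p38 stress-activated protein kinase', 'offsets': [0, 35]}
```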
`interactions.tar.gz`
The files in `interactions.tar.gz` are even nastier...
According to the paper these are always PubMed abstracts (so `title` + `text`) and annotate both proteins and their relations. This means that the tasks will be `[Tasks.NAMED_ENTITY_RECOGNITION, Tasks.RELATION_EXTRACTION]`. The file `interactions/abstract_for_8700872` looks like this:
```
TI - <p1 pair=1 > <prot> <prot> Vascular endothelial growth factor </prot> - related protein </prot> </p1> : a ligand and specific activator of the tyrosine kinase receptor <p2 pair=1 > <prot> Flt4 </prot> </p2> .
PG - 1988 - 92 AB - The tyrosine kinases <prot> Flt4 </prot> , <prot> Flt1 </prot> , and
```
You will need to parse the entities as above and also include `relations` like this:
```
{
    "id": <unique id>,
    "type": "protein-protein relation/interaction",
    "arg1_id": <unique id of "Vascular endothelial growth factor - related protein">,
    "arg2_id": <unique id of "Flt4">,
    "normalized": [],
}
```
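As a starting point for pulling out the pairs, something like the sketch below could work (again illustrative only; the `pair` attribute values are unquoted, so this is not valid XML and a regex is the pragmatic choice; mapping the argument text back to entity IDs is left to the real loader):

```python
import re

def collect_pairs(tagged: str):
    """Group <p1 pair=N>/<p2 pair=N> spans by their pair number."""
    pairs = {}  # pair number -> {"p1": argument text, "p2": argument text}
    for arg, num, inner in re.findall(
        r"<(p[12]) pair=(\d+) >(.*?)</p[12]>", tagged
    ):
        # Strip the nested <prot> tags inside the argument span.
        pairs.setdefault(num, {})[arg] = re.sub(r"</?prot>", "", inner).strip()
    return pairs

# Shortened, made-up line in the same format as the example above:
pairs = collect_pairs(
    "TI - <p1 pair=1 > <prot> VEGF </prot> </p1> activates "
    "<p2 pair=1 > <prot> Flt4 </prot> </p2> ."
)
# pairs -> {'1': {'p1': 'VEGF', 'p2': 'Flt4'}}
```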
I hope this is not too discouraging. Please ping us here if you need further help!
Please do not forget to remove the `requirements.txt` file from the PR.
#213: for reference
Regarding the "source" schema, I would keep it as close as possible to the `bigbio_kb` schema.
@sg-wbi Thank you for the detailed, thoughtful feedback!
It also simplifies things that I'm supposed to treat the interactions and proteins sets as separate. Up to this point, I had been parsing them together into a combined NER and RE dataset. It sounds like the appropriate thing to do is to keep them separate, maybe as different splits? Or maybe as two entirely different datasets, e.g. `aimed_proteins` and `aimed_interactions`?
That's also very helpful info about them being PubMed entries.
Thanks again. Working on this today, so I should have an update later.
I am glad I could help!
> It sounds like the appropriate thing to do is to keep them separate, maybe as different splits? Or maybe as two entirely different datasets, e.g. aimed_proteins and aimed_interactions?
Well, I would still treat them as a single dataset, i.e. one dataloader script with 2 `subset_id`s, like you said: `aimed_proteins` and `aimed_interactions`.
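Concretely, the `BUILDER_CONFIGS` could look something like this (just a sketch to illustrate the two `subset_id`s; the descriptions and version handling are assumptions following the usual template):

```python
BUILDER_CONFIGS = [
    # One source + one bigbio_kb config per subset_id.
    BigBioConfig(
        name="aimed_proteins_source",
        version=datasets.Version(_SOURCE_VERSION),
        description="AIMed proteins subset, source schema",
        schema="source",
        subset_id="aimed_proteins",
    ),
    BigBioConfig(
        name="aimed_proteins_bigbio_kb",
        version=datasets.Version(_BIGBIO_VERSION),
        description="AIMed proteins subset, BigBio schema",
        schema="bigbio_kb",
        subset_id="aimed_proteins",
    ),
    BigBioConfig(
        name="aimed_interactions_source",
        version=datasets.Version(_SOURCE_VERSION),
        description="AIMed interactions subset, source schema",
        schema="source",
        subset_id="aimed_interactions",
    ),
    BigBioConfig(
        name="aimed_interactions_bigbio_kb",
        version=datasets.Version(_BIGBIO_VERSION),
        description="AIMed interactions subset, BigBio schema",
        schema="bigbio_kb",
        subset_id="aimed_interactions",
    ),
]
```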
@sg-wbi I think there's an issue with trying to set `proteins` and `interactions` as different splits/subsets.
The `interactions` subset can be used for RE and NER, but the `proteins` subset is only suitable for NER. I don't see any way to specify that one subset can be used for multiple tasks and the other can be used for only one task. Your thoughts?
I have a script that loads both in as separate splits, but the `proteins` split causes `test_bigbio` to fail because it has no relations.
Thus, it seems that, to pass the test, I would either need to drop the RE task entirely (even though that is exactly what the interactions subset is made for) or split the subsets into two separate datasets.
@sg-wbi I just read through those two PRs that you referenced this one in, and this makes a lot more sense to me now! I thought the failure meant that the datasets package required them to have the same tasks.
I need to double-check my loader script, though, because I think it missed some relations. After that, I think this should be fine to pass the new tests.
@mcullan sorry for keeping you waiting. Would you still have time to merge this into a single dataset with multiple `subset_id`s? We should have a refined version of the tests merged into master soon (#533).
This dataset will count as two since it was particularly nasty to parse :)
@sg-wbi Sounds good :) I have it mostly done already, waiting in a new commit, so could you please ping me here when the tests are finalized and merged? Then I'll get it wrapped up.
Hey @mcullan thank you for sticking with this dataset for so long! We merged the fix for the unit tests. So for the `proteins` split you should run something like:

```
python -m tests.test_bigbio biodatasets/aimed/aimed.py --config_name aimed_proteins_bigbio_kb --bypass_keys relations
```
@mcullan I understand that it has been a while since the hackathon, but when you have time could you please push the commit you mentioned? It does not matter if it's not 100% ready; I can take it from there. This would still count as a dataset you did.