
Closes #213

Open mcullan opened this issue 2 years ago • 11 comments

  • Name: AIMed

Checkbox

  • [x] Confirm that this PR is linked to the dataset issue.
  • [x] Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • [x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • [x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • [x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • [x] Confirm dataloader script works with datasets.load_dataset function.
  • [x] Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.

So, the script I've attached is still failing, but I wanted to check in with you all about my progress. Just so that you know I'm not ghosting on the project!

This dataset has been very challenging. There are a few reasons for this:

  1. I can't find any meaningful "Source" schema for the dataset, just XML of annotations. I'm not sure what I should do about this, but my current idea is simply to load the XML as text, since that keeps the data as close to the source material as possible. Moreover, I see utility for researchers in parsing the XML themselves if desired.
  2. There are two zipped directories full of annotated abstracts, proteins.tar.gz and interactions.tar.gz. Each of these is annotated in a slightly different format. Both tag proteins with e.g. <prot>protein_name</prot>, but one marks titles and abstracts with XML tags and the other uses plaintext labels, e.g. "TI - " for title and "AB - " for abstract. Because the data exists as XML, and especially because there are multiple annotation formats in the dataset, it's been very difficult to get passage/entity offsets correct.
  3. I used BeautifulSoup to parse the XML, not realizing the stipulation that we need to stick with packages used by the datasets package. I should be able to rework it using the built-in HTML parser, or, if necessary, could probably swing an entirely regex approach.
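
As a rough sketch of the direction in point 3: the standard-library xml.etree.ElementTree can already de-tag a fragment like these, assuming the fragment is well-formed XML. The fragment below is a shortened, made-up example, not an actual file from the dataset:

```python
# Minimal sketch: stripping in-text tags with the standard library
# instead of BeautifulSoup. The fragment is a hypothetical example.
import xml.etree.ElementTree as ET

fragment = "<ArticleTitle><prot>p38</prot> inhibitor reverses hyperalgesia.</ArticleTitle>"
root = ET.fromstring(fragment)

# itertext() walks the tree and yields all text content in document
# order, effectively stripping the <prot> tags.
plain_text = "".join(root.itertext())
print(plain_text)  # p38 inhibitor reverses hyperalgesia.
```

This loses the tag positions, though, so recovering entity offsets would still need a manual walk over the tree (or a regex pass).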

I fully intend to finish this PR and the other dataset I'm currently assigned, but it will probably be a day or two before I can look at them again. I have some other projects I have to attend to.

That being said, I really want to continue with this project and finish out this dataset and the other one I'm currently assigned to (see #215) so please bear with me a little longer! I'm curious to know what the timeline is like for getting to the 100% complete milestone, and I definitely want to help get us across the finish line with these :)

mcullan avatar Apr 19 '22 18:04 mcullan

Hey @mcullan thank you very much for your contribution. First of all, as you found out we are dealing with a very nasty dataset here, which will take some effort to get right. But we are here to help :)

I took a look at the paper and this is what I found out...

proteins.tar.gz

The files in proteins.tar.gz are indeed in XML format, and the recommended library to use in Python is the built-in xml module.

If you look at the file proteins/abstract1-1 you'll see the following:

<ArticleTitle><prot><prot>p38</prot> stress-activated protein kinase</prot> inhibitor reverses <prot><prot>bradykinin B(1)</prot> receptor</prot>-mediated component of inflammatory hyperalgesia.</ArticleTitle>
<AbstractText>The effects of a <prot><prot>p38</prot> stress-activated protein kinase</prot>...

These are PubMed entries, so in the bigbio schema you will have 2 passages, one of type "title" and one of type "abstract". According to the paper the text is annotated with proteins, which means the correct task for these files is Tasks.NAMED_ENTITY_RECOGNITION, and you will need to populate the entities field. And now the twist: this dataset contains nested annotations. This means that this line

<prot><prot>p38</prot> stress-activated protein kinase</prot>

needs to be parsed as:

        "entities": [
            {
                "id": <unique id>,
                "type": "protein", 
                "text": "p38",
                "offsets": [TODO],
            }
            {
                "id": <unique id>,
                "type": "protein", 
                "text": "p38 stress-activated protein kinase",
                "offsets": [TODO],
            }
        ],

Parsing in-text XML tags is tricky, and the only example I know of that does this is this one, but it does not have nested annotations.
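
For what it's worth, one possible way to handle the nested case is a single pass that strips the <prot> tags while keeping the open-tag positions on a stack, so each closing tag yields an entity whose offsets refer to the de-tagged text. This is only a sketch (the function name and output shape are my own, and it assumes the fragment contains nothing but <prot> tags), not the final dataloader code:

```python
import re

def extract_nested_entities(tagged):
    """Strip <prot> tags from `tagged`; return (plain_text, entities).

    Hypothetical helper: offsets are computed against the de-tagged
    text, so nested annotations naturally share their inner spans.
    Assumes the fragment contains only <prot>/</prot> tags.
    """
    text_parts = []
    pos = 0      # length of plain text emitted so far
    stack = []   # start offsets of currently open <prot> tags
    entities = []
    for m in re.finditer(r"</?prot>|[^<]+", tagged):
        token = m.group()
        if token == "<prot>":
            stack.append(pos)
        elif token == "</prot>":
            start = stack.pop()
            entities.append({
                "type": "protein",
                "text": "".join(text_parts)[start:pos],
                "offsets": [[start, pos]],
            })
        else:
            text_parts.append(token)
            pos += len(token)
    return "".join(text_parts), entities

text, ents = extract_nested_entities(
    "<prot><prot>p38</prot> stress-activated protein kinase</prot>"
)
# ents[0] covers "p38", ents[1] covers the full outer mention.
```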

interactions.tar.gz

The files in interactions.tar.gz are even nastier... According to the paper these are always PubMed abstracts (so title + text) and are annotated with both proteins and their relations. This means that the tasks will be [Tasks.NAMED_ENTITY_RECOGNITION, Tasks.RELATION_EXTRACTION]. The file interactions/abstract_for_8700872 looks like this:

TI - <p1  pair=1 >  <prot>  <prot>  Vascular endothelial growth factor </prot>  - related protein </prot>  </p1>  : a ligand and specific activator of the tyrosine kinase receptor <p2  pair=1 >  <prot>  Flt4 </prot>  </p2>  .
PG - 1988 - 92 AB - The tyrosine kinases <prot>  Flt4 </prot>  , <prot>  Flt1 </prot>  , and 

You will need to parse the entities as above and include as well relations like this:

        {
            "id": <unique id>,
            "type": "protein-protein relation/interaction",
            "arg1_id": <unique id of "Vascular endothelial growth factor - related protein">,
            "arg2_id": <unique id of "Flt4">,
            "normalized": [],
        }
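
A possible starting point for the pair markup: the <p1 pair=N>/<p2 pair=N> wrappers can be matched with a regex and grouped by their pair attribute, which gives you the two argument mentions of each relation. This is a hedged sketch against the example line above (variable names and the exact attribute spacing handled are my assumptions); in the real loader each mention would be resolved to its entity id to fill arg1_id/arg2_id:

```python
import re

# Shortened copy of the TI line from interactions/abstract_for_8700872.
line = ("TI - <p1  pair=1 >  <prot>  <prot>  Vascular endothelial growth "
        "factor </prot>  - related protein </prot>  </p1>  : ... "
        "<p2  pair=1 >  <prot>  Flt4 </prot>  </p2>  .")

# Group the <p1>/<p2> wrappers by their `pair` attribute.
pairs = {}
for m in re.finditer(r"<(p[12])\s+pair=(\d+)\s*>(.*?)</\1>", line):
    role, pair_id, inner = m.groups()
    # Strip the inner <prot> tags and normalize whitespace to recover
    # the surface mention of each relation argument.
    mention = " ".join(re.sub(r"</?prot>", "", inner).split())
    pairs.setdefault(pair_id, {})[role] = mention
```

After this pass, pairs["1"]["p1"] holds the long mention and pairs["1"]["p2"] holds "Flt4", i.e. one relation per pair id.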

I hope this is not too discouraging. Please ping us here if you need further help!

sg-wbi avatar Apr 21 '22 10:04 sg-wbi

Please do not forget to remove from the PR the requirements.txt file. #213 : for reference

sg-wbi avatar Apr 21 '22 10:04 sg-wbi

Regarding the "source" schema I would do keep it as close as possible to the bigbio_kb schema.

sg-wbi avatar Apr 21 '22 12:04 sg-wbi

@sg-wbi Thank you for the detailed, thoughtful feedback!

It also simplifies things that the interactions and proteins sets should be treated separately. Up to this point, I had been parsing them together into a combined NER and RE dataset. It sounds like the appropriate thing to do is to keep them separate, maybe as different splits? Or maybe as two entirely different datasets, e.g. aimed_proteins and aimed_interactions?

That's also very helpful info about them being PubMed entries.

Thanks again. Working on this today so should have an update later.

mcullan avatar Apr 25 '22 15:04 mcullan

I am glad I could help!

It sounds like the appropriate thing to do is to keep them separate, maybe separate them as different splits? Or maybe as two entirely different datasets? e.g. aimed_proteins and aimed_interactions ?

Well, I would still treat them as a single dataset, i.e. one dataloader script with 2 subset_ids, like you said: aimed_proteins and aimed_interactions.
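
With the two subset_ids, the BUILDER_CONFIGS list could then look roughly like this (a sketch following the hackathon template; the version variables and descriptions are placeholders, and the two interactions configs are analogous):

```python
BUILDER_CONFIGS = [
    BigBioConfig(
        name="aimed_proteins_source",
        version=SOURCE_VERSION,
        description="AIMed proteins subset, source schema",
        schema="source",
        subset_id="aimed_proteins",
    ),
    BigBioConfig(
        name="aimed_proteins_bigbio_kb",
        version=BIGBIO_VERSION,
        description="AIMed proteins subset, BigBio schema",
        schema="bigbio_kb",
        subset_id="aimed_proteins",
    ),
    # ...plus aimed_interactions_source and aimed_interactions_bigbio_kb.
]
```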

sg-wbi avatar Apr 26 '22 10:04 sg-wbi

@sg-wbi I think there's an issue with trying to set proteins and interactions as different splits/subsets.

The interactions subset can be used for RE and NER, but the proteins subset is only suitable for NER. I don't see any way to specify that one subset can be used for multiple tasks and the other can be used for only one task. Your thoughts?

I have a script that loads both in as separate splits, but the proteins split causes test_bigbio to fail because it has no relations.

Thus, it seems that, to pass the test, I would either need to get rid of the RE task for the interactions subset, even though that's what it's made for, or split the subsets into two separate datasets.

mcullan avatar Apr 26 '22 23:04 mcullan

@sg-wbi I just read through those two PRs that you referenced this one in, and this makes a lot more sense to me now! I thought the failure meant that the datasets package required them to have the same tasks.

I need to double check my loader script, though, because I think it missed some relations. After that, I think this should be fine to pass the new tests.

mcullan avatar Apr 27 '22 21:04 mcullan

@mcullan sorry for keeping you waiting. Would you still have time to merge this into a single dataset with multiple subset_ids? We should have a refined version of the tests soon merged into master ( #533 ).

This dataset will count as two since it was particularly nasty to parse :)

sg-wbi avatar May 02 '22 17:05 sg-wbi

@sg-wbi Sounds good :) I have it mostly done already, waiting in a new commit, so could you please ping me here when the tests are finalized and merged? Then I'll get it wrapped up.

mcullan avatar May 03 '22 19:05 mcullan

Hey @mcullan thank you for sticking with this dataset for so long! We merged the fix for the unit tests. So for the proteins split you should run something like

python -m tests.test_bigbio biodatasets/aimed/aimed.py --config_name aimed_proteins_bigbio_kb --bypass_keys relations

sg-wbi avatar May 06 '22 14:05 sg-wbi

@mcullan I understand that it has been a while since the hackathon, but when you have time could you please push the commit you mentioned? It does not matter if it's not 100% ready, I can take it from there. This would still count as a dataset you did.

sg-wbi avatar May 12 '22 07:05 sg-wbi