Create a dataset loader for CRAFT

Open hakunanatasha opened this issue 3 years ago • 12 comments

Colorado Richly Annotated Full-Text (CRAFT) Corpus

https://github.com/UCDenver-ccp/CRAFT

hakunanatasha avatar Jan 21 '22 22:01 hakunanatasha

#self-assign

uzaymacar avatar Apr 04 '22 22:04 uzaymacar

Hi @uzaymacar, can you let us know if you are still working on this so we can update our project board? Please let us know the status by Friday, April 8. No worries if you are not finished but intend to keep working on this. You can either ping me here (@hakunanatasha) or ping the Discord admins (with @admins).

hakunanatasha avatar Apr 06 '22 16:04 hakunanatasha

Hey @hakunanatasha, yes I am still working on this! I am planning to follow up with a PR by mid-next week.

uzaymacar avatar Apr 08 '22 18:04 uzaymacar

@uzaymacar awesome! Feel free to ping me here, via your PR, or on the discord for help! I'm looking forward to your submission :cherry_blossom:

hakunanatasha avatar Apr 09 '22 21:04 hakunanatasha

#self-assign

davidkartchner avatar May 27 '22 15:05 davidkartchner

#self-assign

shamikbose avatar Jun 02 '22 21:06 shamikbose

@jason-fries There are multiple versions of this dataset. I'm using 5.0.0, which is the latest one.

shamikbose avatar Jun 02 '22 21:06 shamikbose

SGTM -- just make certain the versioning is reflected in the data loader metadata.

jason-fries avatar Jun 02 '22 21:06 jason-fries

Hi @jason-fries @galtay @ruisi-su, I think I'm starting to understand the CRAFT dataset. I have a few questions:

  1. From what I can understand, this dataset supports Tasks.COREF and Tasks.NER. Please let me know if there are other tasks it supports.
  2. Corefs are somewhat tricky. The same span can carry multiple annotations. How should that be handled? Here's an example:
        <annotation annotator="Annotator" id="1" type="identity">
            <class id="IDENTITY chain" label="IDENTITY chain"/>
            <span end="71" id="11532192-2" start="65">strain</span>
        </annotation>
        <annotation annotator="CCP Colorado Computational Pharmacology, UC Denver" id="11532192SHM_Instance_150000" type="identity">
            <class id="Noun Phrase" label="Noun Phrase"/>
            <span end="71" id="11532192-3" start="65">strain</span>
        </annotation>
  3. The NER seems to be pretty straightforward, but just to clarify, the covered types are as follows:

    • CHEBI
    • CL
    • GO_BP
    • GO_CC
    • GO_MF
    • MONDO
    • MOP
    • NCBITaxon
    • PR
    • SO
    • UBERON
  4. There are also structural annotations, but I'm not sure which task they would map to in the bigbio schema. Do they need to be implemented?
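
One possible way to handle the duplicate coref annotations in the example above is to group annotations by their span offsets, so each mention appears once with all of its class labels attached. This is only a minimal sketch: the `<annotations>` root element and the `group_by_span` helper are hypothetical, and the real CRAFT files may be laid out differently.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The two annotations from the example, wrapped in a hypothetical
# <annotations> root so the snippet parses as a single XML document.
SAMPLE = """<annotations>
    <annotation annotator="Annotator" id="1" type="identity">
        <class id="IDENTITY chain" label="IDENTITY chain"/>
        <span end="71" id="11532192-2" start="65">strain</span>
    </annotation>
    <annotation annotator="CCP Colorado Computational Pharmacology, UC Denver" id="11532192SHM_Instance_150000" type="identity">
        <class id="Noun Phrase" label="Noun Phrase"/>
        <span end="71" id="11532192-3" start="65">strain</span>
    </annotation>
</annotations>"""


def group_by_span(xml_text):
    """Map (start, end, text) to the class labels annotated on that span."""
    grouped = defaultdict(list)
    root = ET.fromstring(xml_text)
    for annotation in root.iter("annotation"):
        label = annotation.find("class").get("label")
        for span in annotation.iter("span"):
            key = (int(span.get("start")), int(span.get("end")), span.text)
            grouped[key].append(label)
    return dict(grouped)


mentions = group_by_span(SAMPLE)
for (start, end, text), labels in mentions.items():
    print(start, end, text, labels)
# prints: 65 71 strain ['IDENTITY chain', 'Noun Phrase']
```

Whether the merged mention should keep both labels or only the "IDENTITY chain" one is still an open question for the schema.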

shamikbose avatar Jun 03 '22 22:06 shamikbose

@ruisi-su This is implemented as a local dataset in #681 since download_and_extract() doesn't seem to work properly with the archive containing the dataset

shamikbose avatar Jun 07 '22 19:06 shamikbose