Questionable training labels in GAD RE dataset

Open amirkdv opened this issue 3 years ago • 2 comments

Current State

Looking at the GAD RE data as used in the BioBert paper and linked to in this repo, I can't find any way of making sense of it. Specifically, what are labels 1 and 0 supposed to mean? The options I've considered are:

- one of them (presumably 1) means the sentence supports a relation between the identified gene and disease, the other the opposite.
- one of them (presumably 1) means the sentence supports a positive relation, the other the opposite (i.e. negative relation or no relation).

Neither of these lines up with the data I see.

Here are a few examples from the first 2 cross-validation splits (i.e. subfolders 1 and 2). These are not cherry-picked; my only bias was looking for short sentences that are easy to wrap my head around :)

label   sentence
0       No evidence was obtained that the studied polymorphism
        in @GENE$ is a determinant of the coumarin-associated @DISEASE$.
0       There is no allelic association between @DISEASE$ and @GENE$
        gene polymorphisms.
1       C1014T SNP of @GENE$ does not appear to be associated with
        susceptibility to @DISEASE$ in Japanese patients.
1       These data strongly suggest that @GENE$ is not a significant
        susceptibility allele for @DISEASE$.
0       our study does not support the notion that @GENE$ and HTR1A gene
        variants are major contributors to @DISEASE$-, anger-, or
        aggression-related behaviors in our sample.
1       it is unlikely that common variants in MLH1, MLH3, @GENE$, MSH2,
        MSH3 and MSH6 contribute significantly to @DISEASE$ susceptibility.
0       The ESRRA23 and Pro116Pro variants of the gene encoding @GENE$ are
        not associated with @DISEASE$, type 2 diabetes or related
        quantitative traits in the examined Danish whites.
-----
0       Abnormal @GENE$ gene copy numbers are a genetic risk factor in @DISEASE$.
1       Presence of the @GENE$ gene promoter polymorphisms was found to
        be a negative prognostic parameter in patients with @DISEASE$.
0       We conclude that  the @GENE$ gene Bst U I polymorphism is a
        suitable genetic marker of @DISEASE$.
0       We identified a polymorphism in the @GENE$ gene associated with @DISEASE$.
0       We conclude that  @GENE$ is associated with both the development
        of @DISEASE$ and ABO incompatibility.
1       The @GENE$ gene is likely to be involved in the genetic
        vulnerability for @DISEASE$.
0       The @GENE$ Asp allele may be a genetic risk factor for @DISEASE$,
        and might influence the course of Alzheimer disease, even though
        effects vary in different studies.
1       We conclude that  the @GENE$ gene may be a susceptibility gene for @DISEASE$.
1       In conclusion, we have replicated the association of the @GENE$ P2
        promoter haplotype with @DISEASE$ in a U.K. Caucasian population
        where there is no evidence of linkage to 20q.
0       Polymorphisms related to a functional decrease in ligand binding
        activity of @GENE$ are associated with @DISEASE$ in U.
1       Variants of the ADRB2, ADRA1d and @GENE$ genes may be related to a
        predisposition to @DISEASE$.
---
0  (*)  The presence of the @GENE$ Met66 allele does not contribute to the
        decreased level of @DISEASE$(67) mRNA expression in the prefrontal
        cortex of subjects with schizophrenia.

Notes:

- The first group are negative associations (i.e. when I read them as a human, I conclude negative association).
- The second group are positive associations. As you can see, there seems to be no rhyme or reason to the 0s and 1s.
- The last row, marked (*), is the closest I've found to an example of neither a positive nor a negative association (just co-occurrence, probably caused by NER misclassification).
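
For anyone who wants to pull similar examples out of a split, here is a minimal sketch of the kind of filter I mean. It assumes each data line is `<sentence>\t<label>`; that column layout, the `short_examples` name, and the `max_words` cutoff are my assumptions for illustration, not something documented in this issue.

```python
import csv
from typing import Iterable, List, Tuple


def short_examples(lines: Iterable[str], max_words: int = 25) -> List[Tuple[str, str]]:
    """Return (label, sentence) pairs whose sentence has at most max_words words.

    Assumes every data line is "<sentence>\t<label>" (an assumed layout).
    """
    picked = []
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Skip header lines or anything that does not look like a data row.
        if len(row) != 2 or row[1] not in {"0", "1"}:
            continue
        sentence, label = row
        if len(sentence.split()) <= max_words:
            picked.append((label, sentence))
    return picked


# Two rows taken from the table above, written in the assumed layout.
sample = [
    "We identified a polymorphism in the @GENE$ gene associated with @DISEASE$.\t0",
    "The @GENE$ gene is likely to be involved in the genetic vulnerability for @DISEASE$.\t1",
]

for label, sentence in short_examples(sample):
    print(label, sentence)
```

Passing an open file handle instead of `sample` should work the same way, as long as the file really has that layout.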

Desired State

Either:

- all samples in the first half have the same label and all samples in the second half have the opposite label.
- all samples with the exception of (*) have the same label, and the (*) sample has the opposite label.
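
To make the two desired outcomes concrete, here is a small sketch that checks them against the labels in the example table (copied in order). The function names are made up for illustration.

```python
from typing import Sequence

# Labels copied from the example table, in order of appearance.
negative_group = [0, 0, 1, 1, 0, 1, 0]              # sentences I read as negative
positive_group = [0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1]  # sentences I read as positive
starred = 0                                         # the (*) co-occurrence row


def uniform(labels: Sequence[int]) -> bool:
    """True if every label in the sequence is identical."""
    return len(set(labels)) == 1


def matches_first_option(neg: Sequence[int], pos: Sequence[int]) -> bool:
    """Each group is internally consistent and the two groups disagree."""
    return uniform(neg) and uniform(pos) and neg[0] != pos[0]


def matches_second_option(neg: Sequence[int], pos: Sequence[int], star: int) -> bool:
    """All non-starred rows share one label and the starred row has the other."""
    combined = list(neg) + list(pos)
    return uniform(combined) and combined[0] != star


print(matches_first_option(negative_group, positive_group))
print(matches_second_option(negative_group, positive_group, starred))
```

Both checks fail on this sample, since each group already contains a mix of 0s and 1s.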

Where did the GAD RE dataset come from?

The GAD RE dataset has taken on a life of its own (e.g. BLURB also uses it, and defers to BioBert for details), yet it's unclear to me what its origins are. It would be valuable to the community if the maintainers/authors (@wonjininfo ?) could elaborate on how this dataset came to be or at least confirm my understanding described below.

Here's what I could gather about the genealogy of the GAD RE dataset:

- Becker et al. (2004) presented GAD (I'll call this NIH GAD), a semi-automatically curated repository of associations (positive and negative -- i.e. evidence of lack of association) between genes and human diseases. At the time of publication (2004), it contained >5,000 data points. The original GAD did not have supporting text, just PubMed IDs.

  The original NIH GAD was retired in 2014. There is a zip file dump of the latest state of the dataset, and a cursory look confirms my understanding that NIH GAD did not have supporting text, only PubMed IDs.
- Bravo et al. (2015) presented BeFree for biomedical RE. They also presented a further-processed version of GAD (I'll call this BeFree GAD) to evaluate their model. They did some non-trivial work on the original NIH GAD data to produce a corpus of sentences with true/false labels (unlike NIH GAD's positive/negative labels) using this logic:
  - A (sentence, gene, disease) tuple is true if (roughly) NIH GAD contains a positive or negative assertion about that (gene, disease) pair citing the PubMed article containing that sentence.
  - The same tuple is false if, despite the gene and disease appearing in the sentence, NIH GAD does not note any positive/negative association between them citing that article.

  The original link for the BeFree GAD corpus is dead (but fwiw, the data presumably exists, in a new shape and form, in the DisGeNet project).
- Lee, Yoon, et al. (2019) (i.e. BioBert) cite Bravo et al. (2015) for their usage of GAD without any further details, presumably because all they did was divide it up for 10-fold cross-validation. And that's what is available at the link in the README of this repo.
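
To spell out my reading of the true/false rule, here is a rough sketch. It is only my understanding of the logic, not Bravo et al.'s actual pipeline: the representation of NIH GAD as a set of (PubMed ID, gene, disease) triples, the `befree_label` name, and all identifiers below are placeholders I made up.

```python
from typing import Set, Tuple

# Hypothetical representation: one triple per curated NIH GAD assertion,
# regardless of whether the assertion is positive or negative.
GadIndex = Set[Tuple[str, str, str]]


def befree_label(pmid: str, gene: str, disease: str, gad: GadIndex) -> bool:
    """True if NIH GAD records any assertion (positive or negative) about this
    (gene, disease) pair citing this article; False otherwise."""
    return (pmid, gene, disease) in gad


# Placeholder identifiers, not real records.
gad = {("PMID_A", "GENE_X", "DISEASE_Y")}

# A sentence from a curated article is labeled true even if it says
# "GENE_X is not associated with DISEASE_Y".
print(befree_label("PMID_A", "GENE_X", "DISEASE_Y", gad))

# A sentence from an article the curators never covered is labeled false,
# even if it clearly states an association.
print(befree_label("PMID_B", "GENE_X", "DISEASE_Y", gad))
```

Under this reading the label never looks at what the sentence says, only at whether the curators recorded the pair for that article.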

A Potential Issue

If all this is correct, the main issue I see is the definition of true/false as set by Bravo et al.: the absence of an entry from NIH GAD is probably a poor proxy for whether a sentence supports an association for the purposes of RE.

The curators of NIH GAD could not have conceivably looked at all the extra "false" articles included in BeFree GAD. Additionally, NIH GAD was originally curated in 2004, and only updated until 2014. By its nature, NIH GAD probably has nothing to say about the majority of articles for most given (gene, disease) associations.

Dec 30 '20 23:12 amirkdv
