biobert

For custom RE dataset with entities marked in advance

Open KennyNg-19 opened this issue 3 years ago • 1 comment

Hi, as a beginner, I would like to ask some basic questions about fine-tuning on a custom RE dataset with entities marked in advance:

  1. Do we need to constrain which entity markers or dummy words are used for BioNLP when marking the entities (e.g. @DISEASE$, [e] some disease [/e])?

  2. When preprocessing, do we need to add some code to help the model tokenize the entity markers? E.g. if we use [E1] to mark an entity, should we let the tokenizer know about it:

tokenizer.add_tokens(['[E1]', '[/E1]', '[E2]', '[/E2]', '[BLANK]'])
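For illustration, the two marking styles mentioned in the question (a dummy word such as @DISEASE$ versus bracket markers such as [E1] ... [/E1]) could be produced by a preprocessing step along the following lines. This is only a minimal sketch, not code from the BioBERT repository: the function names `mask_entities` and `wrap_entities`, the example sentence, and the assumption that entity positions are given as non-overlapping character offsets are all hypothetical.

```python
from typing import Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def mask_entities(text: str, e1: Span, e2: Span,
                  mask1: str = "@GENE$", mask2: str = "@DISEASE$") -> str:
    """Replace two non-overlapping entity spans with dummy words."""
    # Edit the later span first so the earlier span's offsets stay valid.
    spans = sorted([(e1, mask1), (e2, mask2)],
                   key=lambda item: item[0][0], reverse=True)
    for (start, end), mask in spans:
        text = text[:start] + mask + text[end:]
    return text


def wrap_entities(text: str, e1: Span, e2: Span,
                  tags1: Tuple[str, str] = ("[E1]", "[/E1]"),
                  tags2: Tuple[str, str] = ("[E2]", "[/E2]")) -> str:
    """Surround two non-overlapping entity spans with marker tokens."""
    spans = sorted([(e1, tags1), (e2, tags2)],
                   key=lambda item: item[0][0], reverse=True)
    for (start, end), (open_tag, close_tag) in spans:
        marked = f"{open_tag} {text[start:end]} {close_tag}"
        text = text[:start] + marked + text[end:]
    return text


# Hypothetical example sentence; offsets are computed rather than hard-coded.
sentence = "BRCA1 mutations are linked to breast cancer."
gene = (sentence.index("BRCA1"), sentence.index("BRCA1") + len("BRCA1"))
disease = (sentence.index("breast cancer"),
           sentence.index("breast cancer") + len("breast cancer"))

masked = mask_entities(sentence, gene, disease)
wrapped = wrap_entities(sentence, gene, disease)
print(masked)
print(wrapped)
```

With the dummy-word style the entity text itself is hidden from the model, whereas the bracket style keeps it visible. In the bracket case the marker strings are new vocabulary items, which is why the question asks about registering them with the tokenizer.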

Hi Chloe,

Yes, you need to provide task_name. If your dataset is a binary classification task, you can use either of them: euadr and gad are processed in the same way (using BioBERTProcessor). https://github.com/dmis-lab/biobert/blob/37599fb978e3b584a6e9aa9abca1f38588bfff4f/run_re.py#L914-L917

Please note, however, that the chemprot dataset is a multi-class classification task. Hence it is processed differently, and the evaluation script handles it differently as well.
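To make the binary versus multi-class distinction concrete, here is a rough sketch of how scoring might differ between the two settings. It is not the repository's evaluation script. It assumes F1 on the positive class for the binary case and micro-averaged F1 over all non-negative classes for the multi-class case, which is a common convention; the negative label name `"false"` and the helper names are hypothetical, so please check the actual script for the exact definition.

```python
from typing import Sequence


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts, returning 0.0 when it is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1) of a binary task."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    return _f1(tp, fp, fn)


def micro_f1(gold: Sequence[str], pred: Sequence[str],
             negative: str = "false") -> float:
    """Micro-averaged F1 over every class except the negative one."""
    pairs = list(zip(gold, pred))
    # A hit counts only when a non-negative class is predicted correctly.
    tp = sum(1 for g, p in pairs if g == p and g != negative)
    # Any wrong non-negative prediction is a false positive.
    fp = sum(1 for g, p in pairs if p != negative and g != p)
    # Any missed non-negative gold label is a false negative.
    fn = sum(1 for g, p in pairs if g != negative and g != p)
    return _f1(tp, fp, fn)


# Toy labels for illustration only.
binary_score = binary_f1([1, 0, 1, 1], [1, 1, 0, 1])
multi_score = micro_f1(["CPR:3", "false", "CPR:4", "CPR:9"],
                       ["CPR:3", "CPR:4", "false", "CPR:4"])
print(binary_score, multi_score)
```

Note that a prediction of the wrong non-negative class counts as both a false positive and a false negative under this micro-averaging scheme, which is one reason a binary scorer cannot simply be reused for a multi-class dataset.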
Thank you for your interest in our work! Best, WonJin

KennyNg-19, Aug 17 '21 12:08