biobert-pytorch: Predictions on raw data

Hello, I am wondering how to run predictions on raw data. This is not documented at all, and I think it is the primary use of the model.

Dec 01 '20 13:12 GuillermoJaca

Hi @GuillermoJaca, what do you mean by raw data? I think the pre-processing will depend on the type of task you want to perform.

Dec 03 '20 02:12 jhyuklee

I mean ordinary biomedical text. The issue is that there is no .predict function, so run_ner.py has to be customized. What is the best way to do that? And since my task is NER, which preprocessing should I use to get the best possible performance from the model?

Dec 03 '20 06:12 GuillermoJaca

Instructions for using the repo for inference are in the README under the NER section: https://github.com/dmis-lab/biobert#user-content-named-entity-recognition-ner:~:text=You%20can%20change%20the%20arguments%20as,using%20%2D%2Ddo_train%3Dfalse%20%2D%2Ddo_predict%3Dtrue%20for%20evaluating%20test.tsv.

The bigger challenge is running inference without the repo, i.e., without repo-specific functions and methods.

Dec 11 '20 19:12 mgavish

@GuillermoJaca for prediction, you can use your fine-tuned model directly in the Hugging Face transformers pipeline. Some sample code for your reference:

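from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Placeholder: replace with the directory that holds your fine-tuned checkpoint.
model_path = "finetuned_model_path"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)

# grouped_entities groups tokens of the same entity; ignore_subwords attaches word pieces to their word.
# (Newer transformers releases use aggregation_strategy instead of these two flags.)
nlp = pipeline(task="ner", model=model, tokenizer=tokenizer, grouped_entities=True, ignore_subwords=True)

text = "he is feeling very sick"
output = nlp(text)
print(output)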

Read more about the Hugging Face pipeline here: https://huggingface.co/transformers/main_classes/pipelines.html

Jan 06 '21 11:01 abhibisht89

@abhibisht89 Thank you for your reply.

However, if the tokenizer is specified as 'dmis-lab/biobert-v1.1', the ignore_subwords option cannot be set to True.

Is there any other way?

Jan 11 '21 07:01 nowhyun
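
One possible workaround for the ignore_subwords limitation, offered as a sketch rather than a confirmed fix: take per-token predictions and merge the WordPiece pieces by hand. The code below assumes that continuation pieces start with "##", that each word takes the label of its first piece, and that labels are plain B/I/O. The helper names merge_wordpieces and group_entities are hypothetical, and the token split in the usage note is made up for illustration, not actual tokenizer output.

```python
from typing import List, Tuple


def merge_wordpieces(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Join '##' continuation pieces onto the preceding word.

    Assumption: the word keeps the label predicted for its first piece.
    """
    words: List[Tuple[str, str]] = []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


def group_entities(words: List[Tuple[str, str]]) -> List[str]:
    """Collect each B word and the I words that follow it into one entity string."""
    entities: List[str] = []
    current: List[str] = []
    for word, label in words:
        if label == "B":
            if current:
                entities.append(" ".join(current))
            current = [word]
        elif label == "I" and current:
            current.append(word)
        else:
            # An O label (or a stray I with no open entity) closes the current entity.
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities
```

For example, with the made-up pieces ["cyst", "##ic", "fi", "##bro", "##sis"] labelled ["B", "I", "I", "I", "I"], merge_wordpieces gives [("cystic", "B"), ("fibrosis", "I")], and group_entities then returns ["cystic fibrosis"].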

Hello, I wonder why the labels in the NER task are plain BIO, whereas in the raw dataset (e.g. NCBI) the labels can be SpecificDisease, Modifier, and so on.

Mar 29 '21 02:03 cutejue
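
To make the question concrete, here is a sketch of how typed mention spans could collapse into plain BIO tags. It assumes, without confirmation from the repo, that the preprocessed data treats every mention as a single entity type and drops the category. The function name spans_to_bio and the (start, end, category) span format are hypothetical.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert typed spans to plain B/I/O tags.

    Each span is (start_token, end_token_exclusive, category).
    Assumption: the category (e.g. SpecificDisease, Modifier) is discarded.
    """
    tags = ["O"] * len(tokens)
    for start, end, _category in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


tokens = ["colorectal", "cancer", "families"]
spans = [(0, 2, "Modifier")]
print(spans_to_bio(tokens, spans))
```

Under this assumption a SpecificDisease span and a Modifier span produce identical tags, so only the mention boundaries survive in the training labels.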