biobert-pytorch

Predictions in raw data

Open GuillermoJaca opened this issue 5 years ago • 6 comments

Hello, I am wondering how to run predictions on raw data. This is not documented at all, and I think it is the primary use of the model.

GuillermoJaca • Dec 01 '20 13:12

Hi @GuillermoJaca, what do you mean by raw data? I think the pre-processing will depend on the type of task you want to run.

jhyuklee • Dec 03 '20 02:12

I mean ordinary biomedical text. The issue is that there is no .predict function, so the file run_ner.py has to be customized. What is the best way to do that? Which preprocessing should I use to get the best possible performance from the model, given that my task is NER?

GuillermoJaca • Dec 03 '20 06:12

Instructions for using the repo for inference are in the README under the NER section: https://github.com/dmis-lab/biobert#user-content-named-entity-recognition-ner:~:text=You%20can%20change%20the%20arguments%20as,using%20%2D%2Ddo_train%3Dfalse%20%2D%2Ddo_predict%3Dtrue%20for%20evaluating%20test.tsv.

The bigger challenge is running inference without the repo, i.e., without repo-specific functions and methods.

mgavish • Dec 11 '20 19:12
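
For the `--do_predict` route mentioned above, raw text first has to be turned into the token-per-line layout that NER test files typically use. The following is a minimal sketch of that step, not code from the repo. It assumes one token and one label per line, a blank line between sentences, and a dummy `O` label for unlabeled text. The separator, the naive regex tokenization, and the function name `raw_text_to_token_lines` are all assumptions; compare against the dataset files shipped with the repo before relying on it.

```python
import re


def raw_text_to_token_lines(text, sep="\t", dummy_label="O"):
    """Convert raw text into token-per-line format (hypothetical helper).

    Each token is written as "<token><sep><dummy_label>", and an empty
    line marks the end of every sentence.
    """
    lines = []
    # Naive sentence split: break after ., ! or ? when followed by whitespace
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        # Words and individual punctuation marks become separate tokens
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        lines.extend(f"{token}{sep}{dummy_label}" for token in tokens)
        lines.append("")  # blank line closes the sentence
    return "\n".join(lines)


example = raw_text_to_token_lines("BRCA1 mutations cause breast cancer. He is sick.")
print(example)
```

The dummy labels only satisfy the file format; they carry no information and should be ignored when reading the predictions back.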

@GuillermoJaca for prediction you can use your fine-tuned model directly in a Hugging Face transformers pipeline. Some sample code for your reference:

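from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
# Replace with the path to your own fine-tuned NER model
model_path = "finetune_model_path"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)
# grouped_entities and ignore_subwords are the options available when this was posted;
# later transformers releases deprecate them in favor of aggregation_strategy
nlp = pipeline(task="ner", model=model, tokenizer=tokenizer, grouped_entities=True, ignore_subwords=True)
text = "he is feeling very sick"
# output is a list of dicts, one per grouped entity
output = nlp(text)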

Read more about the Hugging Face pipeline here: https://huggingface.co/transformers/main_classes/pipelines.html

abhibisht89 • Jan 06 '21 11:01
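
To make the entity grouping in the pipeline call above less of a black box, here is a rough sketch of the idea: consecutive `B` and `I` tagged tokens are merged into one span. This is not the library's implementation, only an illustration under the assumption that the model emits plain `B`/`I`/`O` tags. The function name `group_bio_entities` and the example tokens and tags are made up.

```python
def group_bio_entities(tokens, tags):
    """Merge consecutive B/I tagged tokens into entity strings (illustrative)."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new entity begins; close the one in progress, if any
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the entity in progress
            current.append(token)
        else:
            # "O", or a stray "I" with no entity in progress
            if current:
                entities.append(" ".join(current))
            current = [token] if tag == "I" else []
    if current:
        entities.append(" ".join(current))
    return entities


# Hypothetical tokens and tags, not real model output
tokens = ["he", "has", "breast", "cancer", "and", "asthma"]
tags = ["O", "O", "B", "I", "O", "B"]
print(group_bio_entities(tokens, tags))
```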

@abhibisht89 Thank you for your reply.

However, if the tokenizer is specified as 'dmis-lab/biobert-v1.1', the ignore_subwords option cannot be set to True.

Is there any other way?

nowhyun • Jan 11 '21 07:01
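
One possible workaround when `ignore_subwords=True` is not available is to run the pipeline without it and merge the WordPiece fragments afterwards. The sketch below assumes BERT-style WordPiece output, where continuation pieces start with `##`, and keeps the label of the first piece of each word. The function name `merge_wordpieces` and the example pieces are hypothetical; the actual split depends on the tokenizer vocabulary.

```python
def merge_wordpieces(pieces, labels):
    """Join '##' continuation pieces onto the preceding word.

    Each merged word keeps the label predicted for its first piece.
    Special tokens such as [CLS] and [SEP] should be dropped beforehand.
    """
    words, word_labels = [], []
    for piece, label in zip(pieces, labels):
        if piece.startswith("##") and words:
            # Continuation piece: append it to the previous word, drop its label
            words[-1] += piece[2:]
        else:
            words.append(piece)
            word_labels.append(label)
    return list(zip(words, word_labels))


# Hypothetical WordPiece split and per-piece predictions
pieces = ["he", "has", "neuro", "##blast", "##oma"]
labels = ["O", "O", "B", "I", "I"]
print(merge_wordpieces(pieces, labels))
```

Taking the first piece's label is only one convention; a majority vote or the highest-scoring piece would be equally reasonable choices.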

Hello, I wonder why the labels in the NER task are plain BIO tags, whereas in the raw dataset (e.g. NCBI) the labels can be SpecificDisease, Modifier, and so on.

cutejue • Mar 29 '21 02:03
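
As an illustration of how typed annotations can end up as plain BIO tags, the sketch below collapses category-labeled spans into a single entity type. This is one plausible conversion, not a confirmed description of how the repo's datasets were prepared. The function name `spans_to_bio`, the span format (token indices, end exclusive), and the categories assigned in the example are assumptions made for illustration.

```python
def spans_to_bio(num_tokens, spans):
    """Collapse typed spans into plain B/I/O tags, discarding the category.

    spans: iterable of (start, end, category) token index ranges,
    with end exclusive.
    """
    tags = ["O"] * num_tokens
    for start, end, _category in spans:
        tags[start] = "B"  # first token of the mention
        for i in range(start + 1, end):
            tags[i] = "I"  # remaining tokens of the mention
    return tags


# Made-up example: the category names come from the question above,
# but their assignment to these mentions is purely illustrative
tokens = ["colorectal", "cancer", "and", "adenomatous", "polyposis"]
spans = [(0, 2, "SpecificDisease"), (3, 5, "Modifier")]
print(list(zip(tokens, spans_to_bio(len(tokens), spans))))
```

Under this kind of conversion the model only learns where a disease mention starts and ends, not which category it belongs to.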