
Running inference on a fine-tuned BioBERT model for diseases

Open geethaRam opened this issue 5 years ago • 8 comments

I was able to train and fine-tune the BioBERT model for an NER task, and validation also worked. Now I'm looking to use this fine-tuned model to run batch/real-time inference.

run_ner.py always expects to read its prediction input from test.tsv.

Can you provide additional instructions for converting an input sequence (e.g., sequence="This is a sample test to test diseases and disorders") into the test.tsv format? Or share a script that tokenizes and formats input sequences into the format run_ner.py expects for prediction?

Also, another clarification: the label_list specified here, https://github.com/dmis-lab/biobert/blob/master/run_ner.py#L192, is ["[PAD]", "B", "I", "O", "X", "[CLS]", "[SEP]"], so num_labels is 7, correct? I'm asking because when I load BioBERT's fine-tuned model into Hugging Face's AutoModelForTokenClassification, it only predicts binary labels. I've had no luck loading the model into the transformers library as a pretrained model.

Any help is appreciated.
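For anyone with the same question: the preprocessed NER datasets in the repo use a CoNLL-style layout, one token and its label per line separated by a tab, with a blank line between sentences. Below is a minimal sketch (not an official script; check the exact separator and label column against the dataset files) that writes a raw sentence into that layout with dummy "O" labels, since gold labels are unknown at prediction time:

```python
# Minimal sketch (not from the repo): write a raw sentence in the
# CoNLL-style "token<TAB>label" layout that run_ner.py reads from
# test.tsv, using a dummy "O" label for every token.
def sentence_to_tsv(sentence, path="test.tsv"):
    with open(path, "w", encoding="utf-8") as f:
        for token in sentence.split():
            f.write(f"{token}\tO\n")
        f.write("\n")  # blank line marks the end of a sentence

sentence_to_tsv("This is a sample test to test diseases and disorders")
```

Note this only splits on whitespace; WordPiece tokenization and label alignment happen inside run_ner.py itself.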

geethaRam avatar Apr 20 '20 01:04 geethaRam

Same initial question: how do I run NER on raw sentences? Which preprocessing scripts should be used to get my data into a format similar to test.tsv?

gserb-datascientist avatar Jul 20 '20 19:07 gserb-datascientist

I'm also facing the same issue. Any pointers on this? Did you find a workaround, @geethaRam?

soycaporal avatar Sep 17 '20 06:09 soycaporal

@jhyuklee @wonjininfo Any help? This should be better documented; it is very difficult to figure out how to use BioBERT for anything other than benchmarking on the datasets you provided.

Rkubinski avatar Sep 23 '20 17:09 Rkubinski

@Rkubinski @gserb-datascientist

What I did was adapt the run_ner.py script for model inference. I had to work around the TFRecordDataset-based input parameters and use raw tensor slices instead.

soycaporal avatar Sep 23 '20 18:09 soycaporal

@wenshutang I think our question is exactly about how to generate these tokenized raw tensor slices in a way that works for run_ner.py. How did you tokenize your text?

Rkubinski avatar Sep 25 '20 13:09 Rkubinski

@Rkubinski We've been working on the PyTorch version of BioBERT, which should be easier to modify for your datasets. You can see it at https://github.com/dmis-lab/biobert-pytorch. Thanks.

jhyuklee avatar Oct 14 '20 10:10 jhyuklee

@Rkubinski Sorry about the delayed reply, in case you're still wondering: I modified the input function builder to create tensor slices for inference.

BioBERT-PyTorch looks great 👍, super helpful.

import tensorflow as tf


def input_fn_builder(features, seq_length):
    """Creates an `input_fn` closure to be passed to TPUEstimator."""
    # Collect each feature field into a parallel list so the whole
    # set can be converted into tensor slices below.
    all_input_ids = []
    all_input_mask = []
    all_segment_ids = []
    all_label_ids = []

    for feature in features:
        all_input_ids.append(feature.input_ids)
        all_input_mask.append(feature.input_mask)
        all_segment_ids.append(feature.segment_ids)
        all_label_ids.append(feature.label_ids)

    def input_fn(params):
        batch_size = params["batch_size"]
        num_examples = len(features)

        # Build an in-memory Dataset from tensor slices instead of the
        # TFRecord-based pipeline used in the original run_ner.py.
        d = tf.data.Dataset.from_tensor_slices(
            {
                # "unique_ids":
                #     tf.constant(all_unique_ids, shape=[num_examples], dtype=tf.int32),
                "input_ids": tf.constant(all_input_ids, shape=[num_examples, seq_length], dtype=tf.int64),
                "input_mask": tf.constant(all_input_mask, shape=[num_examples, seq_length], dtype=tf.int64),
                "segment_ids": tf.constant(all_segment_ids, shape=[num_examples, seq_length], dtype=tf.int64),
                "label_ids": tf.constant(all_label_ids, shape=[num_examples, seq_length], dtype=tf.int64),
            }
        )
        d = d.batch(batch_size=batch_size, drop_remainder=False)
        return d

    return input_fn
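On the tokenization side of @Rkubinski's question: run_ner.py follows the standard BERT convention of assigning a word's label to its first WordPiece and the placeholder label "X" to the continuation pieces (which is why "X" appears in the label_list). Here is a rough, self-contained illustration of that alignment; toy_wordpiece is a toy greedy longest-match splitter standing in for the real WordPiece tokenizer and vocabulary, not code from the repo:

```python
# Illustrative sketch of BERT-style sub-token label alignment.
# toy_wordpiece is a stand-in for the real WordPiece tokenizer.
def toy_wordpiece(word, vocab):
    """Greedy longest-match split over a toy vocabulary."""
    pieces, rest = [], word
    while rest:
        for end in range(len(rest), 0, -1):
            piece = ("##" if pieces else "") + rest[:end]
            if piece in vocab:
                pieces.append(piece)
                rest = rest[end:]
                break
        else:
            return ["[UNK]"]  # no piece matched
    return pieces

def align_labels(words, labels, vocab):
    """First sub-token keeps the word's label; continuations get "X"."""
    tokens, aligned = [], []
    for word, label in zip(words, labels):
        pieces = toy_wordpiece(word, vocab)
        tokens.extend(pieces)
        aligned.extend([label] + ["X"] * (len(pieces) - 1))
    return tokens, aligned

vocab = {"dis", "##ease", "##s", "and", "disorders"}
tokens, labels = align_labels(["diseases", "and", "disorders"],
                              ["B", "O", "O"], vocab)
# tokens -> ["dis", "##ease", "##s", "and", "disorders"]
# labels -> ["B", "X", "X", "O", "O"]
```

In the real pipeline you would use BioBERT's own vocab file and FullTokenizer, then pad everything to seq_length and feed it to the input_fn_builder above.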

soycaporal avatar Oct 15 '20 06:10 soycaporal

@jhyuklee @wenshutang Thank you guys, I appreciate it!

Rkubinski avatar Oct 18 '20 23:10 Rkubinski