
BioGPT is now available in 🤗 Transformers

NielsRogge opened this issue 2 years ago • 21 comments

BioGPT is now available for usage in 🤗 Transformers!

Docs: https://huggingface.co/docs/transformers/main/en/model_doc/biogpt.

Checkpoints on the hub: https://huggingface.co/microsoft/biogpt

It'd be very nice if someone converted the remaining BioGPT checkpoints to the HuggingFace format. The conversion script can be found here.

NielsRogge avatar Feb 03 '23 11:02 NielsRogge

https://huggingface.co/kamalkraj/BioGPT-Large-PubMedQA

(Two screenshots attached.)

kamalkraj avatar Feb 04 '23 07:02 kamalkraj

@kamalkraj Where did you find the model dict? It should be inside the checkpoint folder, but it isn't provided explicitly. I then remembered the change in fairseq where the dict used to be stored as part of the model, but even after loading the model I was unable to find the dict.

harveenchadha avatar Feb 04 '23 13:02 harveenchadha

Hi @harveenchadha,

Once the model is loaded like this:

import torch
torch.manual_seed(42)

from src.transformer_lm_prompt import TransformerLanguageModelPrompt

model = TransformerLanguageModelPrompt.from_pretrained(
        "../QA-PubMedQA-BioGPT-Large",
        "checkpoint.pt",
        "../QA-PubMedQA-BioGPT-Large",
        max_len_b=1024,
        max_tokens=12000,
        source_lang="x",
        target_lang="y")

you can save the dict with:

model.src_dict.save("new_dict/dict.txt")

kamalkraj avatar Feb 04 '23 14:02 kamalkraj

Hi @kamalkraj,

Thanks for the reply, but it looks like you need a dict to load the model itself. What am I doing wrong?

Here is a colab

harveenchadha avatar Feb 04 '23 16:02 harveenchadha

Oh man! I just found out the dict and bpecodes are present in the data folder itself :D

harveenchadha avatar Feb 04 '23 16:02 harveenchadha

@kamalkraj do you mind converting the other BioGPT checkpoints?

Can I transfer this checkpoint to the Microsoft organization?

NielsRogge avatar Feb 04 '23 17:02 NielsRogge

@kamalkraj do you mind converting the other BioGPT checkpoints?

Can I transfer this checkpoint to the Microsoft organization?

You can transfer https://huggingface.co/kamalkraj/BioGPT-Large-PubMedQA to Microsoft.

I will update this issue as I convert the other models.

kamalkraj avatar Feb 04 '23 18:02 kamalkraj

@NielsRogge https://huggingface.co/kamalkraj/BioGPT-Large

kamalkraj avatar Feb 05 '23 06:02 kamalkraj

Is it possible to fine-tune a model through the Hugging Face package? Thank you!

evanbrociner avatar Feb 08 '23 01:02 evanbrociner

@harveenchadha were you able to execute it in Colab?

sockthem avatar Feb 08 '23 06:02 sockthem

@NielsRogge can you help me with question-answering inference documentation for the same model? I got multiple errors.

sockthem avatar Feb 08 '23 06:02 sockthem

@evanbrociner yes, fine-tuning can be done easily. See our example notebook and example script to fine-tune any GPT-like model (like BioGPT) on your custom dataset.
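
For reference, here is a minimal fine-tuning sketch along those lines, assuming the microsoft/biogpt checkpoint and a plain-text corpus; the dataset path and hyperparameters are placeholders, not taken from the linked examples:

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")

# Any dataset with a "text" column works here; "my_corpus.txt" is a placeholder.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False gives standard next-token (causal) language-modeling labels.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="biogpt-finetuned", num_train_epochs=1, per_device_train_batch_size=4),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()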

NielsRogge avatar Feb 08 '23 08:02 NielsRogge

@sockthem sure, note that BioGptForCausalLM is just a generative model: you can prompt it with text and it will continue the prompt. It's not like BertForQuestionAnswering, which does extractive question answering from a piece of text.
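
As a minimal illustration of that prompting behaviour (the prompt below is just an example):

from transformers import pipeline, set_seed

set_seed(42)
# BioGptForCausalLM continues the prompt; it does not extract an answer span.
generator = pipeline("text-generation", model="microsoft/biogpt")
print(generator("COVID-19 is", max_length=50, num_return_sequences=1)[0]["generated_text"])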

NielsRogge avatar Feb 08 '23 08:02 NielsRogge

@NielsRogge Thank you for all the amazing help! Another quick question: might a Hugging Face implementation of BioGPT fine-tuned for document classification on HoC be in the works?

evanbrociner avatar Feb 09 '23 21:02 evanbrociner

@evanbrociner there's currently a contributor adding a BioGptForSequenceClassification class, which could be used for this purpose. Alternatively, you could fine-tune BioGPT to simply make it generate the appropriate class as next token.

However note that GPT-like (decoder-only Transformer) models oftentimes aren't the best at classification tasks, as they have a causal attention mask instead of a bidirectional attention mask (meaning they can only look at previous tokens when making a prediction, whereas BERT-like or encoder-only Transformers can look in both directions).

For classifying biomedical texts, a model like BioClinicalBERT might work better.
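
Once that class is merged, usage would presumably look something like the sketch below; num_labels and the input sentence are placeholders:

from transformers import AutoTokenizer, BioGptForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
# num_labels is a placeholder; set it to match your HoC label set.
model = BioGptForSequenceClassification.from_pretrained("microsoft/biogpt", num_labels=10)

inputs = tokenizer("The tumour cells showed sustained proliferative signalling.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.argmax(-1))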

NielsRogge avatar Feb 10 '23 08:02 NielsRogge

Hi,

Can the BioGPT checkpoint on Transformers be used for relation extraction on PubMed?

SalvatoreRa avatar Feb 28 '23 15:02 SalvatoreRa

@evanbrociner there's currently a contributor adding a BioGptForSequenceClassification class, which could be used for this purpose. Alternatively, you could fine-tune BioGPT to simply make it generate the appropriate class as next token.

However note that GPT-like (decoder-only Transformer) models oftentimes aren't the best at classification tasks, as they have a causal attention mask instead of a bidirectional attention mask (meaning they can only look at previous tokens when making a prediction, whereas BERT-like or encoder-only Transformers can look in both directions).

For classifying biomedical texts, a model like BioClinicalBERT might work better.

I found it surprising that BioGPT works better than the BioBERT variants on downstream tasks, as shown in the BioGPT paper.

timothylimyl avatar Mar 08 '23 11:03 timothylimyl

@sockthem sure, note that BioGptForCausalLM is just a generative model: you can prompt it with text and it will continue the prompt. It's not like BertForQuestionAnswering, which does extractive question answering from a piece of text.

I want to make sure my BioGPT knowledge is correct.

The link below is an example where it seems to only be able to handle Text-Generation tasks. https://colab.research.google.com/drive/1YZxASGlrTOzM5Mxv3yF1rzyxehRa3SIh?usp=sharing#scrollTo=C8uvWlZGOtY_

If I want to try the relation extraction task, I need to add and train other modules (e.g. BioGPT-RE-BC5CDR or BioGPT-RE-DDI).

Is that right?

ZON-ZONG-MIN avatar Mar 25 '23 15:03 ZON-ZONG-MIN

Yeah, from this list it looks like only 3 models have been converted to the HF format so far.

The conversion script (to convert models from this repository to the HF format) can be found here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/biogpt/convert_biogpt_original_pytorch_checkpoint_to_pytorch.py. cc @kamalkraj
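
For anyone picking this up, the invocation is roughly the following; the flag names here are assumptions based on other conversion scripts in the library and may differ, so confirm with --help, and the checkpoint folder is assumed to contain checkpoint.pt along with dict.txt and bpecodes:

python convert_biogpt_original_pytorch_checkpoint_to_pytorch.py \
    --biogpt_checkpoint_path path/to/fairseq_checkpoint_folder \
    --pytorch_dump_folder_path path/to/hf_output_folder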

NielsRogge avatar Mar 26 '23 08:03 NielsRogge

Hi @NielsRogge, @kamalkraj,

I wanted to take a stab at converting the fine-tuned models but came up short with the following error:

RuntimeError: Error(s) in loading state_dict for BioGptForCausalLM:
        size mismatch for biogpt.embed_tokens.weight: copying a param with shape torch.Size([42393, 1024]) from checkpoint, the shape in current model is torch.Size([42384, 1024]).
        size mismatch for output_projection.weight: copying a param with shape torch.Size([42393, 1024]) from checkpoint, the shape in current model is torch.Size([42384, 1024]).

It appears that the new model's shapes are off by 9 vocabulary entries, but I am not sure why. If I am missing something obvious, bear with me as I am just getting my feet wet here. I was able to run the script mentioned above successfully on the pre-trained BioGPT with no problems at all. Regarding the bpecodes and the dict.txt, I ran the preprocessing step for all the models and copied them from the corresponding /data directories.
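
(For anyone hitting a similar mismatch, a quick sanity check along these lines can show where the extra entries come from; the paths and the state-dict key are assumptions and may vary between fairseq versions:)

import torch

# Compare the vocabulary size baked into the fine-tuned fairseq checkpoint with the
# dict.txt handed to the conversion script (paths are placeholders).
ckpt = torch.load("path/to/finetuned_model/checkpoint_avg.pt", map_location="cpu")
emb = ckpt["model"]["decoder.embed_tokens.weight"]  # key name may differ across fairseq versions
print("checkpoint vocab size:", emb.shape[0])

with open("path/to/finetuned_model/dict.txt") as f:
    print("dict.txt entries:", sum(1 for _ in f))

# fairseq adds a few special symbols on top of dict.txt, so the two numbers will not
# match exactly even for a checkpoint that converts cleanly; what matters is whether
# the gap is the same as for the pre-trained model that did convert.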

I pulled down the checkpoint files for DDI, DTI and BC5CDR as I am interested in trying out some of the NER tasks, but I have not been able to run any of those models successfully using PyTorch, as I keep getting the following:

AssertionError: Could not infer task type from {'_name': 'language_modeling_prompt', 'data': 'data', 'sample_break_mode': 'none', 'tokens_per_sample': 1024, 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'add_bos_token': False, 'max_target_positions': 1024, 'shorten_method': 'none', 'shorten_data_split_list': '', 'pad_to_fixed_length': False, 'pad_to_fixed_bsz': False, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'dataset_impl': None, 'data_buffer_size': 10, 'tpu': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'source_lang': None, 'target_lang': None, 'max_source_positions': 640, 'manual_prompt': None, 'learned_prompt': 9, 'learned_prompt_pattern': 'learned', 'prefix': False, 'sep_token': '<seqsep>'}. Available argparse tasks: dict_keys(['sentence_prediction', 'sentence_prediction_adapters', 'speech_unit_modeling', 'hubert_pretraining', 'denoising', 'multilingual_denoising', 'translation', 'multilingual_translation', 'translation_from_pretrained_bart', 'translation_lev', 'language_modeling', 'speech_to_text', 'legacy_masked_lm', 'text_to_speech', 'speech_to_speech', 'online_backtranslation', 'simul_speech_to_text', 'simul_text_to_text', 'audio_pretraining', 'semisupervised_translation', 'frm_text_to_speech', 'cross_lingual_lm', 'translation_from_pretrained_xlm', 'multilingual_language_modeling', 'audio_finetuning', 'masked_lm', 'sentence_ranking', 'translation_multi_simple_epoch', 'multilingual_masked_lm', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt']). Available hydra tasks: dict_keys(['sentence_prediction', 'sentence_prediction_adapters', 'speech_unit_modeling', 'hubert_pretraining', 'translation', 'translation_lev', 'language_modeling', 'simul_text_to_text', 'audio_pretraining', 'translation_from_pretrained_xlm', 'multilingual_language_modeling', 'audio_finetuning', 'masked_lm', 'dummy_lm', 'dummy_masked_lm'])

It could easily be something wrong on my end, but being able to run the pre-trained model via PyTorch and through the HF conversion script, while the fine-tuned ones fail, makes me think there is something off with the fine-tuned checkpoint files (checkpoint_avg.pt).

Cheers

esko22 avatar Apr 21 '23 00:04 esko22

Hi, I want to perform question answering using BioGPT. Could you please help me with that?

TRGanesh avatar Oct 24 '23 12:10 TRGanesh