
How to add support for the CodeBERT model?

yijunyu opened this issue 2 years ago • 10 comments

I wonder how to support CodeBERT: https://huggingface.co/microsoft/codebert-base. It is popular in deep code learning and is based on RoBERTa. The Python code base is here: https://github.com/microsoft/CodeBERT. If you don't have time, could you please give me some advice on how to achieve this? Thank you!

yijunyu avatar Feb 10 '22 22:02 yijunyu

Hello @yijunyu ,

Thank you for your interest in the library and in CodeBERT. The RoBERTa architecture is already available in the library -- I have converted the weights and uploaded them at https://huggingface.co/microsoft/codebert-base. Can you please try loading them and let me know if this works for you?

For reference, weights stored on the model hub can be loaded and cached as follows:

use rust_bert::bert::BertConfig;
use rust_bert::resources::{RemoteResource, Resource};
use rust_bert::roberta::RobertaForMaskedLM;
use rust_bert::Config;
use rust_tokenizers::tokenizer::RobertaTokenizer;
use tch::{nn, Device};

fn main() -> anyhow::Result<()> {
    // Resource paths
    let config_resource = Resource::Remote(RemoteResource {
        url: "https://huggingface.co/microsoft/codebert-base/resolve/main/config.json".into(),
        cache_subdir: "codebert-base/config".into(),
    });
    let vocab_resource = Resource::Remote(RemoteResource {
        url: "https://huggingface.co/microsoft/codebert-base/resolve/main/vocab.json".into(),
        cache_subdir: "codebert-base/vocab".into(),
    });
    let merges_resource = Resource::Remote(RemoteResource {
        url: "https://huggingface.co/microsoft/codebert-base/resolve/main/merges.txt".into(),
        cache_subdir: "codebert-base/merges".into(),
    });
    let weights_resource = Resource::Remote(RemoteResource {
        url: "https://huggingface.co/microsoft/codebert-base/resolve/main/rust_model.ot".into(),
        cache_subdir: "codebert-base/model".into(),
    });

    let config_path = config_resource.get_local_path()?;
    let vocab_path = vocab_resource.get_local_path()?;
    let merges_path = merges_resource.get_local_path()?;
    let weights_path = weights_resource.get_local_path()?;

    let tokenizer = RobertaTokenizer::from_file(
        vocab_path.to_str().unwrap(),
        merges_path.to_str().unwrap(),
        false,
        false,
    )?;

    let mut vs = nn::VarStore::new(Device::cuda_if_available());
    let config = BertConfig::from_file(config_path);
    let model = RobertaForMaskedLM::new(&vs.root(), &config);
    vs.load(weights_path)?;

    Ok(())
}
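
For completeness, a forward pass with the loaded model could then look roughly like the following. This is only a sketch against the rust-bert 0.17-era API (the additional imports, the forward_t signature, and the output field name are assumptions to verify); it would slot in before Ok(()) above:

use rust_tokenizers::tokenizer::{Tokenizer, TruncationStrategy};
use tch::{no_grad, Tensor};

    // Encode an input snippet and run it through the masked-LM head.
    let tokens = tokenizer.encode("let x = 1;", None, 128, &TruncationStrategy::LongestFirst, 0);
    let input_ids = Tensor::of_slice(&tokens.token_ids).unsqueeze(0).to(vs.device());
    let output = no_grad(|| {
        model.forward_t(Some(&input_ids), None, None, None, None, None, None, false)
    });
    println!("prediction scores shape: {:?}", output.prediction_scores.size());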

guillaume-be avatar Feb 11 '22 20:02 guillaume-be

Thanks a lot for the initial import. I was able to run the above code, which loads the original CodeBERT model. However, I am not sure how to load a fine-tuned CodeBERT model as a rust_model.ot.

At the moment, we can generate a fine-tuned model using PyTorch, but the script utils/convert_model.py is unable to convert it. The fine-tuned architecture was created like this:

        self.codeBert = RobertaModel.from_pretrained("microsoft/codebert-base")
        self.fc = nn.Linear(768,768)
        self.classifier = nn.Linear(768,2)
        ...
        torch.save(model, "codeBERT_pl.bin")

Our fine-tuned CodeBERT model can be downloaded from here: http://bertrust.s3.amazonaws.com/codeBERT_pl.bin. I wonder whether it is possible to convert this one for rust-bert. Thank you.

yijunyu avatar Feb 12 '22 08:02 yijunyu

Hello @yijunyu ,

I believe the weight conversion should work fine. Can you please post a stack trace if that is not the case? Even with the conversion working, you will still not be able to load the weights directly into the architectures proposed in this library.

You have created a custom architecture that is not readily available in this crate. I see two solutions for loading your model in this library:

  1. Use a standard architecture from Transformers. It seems you are trying to do sequence classification or token classification. Is it possible to re-use one of the architectures proposed in the Python library, such as RobertaForSequenceClassification? You would then be able to read the weights directly, since this architecture is implemented in the crate (see the sketch after this list).
  2. Create a Rust wrapper around RobertaModel, similar to what you have done in Python. You have to define the layers and the forward operations in Rust, with names matching your parameter dictionary. You can look at the Rust implementation of RobertaForSequenceClassification as an example and adapt it for your model.
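
To illustrate the first option, here is a minimal sketch of loading converted weights into this crate's RobertaForSequenceClassification (the local paths are placeholders, and the configuration needs to define a label mapping -- more on this below):

use rust_bert::bert::BertConfig;
use rust_bert::roberta::RobertaForSequenceClassification;
use rust_bert::Config;
use tch::{nn, Device};

fn main() -> anyhow::Result<()> {
    let mut vs = nn::VarStore::new(Device::cuda_if_available());
    let config = BertConfig::from_file("config.json");
    // Panics if the configuration does not provide a label mapping (num_labels).
    let _model = RobertaForSequenceClassification::new(&vs.root(), &config);
    // Loading succeeds only if the variable names in rust_model.ot match the architecture.
    vs.load("rust_model.ot")?;
    Ok(())
}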

guillaume-be avatar Feb 13 '22 14:02 guillaume-be

Thanks for the guidance. I will tackle that problem once the conversion can be done. Currently, the stack trace is as follows:

$ python utils/convert_model.py codeBERT_pl.bin 
Traceback (most recent call last):
  File "/home/ubuntu/Documents/github.com/guillaume-be/rust-bert/utils/convert_model.py", line 21, in <module>
    for k, v in weights.items():
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 947, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'RobertaClass' object has no attribute 'items'

When print(weights) is inserted into convert_model.py at line 20, I get the following dump:

RobertaClass(
  (codeBert): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
    ...
  )
  (fc): Linear(in_features=768, out_features=768, bias=True)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
)

The last two lines may indicate that the model we saved has two more attributes:

        self.codeBert = RobertaModel.from_pretrained("microsoft/codebert-base")
        self.fc = nn.Linear(768,768)
        self.classifier = nn.Linear(768,2)

However, the stack trace concerns a missing "items" attribute on weights, and from the printed weights variable I don't see where to locate that attribute on the model.

yijunyu avatar Feb 14 '22 15:02 yijunyu

Hello @yijunyu , Could you please share the Python code that was used to serialize the model?

guillaume-be avatar Feb 19 '22 11:02 guillaume-be

The main code is listed below:

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import RobertaConfig, RobertaModel

class RobertaClass(torch.nn.Module):
    def __init__(self):
        super(RobertaClass, self).__init__()
        self.codeBert = RobertaModel.from_pretrained("microsoft/codebert-base")
        self.fc = nn.Linear(768,768)
        self.classifier = nn.Linear(768,2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        roberta_out = self.codeBert(input_ids = input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_state = roberta_out[0]
        pooler = hidden_state[:,0]
        pooler = F.relu(self.fc(pooler))
        pooler = F.dropout(pooler, 0.3)
        output = self.classifier(pooler)
        return output

model = RobertaClass()
model.to(device)
...
for epoch_i in range(0, opt.epochs):
    torch.save(model, 'codeBERT_pl.bin')
    model.train()

yijunyu avatar Feb 20 '22 13:02 yijunyu

Can you try saving the model state dict instead of the model itself, i.e.:

torch.save(model.state_dict(), 'codeBERT_pl.bin')

guillaume-be avatar Feb 20 '22 13:02 guillaume-be

The above conversion works! I can save the model into rust_model.ot and load it into memory using the code fragment you shared:

let model = RobertaForMaskedLM::new(&vs.root(), &config);

Now I am at the next step, which is to handle the custom architecture. I used the first alternative, RobertaForSequenceClassification:

let model = RobertaForSequenceClassification::new(&vs.root(), &config);

It can still load the model, but num_labels, which should be 2 in our case, seems to be missing:

thread 'main' panicked at 'num_labels not provided in configuration', /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/rust-bert-0.17.0/src/roberta/roberta_model.rs:374:14

It is not found in the original configuration (https://huggingface.co/microsoft/codebert-base/resolve/main/config.json). I wonder whether I should take the second route you suggested and supply the missing num_labels value.

Here is the specific Python code where we define RobertaClass:

class RobertaClass(torch.nn.Module):
    def __init__(self):
        super(RobertaClass, self).__init__()
        self.codeBert = RobertaModel.from_pretrained("microsoft/codebert-base")
        self.fc = nn.Linear(768,768)
        self.classifier = nn.Linear(768,2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        roberta_out = self.codeBert(input_ids = input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_state = roberta_out[0]
        pooler = hidden_state[:,0]
        pooler = F.relu(self.fc(pooler))
        pooler = F.dropout(pooler, 0.3)
        output = self.classifier(pooler)
        return output

How should I add a new forward_t function to the adaptation?

yijunyu avatar Feb 21 '22 22:02 yijunyu

Hello @yijunyu ,

I believe you can still use the first approach, and this should be an easy fix. For sequence classification, the models expect a mapping from label id to name; see for example https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english/blob/main/config.json:

"id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
}

Can you please update your configuration to provide the same mapping and try again?

guillaume-be avatar Feb 22 '22 17:02 guillaume-be

Thanks @guillaume-be. I tried the following change by adapting the configuration from https://huggingface.co/microsoft/codebert-base/resolve/main/config.json:

{
  ...
  "id2label": {
    "0": "positive",
    "1": "negative"
  },
  "label2id": {
    "positive": 0,
    "negative": 1
  }
}

Then I changed sequence_classification.rs in the examples folder along the following lines:

    let config_resource = Resource::Remote(RemoteResource {
        url: "config.json".into(),
        // url: "https://huggingface.co/microsoft/codebert-base/resolve/main/config.json".into(),
        cache_subdir: "codebert-base/config".into(),
    });
    let merges_resource = Resource::Remote(RemoteResource {
        url: "https://huggingface.co/microsoft/codebert-base/resolve/main/merges.txt".into(),
        cache_subdir: "codebert-base/merges".into(),
    });
    let vocab_resource = Resource::Remote(RemoteResource {
        url: "https://huggingface.co/microsoft/codebert-base/resolve/main/vocab.json".into(),
        cache_subdir: "codebert-base/vocab".into(),
    });
    let weights_resource = Resource::Remote(RemoteResource {
        url: "https://bertrust.s3.amazonaws.com/rust_model.ot".into(),
        cache_subdir: "codebert-base/model".into(),
    });
    let config = SequenceClassificationConfig::new(
        ModelType::Roberta,
        weights_resource,
        config_resource,
        vocab_resource,
        Some(merges_resource),
        true, // lowercase
        None, // strip_accents
        None, // add_prefix_space
    );

Now the classifier runs, but the performance is not good. I then tried to adapt src/roberta/roberta_model.rs as follows:

pub struct RobertaForSequenceClassification {
    roberta: BertModel<RobertaEmbeddings>,
    classifier: RobertaClassificationHead,
    fc: nn::Linear,
    dropout: Dropout,
}
...

pub fn new<'p, P>(p: P, config: &BertConfig) -> RobertaForSequenceClassification
    where
        P: Borrow<nn::Path<'p>>,
    {
        let p = p.borrow();
        let roberta =
            BertModel::<RobertaEmbeddings>::new_with_optional_pooler(p / "roberta", config, true);
        let classifier = RobertaClassificationHead::new(p / "classifier", config);
        let linear_config = tch::nn::LinearConfig::default();
        // Register under the name "fc" so the variable names match the saved parameter dictionary.
        let fc = nn::linear(p / "fc", 768, 768, linear_config);
        let dropout = Dropout::new(0.3);

        RobertaForSequenceClassification {
            roberta,
            classifier,
            fc,
            dropout,
        }
    }

    pub fn forward_t(
        &self,
        input_ids: Option<&Tensor>,
        mask: Option<&Tensor>,
        token_type_ids: Option<&Tensor>,
        position_ids: Option<&Tensor>,
        input_embeds: Option<&Tensor>,
        train: bool,
    ) -> RobertaSequenceClassificationOutput {
        let base_model_output = self
            .roberta
            .forward_t(
                input_ids,
                mask,
                token_type_ids,
                position_ids,
                input_embeds,
                None,
                None,
                train,
            )
            .unwrap();
        let pooled_output = &base_model_output
            .hidden_state
            .apply(&self.fc)
            .relu()
            .apply_t(&self.dropout, train);
        let logits = self.classifier.forward_t(&pooled_output, train);
        RobertaSequenceClassificationOutput {
            logits,
            all_hidden_states: base_model_output.all_hidden_states,
            all_attentions: base_model_output.all_attentions,
        }
    }

I am not sure whether the pooling done in Python needs to be replicated explicitly, because there is a boolean argument in the initializer of the roberta field:

        let roberta =
            BertModel::<RobertaEmbeddings>::new_with_optional_pooler(p / "roberta", config, true);

If it is not for this purpose, then maybe I need to pool explicitly, as in the Python code:

    def forward(self, input_ids, attention_mask, token_type_ids):
        roberta_out = self.codeBert(input_ids = input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_state = roberta_out[0]
        pooler = hidden_state[:,0]
        pooler = F.relu(self.fc(pooler))
        pooler = F.dropout(pooler, 0.3)
        output = self.classifier(pooler)
        return output
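
For reference, here is a minimal sketch of how that first-token pooling could be written with tch inside forward_t. It assumes fc and classifier are plain nn::Linear layers registered under the names used in the saved parameter dictionary (which differs from the RobertaClassificationHead used in the adaptation above):

        // Sketch only: mirror the Python head on top of the base model output.
        // base_model_output.hidden_state has shape [batch_size, sequence_length, 768]
        let pooled = base_model_output.hidden_state.select(1, 0); // first token -> [batch_size, 768]
        let pooled = pooled.apply(&self.fc).relu();               // fc + ReLU, as in the Python forward
        let pooled = pooled.apply_t(&self.dropout, train);        // dropout(p=0.3), active only during training
        let logits = pooled.apply(&self.classifier);              // Linear(768, 2) -> [batch_size, 2]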

yijunyu avatar Feb 23 '22 07:02 yijunyu

Closed via https://github.com/guillaume-be/rust-bert/pull/282, https://github.com/guillaume-be/rust-bert/pull/322

guillaume-be avatar Jan 20 '23 19:01 guillaume-be