
Training an adapter with a custom classification head and a PyTorch training loop

Ch-rode opened this issue 2 years ago · 9 comments


Hello! I want to add the adapter approach to my pre-trained BERT text-classification model, but I did not find a good explanation in the documentation on how to do that. My model class is the following:

class BertClassifier(nn.Module):
    """Bert Model for Classification Tasks."""
    def __init__(self, freeze_bert=True):
        """
         @param    bert: a BertModel object
         @param    classifier: a torch.nn.Module classifier
         @param    freeze_bert (bool): Set `False` to fine-tune the BERT model
        """
        super(BertClassifier, self).__init__()

        # Instantiate BERT model
        # Specify hidden size of BERT, hidden size of our classifier, and number of labels
        self.bert = BertAdapterModel.from_pretrained(PRETRAINED_MODEL)
        self.D_in = 1024 
        self.H = 512
        self.D_out = 2
        

        # Add a new adapter
        self.bert.add_adapter("thermo_cl",set_active=True)
        self.bert.train_adapter(["thermo_cl"])

 
        # Instantiate the classifier head with some one-layer feed-forward classifier
        self.classifier = nn.Sequential(
            nn.Linear(self.D_in, 512),
            nn.Tanh(),
            nn.Linear(512, self.D_out),
            nn.Tanh()
        )
 
        # Freeze the BERT model
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False


    def forward(self, input_ids, attention_mask):
        ''' Feed input to BERT and the classifier to compute logits.
         @param    input_ids (torch.Tensor): an input tensor with shape (batch_size,
                       max_length)
         @param    attention_mask (torch.Tensor): a tensor that hold attention mask
                       information with shape (batch_size, max_length)
         @return   logits (torch.Tensor): an output tensor with shape (batch_size,
                       num_labels) '''
        # Feed input to BERT
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)

        # Extract the last hidden state of the `[CLS]` token for the classification task
        last_hidden_state_cls = outputs[0][:, 0, :]

        # Feed the `[CLS]` representation to the classifier to compute logits
        logits = self.classifier(last_hidden_state_cls)

        return logits

The training loop is the following:

def initialize_model(epochs):
    """ Initialize the Bert Classifier, the optimizer and the learning rate scheduler."""
    # Instantiate Bert Classifier
    bert_classifier = BertClassifier(freeze_bert=False)  # False = fine-tune BERT (do not freeze)

    # Tell PyTorch to run the model on GPU
    bert_classifier = bert_classifier.to(device)

    # Create the optimizer
    optimizer = AdamW(bert_classifier.parameters(),
                      lr=lr,    # Default learning rate
                      eps=1e-8    # Default epsilon value
                      )

    # Total number of training steps
    total_steps = len(train_dataloader) * epochs

    # Set up the learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=0, # Default value
                                                num_training_steps=total_steps)

    return bert_classifier, optimizer, scheduler

def train(model, train_dataloader, val_dataloader, valid_loss_min_input, checkpoint_path, best_model_path, start_epochs, epochs, evaluation=True):

    """Train the BertClassifier model."""
    # Start training loop
    logging.info("--Start training...\n")

    # Initialize tracker for minimum validation loss
    valid_loss_min = valid_loss_min_input 


    for epoch_i in range(start_epochs, epochs):

                          ..............................

        if evaluation == True:
            # After the completion of each training epoch, measure the model's performance
            # on our validation set.
            val_loss, val_accuracy = evaluate(model, val_dataloader)

            # Print performance over the entire training data
            time_elapsed = time.time() - t0_epoch

            logging.info(f"{epoch_i + 1:^7} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^10.6f} | {time_elapsed:^9.2f}")

            logging.info("-"*70)
        logging.info("\n")

        # create checkpoint variable and add important data
        checkpoint = {
            'epoch': epoch_i + 1,
            'valid_loss_min': val_loss,
            'state_dict': model.state_dict(),
            'optimizer': optimizer.state_dict(),
        }
        
        # save checkpoint
        save_ckp(checkpoint, False, checkpoint_path, best_model_path)
        
        ## TODO: save the model if validation loss has decreased
        if val_loss <= valid_loss_min:
            print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(valid_loss_min,val_loss))
            # save checkpoint as best model
            save_ckp(checkpoint, True, checkpoint_path, best_model_path)
            valid_loss_min = val_loss


    model.save_adapter("./final_adapter", "thermo_cl")
    logging.info("-----------------Training complete--------------------------")

bert_classifier, optimizer, scheduler = initialize_model(epochs=n_epochs)
train(model = bert_classifier....)

As you can see, I have my own personalized classification head, so I don't want to use the .add_classification_head() method. Is it correct to train and activate the adapter in this way? I would like to know whether I'm using the adapter properly, and also how to save the checkpoint and my model weights, because at the end of training (where I'm supposed to save the adapter) I receive this error:

AttributeError: 'BertClassifier' object has no attribute 'save_adapter'

Thanks for the help!

Ch-rode avatar May 10 '22 08:05 Ch-rode

Hey @Ch-rode,

Your code for adding and activating the adapter looks right. For saving the adapter, you should call model.bert.save_adapter("./final_adapter", "thermo_cl"), since your model is an instance of your custom class, which does not have a save_adapter() method.
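
For reference, a minimal sketch of that call with the names used in this thread (the custom wrapper has no save_adapter(), but the wrapped BertAdapterModel does):

model = BertClassifier(freeze_bert=False)

# ... training loop ...

# Save only the "thermo_cl" adapter weights via the inner BertAdapterModel
model.bert.save_adapter("./final_adapter", "thermo_cl")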

calpt avatar May 13 '22 16:05 calpt

Thanks for the answer @calpt. So, to be sure, I don't need to activate a head with the same name as the adapter, right? Because in this tutorial it says something like that. Another question: do I have to save the adapter after all the training, rather than as a checkpoint? Like the following. Thanks for the help!

def train(model, train_dataloader, val_dataloader, valid_loss_min_input, checkpoint_path, best_model_path, start_epochs, epochs, evaluation=True):

    # Start training loop
    logging.info("--Start training...\n")

    # Initialize tracker for minimum validation loss
    valid_loss_min = valid_loss_min_input 


    for epoch_i in range(start_epochs, epochs):

        
        # =======================================
        #               Training
        # =======================================
        # Print the header of the result table
        logging.info((f"{'Epoch':^7} | {'Batch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}"))

        # Measure the elapsed time of each epoch
        t0_epoch, t0_batch = time.time(), time.time()

        # Reset tracking variables at the beginning of each epoch
        total_loss, batch_loss, batch_counts = 0, 0, 0

        # Put the model into the training mode
        model.train()

        # For each batch of training data...
        for step, batch in enumerate(train_dataloader):
            batch_counts +=1
            # Load batch to GPU
            b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

            # Zero out any previously calculated gradients
            model.zero_grad()

            # Perform a forward pass. This will return logits.
            logits = model(b_input_ids, b_attn_mask)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            batch_loss += loss.item()
            total_loss += loss.item()

            # Perform a backward pass to calculate gradients
            loss.backward()

            # Clip the norm of the gradients to 1.0 to prevent "exploding gradients"
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            # Update parameters and the learning rate
            optimizer.step()
            scheduler.step()

            # Print the loss values and time elapsed every 500 batches
            if (step % 500 == 0 and step != 0) or (step == len(train_dataloader) - 1):
                # Calculate time elapsed for the last 500 batches
                time_elapsed = time.time() - t0_batch

                # Print training results
                logging.info(f"{epoch_i + 1:^7} | {step:^7} | {batch_loss / batch_counts:^12.6f} | {'-':^10} | {'-':^9} | {time_elapsed:^9.2f}")

                # Reset batch tracking variables
                batch_loss, batch_counts = 0, 0
                t0_batch = time.time()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)

        logging.info("-"*70)
        # =======================================
        #               Evaluation
        # =======================================
        if evaluation == True:
            # After the completion of each training epoch, measure the model's performance
            # on our validation set.
            val_loss, val_accuracy = evaluate(model, val_dataloader)

            # Print performance over the entire training data
            time_elapsed = time.time() - t0_epoch
            
            logging.info(f"{epoch_i + 1:^7} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^10.6f} | {time_elapsed:^9.2f}")

            logging.info("-"*70)
        logging.info("\n")


        # create checkpoint variable and add important data
        checkpoint = {
            'epoch': epoch_i + 1,
            'valid_loss_min': val_loss,
            'state_dict': model.state_dict(),
            'optimizer': optimizer.state_dict(),
        }
        
        # save checkpoint
        save_ckp(checkpoint, False, checkpoint_path, best_model_path)
        
        ## TODO: save the model if validation loss has decreased
        if val_loss <= valid_loss_min:
            print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(valid_loss_min,val_loss))
            # save checkpoint as best model
            save_ckp(checkpoint, True, checkpoint_path, best_model_path)
            valid_loss_min = val_loss
            
    model.bert.save_adapter("./final_adapter", "thermo_cl")
    logging.info("-----------------Training complete--------------------------")


Ch-rode avatar May 18 '22 12:05 Ch-rode

To resume the model I'm doing something like this; is it correct?

bert_classifier, optimizer, scheduler = initialize_model(epochs=n_epochs)

model, optimizer, start_epoch, valid_loss_min = load_ckp(r"./best_model/best_model.pt", bert_classifier, optimizer)

model.load_adapter("./final_adapter", model_name=model)

model.set_active_adapters("thermo_cl")

Ch-rode avatar May 18 '22 13:05 Ch-rode

So, to be sure, I don't need to activate a head with the same name as the adapter, right? Because in this tutorial it says something like that.

You don't have to do this if you're using a custom prediction head on top of the model (as you are doing). The BertAdapterModel provides built-in head implementations for common tasks (such as classification), which you can use if they fit your use case. Keeping a head with the same name as the adapter enables automatic loading and saving of the head together with the adapter. Since you have your own prediction head, you can simply ignore everything about the built-in prediction heads.
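
(For completeness, a hedged sketch of that built-in-head route, in case it ever fits your use case; this assumes the adapter-transformers API, with "bert-large-uncased" as a placeholder checkpoint matching the 1024 hidden size above and num_labels=2 mirroring D_out:)

from transformers import BertAdapterModel  # provided by the adapter-transformers package

model = BertAdapterModel.from_pretrained("bert-large-uncased")  # placeholder checkpoint

# Adapter and head share the name "thermo_cl", so they are saved and loaded together
model.add_adapter("thermo_cl", set_active=True)
model.add_classification_head("thermo_cl", num_labels=2)
model.train_adapter("thermo_cl")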

To resume the model I'm doing something like that, is it correct?

Yes, your code for saving and loading the adapter looks good to me. However, you don't have to pass model_name to load_adapter() as long as you're loading the adapter from the local file system.
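
A hedged sketch of that resume path, reusing the helper functions from this thread (initialize_model and load_ckp are your own) and going through .bert like save_adapter above:

bert_classifier, optimizer, scheduler = initialize_model(epochs=n_epochs)
model, optimizer, start_epoch, valid_loss_min = load_ckp(
    "./best_model/best_model.pt", bert_classifier, optimizer)

# No model_name needed when loading from a local directory;
# load_adapter() returns the name of the loaded adapter
adapter_name = model.bert.load_adapter("./final_adapter")
model.bert.set_active_adapters(adapter_name)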

calpt avatar May 19 '22 10:05 calpt

Thanks a lot, it works! However, I am facing trouble when I have to re-load the model (e.g. for testing or inference). I save the model using:

# saving the model in Hugging Face format
model.save_pretrained('./best_model_hugginface/model_hugginface')

if adapter == 'True':
    # save only the adapter separately
    model.bert.save_adapter('./best_model_hugginface/final_adapter', 'adapter_v1')

So when I load it using BertClassifier.from_pretrained('./best_model_hugginface/model_hugginface') without loading the adapter (because it is supposed to be inside the full model weights) and I check the structure, I can see it, BUT I get the warning There are adapters available but none are activated for the forward pass. This is an extract from the structure:

(adapters): ModuleDict(
                (adapter_v1): Adapter(
                  (non_linearity): Activation_Function_Class(
                    (f): ReLU()
                  )
                  (adapter_down): Sequential(
                    (0): Linear(in_features=1024, out_features=64, bias=True)
                    (1): Activation_Function_Class(
                      (f): ReLU()

This is my updated BertClassifier:

class BertClassifierConfig(PretrainedConfig):
......

class BertClassifier(PreTrainedModel):
    """Bert Model for Classification Tasks."""
    config_class = BertClassifierConfig
    def __init__(self, config, freeze_bert=True):
        """
         @param    config: a BertClassifierConfig object
         @param    bert: a BertModel object
         @param    classifier: a torch.nn.Module classifier
         @param    freeze_bert (bool): Set `False` to fine-tune the BERT model
        """
        super().__init__(config)

        # Instantiate BERT model
        # Specify hidden size of BERT, hidden size of our classifier, and number of labels
        self.bert = BertAdapterModel.from_pretrained(PRETRAINED_MODEL)
        self.D_in = 1024 
        self.H = 512
        self.D_out = 2
        

        # Add a new adapter
        self.bert.add_adapter("thermo_cl",set_active=True)
        self.bert.train_adapter(["thermo_cl"])

 
        # Instantiate the classifier head with some one-layer feed-forward classifier
        self.classifier = nn.Sequential(
            nn.Linear(self.D_in, 512),
            nn.Tanh(),
            nn.Linear(512, self.D_out),
            nn.Tanh()
        )
 
        # Freeze the BERT model
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False


    def forward(self, input_ids, attention_mask):
        ''' Feed input to BERT and the classifier to compute logits.
         @param    input_ids (torch.Tensor): an input tensor with shape (batch_size,
                       max_length)
         @param    attention_mask (torch.Tensor): a tensor that hold attention mask
                       information with shape (batch_size, max_length)
         @return   logits (torch.Tensor): an output tensor with shape (batch_size,
                       num_labels) '''
        # Feed input to BERT
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)

        # Extract the last hidden state of the `[CLS]` token for the classification task
        last_hidden_state_cls = outputs[0][:, 0, :]

        # Feed the `[CLS]` representation to the classifier to compute logits
        logits = self.classifier(last_hidden_state_cls)

        return logits

I used PretrainedConfig and PreTrainedModel from Hugging Face.

Thanks a lot !

Ch-rode avatar Jun 16 '22 15:06 Ch-rode

Also, if I print the adapter config:

bert_classifier = BertClassifier.from_pretrained('bestmodel_hugginface/model_hugginface/')
print(bert_classifier.config.adapters.adapters)

I can see it: {'adapter_v1': 'pfeiffer'}. In #337 I can see your comment:

Also note that when you call model.save_pretrained() on a model with adapters, it will save the full model along with the adapters (in the same file). Thus, you don't need to save adapters separately in this case.

So my question is: how should the warning There are adapters available but none are activated for the forward pass be interpreted?

Thanks a lot !

Ch-rode avatar Jun 16 '22 15:06 Ch-rode

Hey, sorry for taking so long to answer. The warning There are adapters available but none are activated for the forward pass usually means that you haven't activated any adapters for the forward pass. While the adapters are re-loaded together with the model automatically, they still have to be activated again to be used: model.set_active_adapters("adapter_v1"). The currently active adapters can be printed using model.active_adapters.
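
A minimal sketch of that reload-and-reactivate flow, using the paths and adapter name from this thread (going through .bert, since the adapter methods live on the wrapped BertAdapterModel):

# Adapters are restored together with the model weights ...
bert_classifier = BertClassifier.from_pretrained("./best_model_hugginface/model_hugginface")

# ... but they have to be activated again before the forward pass
bert_classifier.bert.set_active_adapters("adapter_v1")

# Sanity check: prints the adapters that will be used in the forward pass
print(bert_classifier.bert.active_adapters)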

calpt avatar Aug 01 '22 20:08 calpt

Thanks for the reply. If I use model.set_active_adapters("adapter_v1") I receive the warning Overwriting existing adapter 'adapter_v1'.

Ch-rode avatar Aug 08 '22 18:08 Ch-rode

That's interesting; this warning should not occur when calling set_active_adapters(), as that method does not load any adapter weights. Usually it happens when you load weights for an adapter that has already been added to the model, e.g. when an adapter is loaded together with the model using from_pretrained() and afterwards is loaded once again using load_adapter() somewhere. However, as long as the last loaded checkpoint is the correct one, this warning will not result in any issues.
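
A short, hedged sketch of that scenario, to make the redundancy explicit (paths and names taken from this thread):

# from_pretrained() already restores "adapter_v1" together with the model weights
model = BertClassifier.from_pretrained("./best_model_hugginface/model_hugginface")

# Loading it again is redundant and is what usually triggers
# "Overwriting existing adapter 'adapter_v1'." (harmless if this checkpoint is the right one)
model.bert.load_adapter("./best_model_hugginface/final_adapter")

# Activation alone does not load weights
model.bert.set_active_adapters("adapter_v1")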

calpt avatar Aug 10 '22 12:08 calpt

This issue has been automatically marked as stale because it has been without activity for 90 days. This issue will be closed in 14 days unless you comment or remove the stale label.

adapter-hub-bert avatar Nov 09 '22 06:11 adapter-hub-bert

This issue was closed because it was stale for 14 days without any activity.

adapter-hub-bert avatar Nov 24 '22 06:11 adapter-hub-bert