Loss does not go down
🐛 Bug Report
The loss does not go down, so training never converges to a valid reproduction result.
🔬 How To Reproduce
Steps to reproduce the behavior:
- create a new environment that matches the dependencies' versions
- clone the repo and run `bash scripts/run_glue_gpu.sh`
Environment
- OS: Linux
- Python version: Python 3.7.16
- transformers: 4.15.0
- torch: 1.8.1+cu111
Having the same issue across multiple tasks. My environment is as follows:
- transformers: 4.20.0.dev0
- torch: 1.11.0+cu113
I am using the following code at each iteration:
import torch.nn as nn

# calculate logits and loss
outputs = enct5(input_ids.to("cuda"), attention_mask.to("cuda"))
# note: reduction='none' followed by .mean() is equivalent to the default reduction='mean'
m = nn.CrossEntropyLoss(reduction='none')
_loss = m(outputs.logits.to("cuda"), labels.to("cuda"))
loss = _loss.mean()

# backpropagation and optimization
enct5.optimizer.zero_grad()
loss.backward()
enct5.optimizer.step()
Here's my job status on RTE:
@Spico197 Since it looks like you're running the given code, this may not be directly relevant, but I wanted to share that I was able to reduce the loss by refactoring the code into a PyTorch Lightning implementation.
I'd also like to note that my code is an adaptation of @monologg's; I noticed that this repository may be an incomplete implementation of EncT5.
@Aatlantise Hi there, thank you very much for your information!
Besides migrating the code to PyTorch Lightning modules, are there any other changes I should be aware of to reduce the loss?
Is it possible for you to reproduce the performance on the GLUE benchmark?
@Spico197 I was successful in reducing the loss by training with a Lightning module, but I am still working on reproducing T5's performance with the EncT5 architecture (Liu et al.): a 12-layer encoder, a 1-layer decoder, and a classification head, with their particular choice of hyperparameters.
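For reference, that EncT5 setup (12-layer encoder, 1-layer decoder, classification head over the decoder output) can be sketched as below. Everything here is an illustrative assumption, scaled down to a tiny randomly initialized config so it runs standalone; a real run would load pretrained `t5-base` weights and use Liu et al.'s hyperparameters:

```python
import torch
import torch.nn as nn
from transformers import T5Config, T5ForConditionalGeneration

# Tiny, randomly initialized config so the sketch runs without downloads;
# an actual experiment would use from_pretrained("t5-base") instead.
config = T5Config(
    vocab_size=1000, d_model=64, d_kv=16, d_ff=128,
    num_layers=2, num_decoder_layers=1, num_heads=4,
)
t5 = T5ForConditionalGeneration(config)

num_labels = 2  # assumption: a binary task such as RTE
classification_head = nn.Linear(config.d_model, num_labels)

batch_size, seq_len = 3, 10
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
attention_mask = torch.ones_like(input_ids)
# EncT5 feeds a single start token to the 1-layer decoder and classifies
# its output; here we reuse the pad token id (0) as that start token.
decoder_input_ids = torch.zeros((batch_size, 1), dtype=torch.long)

outputs = t5(
    input_ids=input_ids,
    attention_mask=attention_mask,
    decoder_input_ids=decoder_input_ids,
    output_hidden_states=True,
)
# Final decoder hidden state: (batch, 1, d_model) -> class logits
pooled = outputs.decoder_hidden_states[-1][:, 0, :]
logits = classification_head(pooled)
print(logits.shape)  # torch.Size([3, 2])
```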
On the other hand, I was able to reproduce or exceed T5 and BERT's performance with a simpler implementation:
import torch.nn as nn
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base")
encoder = model.encoder
classification_head = nn.Linear(model.config.d_model, num_labels)  # num_labels depends on the task
encoder_output = encoder(input_ids, attention_mask=attention_mask)
logits = classification_head(encoder_output.last_hidden_state)
This model is highly unstable, but it is able to match or exceed T5 or BERT performance with certain seeds (1 in 7-15 runs, according to my experiments).
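Given how seed-sensitive this setup is, pinning all RNG seeds is worth doing so that a lucky run can be reproduced later. A standard seeding sketch (the helper name is my own):

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Fix Python, NumPy, and PyTorch RNGs so a run can be reproduced."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines


set_seed(42)
print(torch.randint(0, 100, (1,)).item())  # deterministic given the seed
```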
Hope this is useful, and please do share if you run into any other helpful info!
Hi @Spico197 , wanted to share a relevant discussion you might be interested in: https://github.com/huggingface/transformers/pull/26683
@Aatlantise Thank you very much for sharing! I'm planning to rewrite the model from scratch and see if there's a performance difference. I'll update this thread if there's any further information.