Loss does not go down
🐛 Bug Report
The loss does not go down, so training never converges to a valid reproduction result.
🔬 How To Reproduce
Steps to reproduce the behavior:
- create a new environment that matches the dependencies' versions
- clone the repo and run `bash scripts/run_glue_gpu.sh`
Environment
- OS: Linux
- Python version: Python 3.7.16
- transformers: 4.15.0
- torch: 1.8.1+cu111
Having the same issue across multiple tasks. My environment is as follows:
- transformers: 4.20.0.dev0
- torch: 1.11.0+cu113
I am using the following code at each iteration:
import torch.nn as nn

# calculate logits and loss
outputs = enct5(input_ids.to("cuda"), attention_mask.to("cuda"))
# note: reduction='none' followed by .mean() is equivalent to the default reduction='mean'
m = nn.CrossEntropyLoss(reduction='none')
_loss = m(outputs.logits.to("cuda"), labels.to("cuda"))
loss = _loss.mean()

# backpropagation and optimization
enct5.optimizer.zero_grad()
loss.backward()
enct5.optimizer.step()
Here's my job status on RTE:
@Spico197 Since it looks like you're running the given code, this may not be directly relevant, but I wanted to share that I was able to reduce the loss by refactoring the code into a PyTorch Lightning implementation.
I'd also like to note that my code is an adaptation of @monologg's; I noticed that this repository may be an incomplete implementation of EncT5.
@Aatlantise Hi there, thank you very much for your information!
Besides migrating the code to PyTorch Lightning modules, are there any other changes I should be aware of to reduce the loss?
Is it possible for you to reproduce the performance on the GLUE benchmark?
@Spico197 I was successful in reducing the loss by training with a Lightning module, but I am still working on reproducing T5's performance with the EncT5 architecture (Liu et al.): a 12-layer encoder, a 1-layer decoder, and a classification head, with their particular choice of hyperparameters.
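For reference, that EncT5 setup (12-layer encoder, 1-layer decoder, classification head over the decoder output) can be sketched as below. Everything here is an illustrative assumption, scaled down to a tiny randomly initialized config so it runs standalone; a real run would load pretrained `t5-base` weights and use Liu et al.'s hyperparameters:

```python
import torch
import torch.nn as nn
from transformers import T5Config, T5ForConditionalGeneration

# Tiny, randomly initialized config so the sketch runs without downloads;
# an actual experiment would use from_pretrained("t5-base") instead.
config = T5Config(
    vocab_size=1000, d_model=64, d_kv=16, d_ff=128,
    num_layers=2, num_decoder_layers=1, num_heads=4,
)
t5 = T5ForConditionalGeneration(config)

num_labels = 2  # assumption: a binary task such as RTE
classification_head = nn.Linear(config.d_model, num_labels)

batch_size, seq_len = 3, 10
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
attention_mask = torch.ones_like(input_ids)
# EncT5 feeds a single start token to the 1-layer decoder and classifies
# its output; here we reuse the pad token id (0) as that start token.
decoder_input_ids = torch.zeros((batch_size, 1), dtype=torch.long)

outputs = t5(
    input_ids=input_ids,
    attention_mask=attention_mask,
    decoder_input_ids=decoder_input_ids,
    output_hidden_states=True,
)
# Final decoder hidden state: (batch, 1, d_model) -> class logits
pooled = outputs.decoder_hidden_states[-1][:, 0, :]
logits = classification_head(pooled)
print(logits.shape)  # torch.Size([3, 2])
```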
On the other hand, I was able to reproduce or exceed T5 and BERT's performance with a simpler implementation:
import torch.nn as nn
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base")
encoder = model.encoder
classification_head = nn.Linear(model.config.d_model, num_labels)  # num_labels depends on the task
encoder_output = encoder(input_ids, attention_mask=attention_mask)
logits = classification_head(encoder_output.last_hidden_state)
This model is highly unstable, but it is able to match or exceed T5 or BERT performance with certain seeds (1 in 7-15 runs, according to my experiments).
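Given how seed-sensitive this setup is, pinning all RNG seeds is worth doing so that a lucky run can be reproduced later. A standard seeding sketch (the helper name is my own):

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Fix Python, NumPy, and PyTorch RNGs so a run can be reproduced."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines


set_seed(42)
print(torch.randint(0, 100, (1,)).item())  # deterministic given the seed
```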
Hope this is useful, and please do share if you run into any other helpful info!
Hi @Spico197 , wanted to share a relevant discussion you might be interested in: https://github.com/huggingface/transformers/pull/26683
@Aatlantise Thank you very much for sharing! I'm planning to rewrite the model from scratch and see if there's a performance difference. I'll update this thread if there's any further information.