
T5FineTuner issue "in training_epoch_end avg_train_loss = torch.stack([x["loss"] for x in outputs]).mean() "

GeYue opened this issue 4 years ago • 4 comments

Hi Suraj, I am trying to use your T5FineTuner class to learn fine-tuning. Unfortunately, when I run the program in my environment, I get this error:

in training_epoch_end avg_train_loss = torch.stack([x["loss"] for x in outputs]).mean()
RuntimeError: stack expects a non-empty TensorList

I tried to track down the cause and found that "training_step" is never called. I suspected it was related to the "ImdbDataSet" used for the train_dataloader, but I debugged it and it seems fine. I have only just started with deep learning, so there may be something obvious that I am missing.
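For context, the reported RuntimeError is easy to reproduce in isolation: if training_step never runs, the outputs list passed to training_epoch_end is empty, and torch.stack on an empty list always fails. A minimal sketch (hypothetical standalone snippet, not the notebook's own code):

```python
import torch

# What training_epoch_end receives when training_step was never executed:
outputs = []

try:
    avg_train_loss = torch.stack([x["loss"] for x in outputs]).mean()
except RuntimeError as e:
    # PyTorch raises: "stack expects a non-empty TensorList"
    print(e)
```

So the error message itself is a symptom: the real question is why no training-step outputs were collected.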

Do you have any idea what might cause this? Thank you, and I look forward to your feedback.

Best Regards

GeYue avatar Dec 28 '20 15:12 GeYue

Hi! I had the same problem and figured out that it was a package version problem. To make this notebook work properly, you need to use these versions:

!pip install transformers==2.9.0 
!pip install pytorch_lightning==0.7.5

MarcosFP97 avatar Jan 27 '21 10:01 MarcosFP97

I have created a PR, but meanwhile you can download the fixed notebook from my fork: here

Best, Marcos

MarcosFP97 avatar Jan 27 '21 10:01 MarcosFP97

Thanks @MarcosFP97 for the answer. I had the same issue, and the loss was 'nan' during training; it was solved by switching to the right package versions.

Alternatively, the problem may be caused by the self-defined optimizer_step function. Another solution is to add closure=optimizer_closure to the optimizer.step() call inside optimizer_step(). This works because Lightning wraps the training step (and its backward pass) in that closure; a custom optimizer_step() must invoke the closure, both to actually run training and to return the last training step's result to the progress bar via tqdm_dict.

With this change, my problem was solved without downgrading the packages. For example, add closure=optimizer_closure in the function:

def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                   optimizer_closure, on_tpu=False,
                   using_native_amp=False, using_lbfgs=False):
    if self.trainer.use_tpu:
        # requires: import torch_xla.core.xla_model as xm
        xm.optimizer_step(optimizer)
    else:
        # The fix: pass the closure so Lightning runs training_step
        # and backward inside optimizer.step().
        optimizer.step(closure=optimizer_closure)
    optimizer.zero_grad()
    self.lr_scheduler.step()
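The closure mechanism above is plain PyTorch, so it can be illustrated outside Lightning. A minimal sketch (toy example with a made-up parameter, not the notebook's model): torch.optim optimizers accept a closure that re-evaluates the loss and calls backward; Lightning puts training_step inside exactly such a closure, so calling optimizer.step() without it skips training entirely, which is why outputs ends up empty.

```python
import torch

w = torch.nn.Parameter(torch.tensor([2.0]))
opt = torch.optim.SGD([w], lr=0.1)

step_ran = []  # records whether the "training step" actually executed

def closure():
    # Stand-in for what Lightning does inside optimizer_closure:
    # forward pass, loss, backward.
    opt.zero_grad()
    loss = (w ** 2).sum()
    loss.backward()
    step_ran.append(loss.item())
    return loss

# With the closure, the step runs forward/backward and then updates w.
opt.step(closure=closure)
print(len(step_ran))  # 1 -- the step executed; without the closure it would be 0
```

Calling opt.step() with no closure here would update nothing meaningful and leave step_ran empty, mirroring the empty outputs list in training_epoch_end.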

Jackthebighead avatar Nov 09 '21 07:11 Jackthebighead

Thanks for your comments @Jackthebighead!

MarcosFP97 avatar Nov 09 '21 12:11 MarcosFP97