fastNLP: An error encountered when using Trainer

An error encountered when using Trainer

Open warrior-yyyan opened this issue 3 years ago • 1 comments

With Python 3.9 and torch 1.11, using Trainer raises this error: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [32, 50, 711]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True). A custom training loop built on DataSetIter does not raise it. I searched online for fixes, and the cause seems to be an in-place modification. Is this related to the torch version? On a newer torch version, if I still want to use Trainer directly instead of writing a custom training loop, how can I fix this?

Apr 02 '22 09:04 warrior-yyyan
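
The message quoted above can be reproduced outside fastNLP. The sketch below is not the reporter's model: it assumes a ReLU output is updated in place (here with `+=`) before `backward()` is called, which is one way to get "output 0 of ReluBackward0 ... is at version 1". The helper name `backward_fails` is made up for illustration.

```python
import torch


def backward_fails(inplace_update: bool) -> bool:
    """Return True if backward() raises the in-place modification error."""
    x = torch.randn(4, 3, requires_grad=True)
    h = torch.relu(x)  # autograd keeps this output for the ReLU backward pass
    if inplace_update:
        h += 1.0       # in-place: bumps the version counter of the saved tensor
    else:
        h = h + 1.0    # out-of-place: creates a new tensor, saved output untouched
    try:
        h.sum().backward()
    except RuntimeError as err:
        return "inplace operation" in str(err)
    return False
```

Calling `torch.autograd.set_detect_anomaly(True)` before the forward pass, as the hint in the message suggests, makes PyTorch also report the forward-pass traceback of the operation whose backward failed.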

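Judging from the error, the network contains a ReLU that has inplace=True set. Could you check whether your network has this problem? Also, does it run with device='cpu'? If it does not, what error does it report? In the CUDA case, the place where the error actually occurs may not be the place where it is raised.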

Apr 02 '22 11:04 yhcc
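
One way to act on the suggestion above is to list and switch off every module-level `inplace` flag before passing the model to Trainer. This is a minimal sketch, assuming the model is an ordinary `torch.nn.Module`; `disable_inplace` is a hypothetical helper and not part of fastNLP or PyTorch, and the small `nn.Sequential` model is only a stand-in.

```python
import torch.nn as nn


def disable_inplace(model: nn.Module) -> list:
    """Set inplace=False on every submodule that has it enabled.

    Returns the names of the submodules that were changed.
    """
    changed = []
    for name, module in model.named_modules():
        # Modules such as nn.ReLU and nn.Dropout expose an `inplace` attribute.
        if getattr(module, "inplace", False) is True:
            module.inplace = False
            changed.append(name)
    return changed


# Stand-in model (assumption): replace with the model given to Trainer.
model = nn.Sequential(
    nn.Linear(3, 3),
    nn.ReLU(inplace=True),
    nn.Dropout(0.1, inplace=True),
)
changed = disable_inplace(model)
```

This only covers flags stored on modules. In-place operations written directly in `forward`, such as `x += y` or `x.relu_()`, are not detected and have to be found by reading the model code or by enabling anomaly detection.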
