Single-machine multi-GPU training of rt-detrv2-r101: loss backward pass fails with ValueError: (InvalidArgument) Required tensor shall not be nullptr, but received nullptr.

Open DoctorDream opened this issue 6 months ago • 10 comments

Search before asking

  • [X] I have searched the issues and found no similar bug report.

Bug Component

Training

Describe the Bug

When I train rt-detr with the following command:

python -m paddle.distributed.launch --gpus 0,1,2 tools/train.py -c configs/rtdetrv2/rtdetrv2_r101vd_6x_coco.yml --fleet --eval

the following error occurs:

Traceback (most recent call last):
  File "/home/zqy/zqy/Codes/PaddleDetection/tools/train.py", line 209, in <module>
    main()
  File "/home/zqy/zqy/Codes/PaddleDetection/tools/train.py", line 205, in main
    run(FLAGS, cfg)
  File "/home/zqy/zqy/Codes/PaddleDetection/tools/train.py", line 158, in run
    trainer.train(FLAGS.eval)
  File "/home/zqy/zqy/Codes/PaddleDetection/ppdet/engine/trainer.py", line 614, in train
    loss.backward()
  File "/usr/local/lib/python3.10/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.10/dist-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle/base/framework.py", line 593, in __impl__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle/base/dygraph/tensor_patch_methods.py", line 342, in backward
    core.eager.run_backward([self], grad_tensor, retain_graph)
ValueError: (InvalidArgument) Required tensor shall not be nullptr, but received nullptr.
  [Hint: tensor should not be null.] (at ../paddle/phi/core/device_context.cc:142)

When I train on a single GPU, the error does not occur.

Environment

  • OS:Linux
  • PaddlePaddle: paddlepaddle-gpu 2.6.1.post117 and paddlepaddle-gpu 2.6.0.post117
  • PaddleDetection: develop/2.7
  • python: 3.10
  • CUDA: 11.7

Bug description confirmation

  • [X] I confirm that the bug reproduction steps, a description of any code changes, and the environment information have been provided, and that the problem can be reproduced.

Are you willing to submit a PR?

  • [ ] I'd like to help by submitting a PR!

DoctorDream, Aug 16 '24 03:08