
[Bug] With multiple GPUs, eval after training and offline test are not guaranteed to give the same accuracy

Open whlook opened this issue 1 year ago • 0 comments

Prerequisite

  • [X] I have searched Issues and Discussions but cannot get the expected help.
  • [X] The bug has not been fixed in the latest version(https://github.com/open-mmlab/mmengine).

Environment

image

Reproduces the problem - code sample

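If a model contains BN (not SyncBN) and is trained on multiple GPUs (2 GPUs), the BN statistics differ on each rank during the eval that runs after training, so the final accuracy does not match test. Offline test reloads the same pth on every rank, so its results are identical on every run.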
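A minimal stdlib-only simulation of the effect described above, not mmengine or PyTorch code. It assumes BN running statistics are updated with the usual exponential moving average (momentum 0.1, the PyTorch default) and that each rank sees different batches. The names `update_running_mean` and `simulate` are made up for illustration.

```python
import random

MOMENTUM = 0.1  # assumed: same default momentum as PyTorch BatchNorm


def update_running_mean(running_mean, batch, momentum=MOMENTUM):
    """One BN-style moving-average update from a local batch."""
    batch_mean = sum(batch) / len(batch)
    return (1 - momentum) * running_mean + momentum * batch_mean


def simulate(num_ranks=2, steps=50, batch_size=8, seed=0):
    """Each rank draws its own batches, so each keeps its own running mean."""
    rng = random.Random(seed)
    running = [0.0] * num_ranks
    for _ in range(steps):
        for rank in range(num_ranks):
            batch = [rng.gauss(1.0, 2.0) for _ in range(batch_size)]
            running[rank] = update_running_mean(running[rank], batch)
    return running


per_rank = simulate()
print("running mean per rank after training:", per_rank)

# Offline test loads one checkpoint on every rank, which is equivalent to
# every rank using the statistics of a single rank (here: rank 0).
synced = [per_rank[0]] * len(per_rank)
print("running mean per rank after loading one pth:", synced)
```

The per-rank values differ after training, while loading one checkpoint gives every rank the same value, which matches the difference between eval after train and offline test.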

Reproduces the problem - command or script

Always reproducible: it occurs in a DDP environment whenever BN is used.

Reproduces the problem - error message

None

Additional information

  1. Eval after train should be as reliable as test.
  2. In test, every rank uses the same weights.
  3. In the eval after train, each rank uses different BN parameters.
  4. The model should be synchronized before val (TODO).
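One possible way to do the synchronization from item 4, as a sketch only: broadcast the BN buffers from rank 0 to all other ranks right before validation. `sync_bn_buffers` is a hypothetical helper, not an mmengine API, and where to call it (for example from a custom hook that runs before validation) is an assumption. It only uses `torch.distributed.broadcast` and is a no-op for communication when no process group is initialized.

```python
import torch.distributed as dist
from torch import nn
from torch.nn.modules.batchnorm import _BatchNorm


def sync_bn_buffers(model: nn.Module, src: int = 0) -> int:
    """Broadcast BN buffers (running_mean, running_var, num_batches_tracked)
    from rank `src` to every rank. Returns the number of BN buffers visited.

    Hypothetical helper: without an initialized process group it only counts.
    """
    distributed = dist.is_available() and dist.is_initialized()
    count = 0
    for module in model.modules():
        if isinstance(module, _BatchNorm):
            # recurse=False: only the buffers owned by this BN layer
            for buf in module.buffers(recurse=False):
                if distributed:
                    dist.broadcast(buf, src=src)  # in-place on every rank
                count += 1
    return count
```

Broadcasting from rank 0 mirrors what offline test does, since the saved pth typically comes from one rank. Averaging the statistics across ranks with `dist.all_reduce` would be another option. Every rank must call the helper, because `broadcast` is a collective operation.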

whlook, Apr 28 '24 11:04