mmcv icon indicating copy to clipboard operation
mmcv copied to clipboard

[Fix] DDP duplicate logs handling

Open sleep3r opened this issue 3 years ago • 23 comments

Motivation

Essentially, PyTorch DDP attaches StreamHandler directly after first forward() and by default logger.root.handlers may be empty list

Modification

Added manual append to root.handlers list of StreamHandler with level ERROR in case of empty list

Checklist

Before PR:

  • [x] I have read and followed the workflow indicated in the CONTRIBUTING.md to create this PR.
  • [x] Pre-commit or linting tools indicated in CONTRIBUTING.md are used to fix the potential lint issues.
  • [x] Bug fixes are covered by unit tests, the case that causes the bug should be added in the unit tests.
  • [x] New functionalities are covered by complete unit tests. If not, please add more unit test to ensure the correctness.
  • [x] The documentation has been modified accordingly, including docstring or example tutorials.

After PR:

  • [ ] If the modification has potential influence on downstream or other related projects, this PR should be tested with some of those projects, like MMDet or MMCls.
  • [x] CLA has been signed and all committers have signed the CLA in this PR.

sleep3r avatar Jan 22 '22 20:01 sleep3r

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Jan 22 '22 20:01 CLAassistant

Hi @sleep3r ,thanks for your contribution. What is your PyTorch version?

zhouzaida avatar Jan 23 '22 01:01 zhouzaida

@zhouzaida Tested with PyTorch 1.10.1

I tried your logging utils with my trainloop and figured out some strange logging behaviour, let me show you some example.

sleep3r avatar Jan 23 '22 11:01 sleep3r

Снимок экрана 2022-01-23 в 14 59 03

sleep3r avatar Jan 23 '22 11:01 sleep3r

Here you could see, that by default the logger has not any handlers and DDP attaches one after forward of batch with idx 1. Therefore, the following logs are duplicated

sleep3r avatar Jan 23 '22 12:01 sleep3r

I may be missing something in case of influence on downstream or other related projects, but it seems for me to be a good notice for someone with a possible similar problem

sleep3r avatar Jan 23 '22 12:01 sleep3r

Thank you very much for the description. BTW, is it convenient for you to test that torch<1.10.1 has similar behavior

zhouzaida avatar Jan 23 '22 12:01 zhouzaida

@zhouzaida I've just tested. So, seems like this behaviour starts from torch==1.9.0.

1.8.0 and 1.8.1 are fine

sleep3r avatar Jan 23 '22 14:01 sleep3r

@zhouzaida I've just tested. So, seems like this behaviour starts from torch==1.9.0.

1.8.0 and 1.8.1 are fine

Got it.

zhouzaida avatar Jan 23 '22 15:01 zhouzaida

LGTM. But it seems that the propagate attribute of Logger causes the duplicate log. Can set logger.propagate to False solve the problem of duplicate logs? This way we don't need the change attribute of the root logger any more。

HAOCHENYE avatar Jan 26 '22 12:01 HAOCHENYE

@HAOCHENYE Well, yes logger.propagate=False solves this issue

sleep3r avatar Jan 26 '22 14:01 sleep3r

But this fix doesn't inherit the stated one in utils/logging.py ¯_(ツ)_/¯

sleep3r avatar Jan 26 '22 14:01 sleep3r

I mean, is it legit to replace all that StreamHandler setting stuff just by logger.propagate = False? Don't think so

sleep3r avatar Jan 26 '22 14:01 sleep3r

The init_process_group function in Pytorch 1.8 adds a StreamHandler to the root logger, so if get_logger is called after init_process_group, the log level of the root logger will be set to ERROR and there will be no duplicate log. Pytorch 1.10 and Pytorch1.7 do not attach a StreamHandler to the root logger during the init_process_group phase but during DDP.forward (Logging.info is called, which adds the default StreamHandler(INFO) to the root logger. Logging.info is also called in Pytorch1.8, while the log level of StreamHandler was set to ERROR by get_logger). The logging level of the StreamHandler at this point is INFO, so there will be a duplicate log caused by the propagate mechanism. Since the duplicate log is caused by the propagate attribute of the child logger, should we set the propagate to False? Here is the logger setting for Detectron2 https://github.com/facebookresearch/detectron2/blob/cbbc1ce26473cb2a5cc8f58e8ada9ae14cb41052/detectron2/utils/logger.py

HAOCHENYE avatar Jan 26 '22 15:01 HAOCHENYE

LGTM. I‘ll do further testing in the downstream codebase.

HAOCHENYE avatar Jan 27 '22 08:01 HAOCHENYE

Hi @sleep3r , is there any progress?

teamwong111 avatar Feb 21 '22 03:02 teamwong111

Hi @sleep3r , unit tests failed. https://app.circleci.com/pipelines/github/open-mmlab/mmcv/582/workflows/fe9adf11-579c-45c9-8802-4a15f74bb9d9/jobs/1207?invite=true#step-110-124

zhouzaida avatar May 12 '22 05:05 zhouzaida

@zhouzaida Yep, I mentioned this failed test above. But I haven't any thoughts how to fix it, unfortunately. Essentially, my new commit contains only some comment changes

sleep3r avatar May 12 '22 05:05 sleep3r

I could try the former fix without .propagate

sleep3r avatar May 12 '22 05:05 sleep3r

Lucky, it works

sleep3r avatar May 12 '22 06:05 sleep3r

@sleep3r thanks for your reply~. If we set default StreamHandler for logging.root in get_logger, It seems there will be no message shown in the terminal by calling logging.warning('xxx'):

import logging
sh = logging.StreamHandler()
sh.setLevel(logging.ERROR)
logging.root.addHandler(sh)

Is this behavior reasonable? Besides, what will happen if we set propagate=False, maybe there is some corner case I've not concerned.

HAOCHENYE avatar May 17 '22 06:05 HAOCHENYE

@HAOCHENYE I can't reproduce the behaviour with warning logging. Setting propagate=False leads to fallen tests, and I don't have much about this question too.

sleep3r avatar Jul 02 '22 20:07 sleep3r

@HAOCHENYE I can't reproduce the behaviour with warning logging. Setting propagate=False leads to fallen tests, and I don't have much about this question too.

@HAOCHENYE I can't reproduce the behaviour with warning logging. Setting propagate=False leads to fallen tests, and I don't have much about this question too.

Hi~ sorry for my the reply. This is the original issue #1691 I think it could be better if we do not need to change the root logger

HAOCHENYE avatar Jul 07 '22 02:07 HAOCHENYE

Hi @sleep3r!We are grateful for your efforts in helping improve this open-source project during your personal time. Welcome to join OpenMMLab Special Interest Group (SIG) private channel on Discord, where you can share your experiences, ideas, and build connections with like-minded peers. To join the SIG channel, simply message moderator— OpenMMLab on Discord or briefly share your open-source contributions in the #introductions channel and we will assist you. Look forward to seeing you there! Join us :https://discord.gg/UjgXkPWNqA If you have a WeChat account,welcome to join our community on WeChat. You can add our assistant :openmmlabwx. Please add "mmsig + Github ID" as a remark when adding friends:) Thank you again for your contribution❤ @sleep3r

OpenMMLab-Assistant003 avatar Apr 12 '23 08:04 OpenMMLab-Assistant003