pytorch-template

TODO: also configure logging for sub-processes (not master)

Open DelinQu opened this issue 2 years ago • 4 comments

Hi victoresque, thanks for your great repo! I used the hydra_DDP branch to build my application, but ran into a problem in get_logger. Specifically, util.py loads '.hydra/hydra.yaml' relative to the current working directory, but hydra.yaml only exists in the Hydra output directory, such as 'outputs/2022-09-25/15-16-17', so Python can't find it. I'm a little puzzled about the path of hydra.yaml. Maybe get_logger should load hydra.yaml from the output directory instead? Could anyone help? Thanks in advance!

(base) python train.py                     
Traceback (most recent call last):
  File "/mnt/petrelfs/qudelin/PJLAB/RS/VRS-Transformer/train.py", line 19, in <module>
    logger = get_logger("train")
  File "/mnt/petrelfs/qudelin/PJLAB/RS/VRS-Transformer/src/utils/util.py", line 19, in get_logger
    hydra_conf = OmegaConf.load('.hydra/hydra.yaml')
  File "/mnt/petrelfs/qudelin/miniconda3/lib/python3.9/site-packages/omegaconf/omegaconf.py", line 187, in load
    with io.open(os.path.abspath(file_), "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/qudelin/PJLAB/RS/VRS-Transformer/.hydra/hydra.yaml'
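
A minimal sketch of the idea suggested above, assuming Hydra's default output layout of 'outputs/<date>/<time>/.hydra/hydra.yaml'. The helper name `find_hydra_yaml` and its parameters are hypothetical and not part of the repo or of Hydra. It first checks the current directory (the case where Hydra has already switched into the run directory) and otherwise falls back to the newest run under 'outputs'. Note that the traceback shows get_logger being called at module level in train.py, which may be before Hydra has set up the run directory at all.

```python
from pathlib import Path
from typing import Optional


def find_hydra_yaml(start: Path = Path("."), outputs_dir: str = "outputs") -> Optional[Path]:
    """Return the path to .hydra/hydra.yaml, or None if it cannot be found.

    Hypothetical helper for illustration; not part of the repo or of Hydra.
    """
    start = Path(start)

    # Case 1: the working directory is already the Hydra run directory.
    direct = start / ".hydra" / "hydra.yaml"
    if direct.is_file():
        return direct

    # Case 2: still in the project root, so look under outputs/<date>/<time>/.hydra/.
    candidates = sorted((start / outputs_dir).glob("*/*/.hydra/hydra.yaml"))

    # Date and time directory names sort lexicographically, so the last one is the newest.
    return candidates[-1] if candidates else None
```

The returned path could then be passed to `OmegaConf.load` in place of the hard-coded '.hydra/hydra.yaml'. Picking the newest run is only a heuristic: with several runs started at once it may select the wrong directory.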

DelinQu avatar Oct 24 '22 11:10 DelinQu

Hi @DelinQu, thank you for raising this issue.

It seems that get_logger is problematic as you pointed out and I'm currently working on this. I'll let you know if there is any progress.

Also, if you are interested in using the hydra_DDP branch of this repo, I would recommend using my fork of it. A while ago I made a few commits on that branch that fix a serious bug (it was not using DDP at all), but I forgot to open a PR to apply the fix to this repo. I will open a PR soon, but it could take some time.

SunQpark avatar Oct 25 '22 03:10 SunQpark

Thanks for your reply, SunQpark. I will follow your DDP version! 😃

DelinQu avatar Oct 25 '22 03:10 DelinQu

Oh thanks, but I'll leave this issue open for now!

SunQpark avatar Oct 25 '22 04:10 SunQpark

Hi SunQpark, your repo has helped me tremendously, but I ran into another problem when training is stopped early: [image]

Although it doesn't affect my models much, the error persists. My configuration file is as follows:

n_cpu: 8
n_gpu: 8
batch_size: 4
learning_rate: 0.0001
weight_decay: 0
scheduler_step_size: 50
scheduler_gamma: 0.1
status: train
trainer:
  epochs: 500
  logging_step: 100
  monitor: min loss/valid
  save_topk: 5
  early_stop: 10
  tensorboard: true
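
For readers unfamiliar with these trainer settings, here is a rough sketch of how `monitor: min loss/valid` and `early_stop: 10` are commonly interpreted: the monitor string is split into a mode and a metric name, and training stops once the metric has not improved for `early_stop` consecutive epochs. The function `early_stop_epoch` is hypothetical, and the exact comparison used in the repo may differ; it does not reproduce or explain the error reported above.

```python
from typing import List, Optional


def early_stop_epoch(monitor: str, values: List[float], patience: int) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None if it never stops.

    Hypothetical illustration of patience-based early stopping; not the repo's code.
    """
    # "min loss/valid" -> mode "min", metric name "loss/valid"
    mode, _metric = monitor.split()
    if mode not in ("min", "max"):
        raise ValueError(f"unknown monitor mode: {mode!r}")

    best = float("inf") if mode == "min" else float("-inf")
    not_improved = 0
    for epoch, value in enumerate(values, start=1):
        improved = value < best if mode == "min" else value > best
        if improved:
            best, not_improved = value, 0
        else:
            not_improved += 1
        # Stop once the metric has failed to improve for `patience` epochs in a row.
        if not_improved >= patience:
            return epoch
    return None
```

With the configuration above, `patience` would be 10 and `values` would be the per-epoch 'loss/valid' measurements.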

DelinQu avatar Oct 27 '22 02:10 DelinQu