pytorch-template
TODO: also configure logging for sub-processes (not master)
Hi victoresque,
Thanks for your great repo! I used the hydra_DDP branch to build my application, but ran into a problem with get_logger. Specifically, util.py loads the '.hydra/hydra.yaml' file from the current working directory, but hydra.yaml only exists in the output directory, e.g. 'outputs/2022-09-25/15-16-17', so Python can't find it. I'm a little puzzled about the expected path of hydra.yaml. Maybe get_logger should load hydra.yaml from the output directory instead? Could anyone help me? Thanks in advance!
(base) python train.py
Traceback (most recent call last):
File "/mnt/petrelfs/qudelin/PJLAB/RS/VRS-Transformer/train.py", line 19, in <module>
logger = get_logger("train")
File "/mnt/petrelfs/qudelin/PJLAB/RS/VRS-Transformer/src/utils/util.py", line 19, in get_logger
hydra_conf = OmegaConf.load('.hydra/hydra.yaml')
File "/mnt/petrelfs/qudelin/miniconda3/lib/python3.9/site-packages/omegaconf/omegaconf.py", line 187, in load
with io.open(os.path.abspath(file_), "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/qudelin/PJLAB/RS/VRS-Transformer/.hydra/hydra.yaml'
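Maybe something like this could work? Just a sketch of what I mean: resolve hydra.yaml relative to Hydra's output directory instead of the launch directory. The helper name resolve_hydra_yaml is my own, and it assumes get_logger is only called after Hydra has started the run:

```python
import logging
import os

from hydra.core.hydra_config import HydraConfig
from omegaconf import OmegaConf


def resolve_hydra_yaml():
    """Hypothetical helper: find hydra.yaml inside the run's output directory
    instead of the directory the job was launched from."""
    try:
        # Hydra >= 1.2 exposes the output directory of the current run.
        output_dir = HydraConfig.get().runtime.output_dir
    except Exception:
        # Older Hydra versions change the working directory to the output
        # directory once the app starts, so the cwd is a reasonable fallback.
        output_dir = os.getcwd()
    return os.path.join(output_dir, ".hydra", "hydra.yaml")


def get_logger(name):
    hydra_conf = OmegaConf.load(resolve_hydra_yaml())
    # ... the rest of the original get_logger (verbosity check, logging setup)
    # would stay the same ...
    return logging.getLogger(name)
```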
Hi @DelinQu, thank you for raising this issue.
It seems that get_logger is problematic, as you pointed out, and I'm currently working on this. I'll let you know if there is any progress.
Also, if you are interested in using the hydra_DDP branch of this repo, I would recommend using my fork of it instead. A while ago I made a few commits on that branch which fix a serious bug (it was not using DDP at all), but forgot to open a PR to apply that fix to this repo. I will make a PR soon, but it could take some time.
Thanks for your reply, SunQpark. I will follow your DDP branch! 😃
Oh thanks, but I'll keep this issue open for now!
Hi SunQpark,
Your repo has really helped me tremendously, but I ran into another problem when training is stopped early:
Although it doesn't affect my models much, the error persists. My configuration file is as follows:
n_cpu: 8
n_gpu: 8
batch_size: 4
learning_rate: 0.0001
weight_decay: 0
scheduler_step_size: 50
scheduler_gamma: 0.1
status: train
trainer:
  epochs: 500
  logging_step: 100
  monitor: min loss/valid
  save_topk: 5
  early_stop: 10
  tensorboard: true
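For context, my understanding of how monitor: min loss/valid and early_stop: 10 are applied is roughly the following (my own sketch of the usual convention, not the actual trainer code; train_one_epoch is just a stand-in):

```python
import random


def train_one_epoch(epoch):
    """Hypothetical stand-in for one training + validation epoch."""
    return {"loss/valid": random.random()}


# Parsed from `monitor: min loss/valid`: minimize the metric named "loss/valid".
mnt_mode, mnt_metric = "min", "loss/valid"
early_stop = 10          # trainer.early_stop
epochs = 500             # trainer.epochs

best = float("inf") if mnt_mode == "min" else -float("inf")
not_improved_count = 0

for epoch in range(1, epochs + 1):
    log = train_one_epoch(epoch)
    current = log[mnt_metric]

    # Count how many consecutive epochs the monitored metric failed to improve.
    improved = current < best if mnt_mode == "min" else current > best
    if improved:
        best = current
        not_improved_count = 0
    else:
        not_improved_count += 1

    if not_improved_count >= early_stop:
        print(f"'{mnt_metric}' did not improve for {early_stop} epochs, stopping training.")
        break
```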