
How to check the progress of distributed run "bash scripts/DINO_train_submitit.sh /path/to/my/COCODIR"

Open shenw000 opened this issue 1 year ago • 1 comment

I am using PyTorch 1.11 on Ubuntu 20.04. The setup works fine with the command "bash scripts/DINO_train.sh /path/to/my/COCODIR". I then submitted a distributed run with "bash scripts/DINO_train_submitit.sh /path/to/my/COCODIR". The terminal shows "Submitted job_id: 11007" and returns to the system prompt. Nothing else appears in the terminal after that. Does that mean the distributed run is still running, or did something go wrong? I checked the "experiments" folder and nothing has been generated there either. Is there a way to tell whether my training job has terminated or is still progressing? If training is in progress, how can I see how far it has progressed, e.g. the number of epochs completed?

shenw000 • Apr 26 '23 19:04