model-zoo
multi-gpu tensorboard handlers initialization
https://github.com/Project-MONAI/model-zoo/blob/cf5e0322ee25b178b6cf841f3bd81e0a8adf2b16/models/spleen_ct_segmentation/configs/multi_gpu_train.json#L18
the multi-gpu override essentially sets the trainer handlers to `$@train#handlers[:-2]` for the worker nodes. but because of the `@train#handlers` reference, the config parser will still trigger the handler constructor calls on all nodes, including the handlers that are then sliced off.
for tensorboard handlers this is an issue, as each constructor call creates a new event log file. as a result, a multi-node run leaves unnecessary event files in the log directory. https://github.com/Project-MONAI/MONAI/blob/e36982b87bf87fb9559fc4d124e132b67f177d23/monai/handlers/tensorboard_handlers.py#L52-L55
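to illustrate the behaviour described above, here is a minimal sketch in plain python. it does not use MONAI: `FakeTensorBoardHandler` and `build_handlers` are hypothetical stand-ins, assuming only that the real handler creates its event file at construction time.

```python
import tempfile
from pathlib import Path


class FakeTensorBoardHandler:
    """Hypothetical stand-in for a handler whose constructor creates an event file."""

    def __init__(self, log_dir: Path, name: str) -> None:
        # Assumption: like a summary writer, the file appears at construction time.
        self.path = log_dir / f"events.{name}"
        self.path.touch()


def build_handlers(log_dir: Path, rank: int) -> list:
    # Every handler is constructed first, as when the whole list is referenced...
    handlers = [
        "checkpoint-like handler",
        "stats-like handler",
        FakeTensorBoardHandler(log_dir, f"train.rank{rank}"),
        FakeTensorBoardHandler(log_dir, f"val.rank{rank}"),
    ]
    # ...and only afterwards are the last two dropped on non-zero ranks.
    return handlers if rank == 0 else handlers[:-2]


with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp)
    kept_on_worker = build_handlers(log_dir, rank=1)
    # The worker keeps no tensorboard handler, yet both event files exist.
    worker_event_files = sorted(p.name for p in log_dir.iterdir())
```

slicing only changes which handlers get attached to the trainer; the side effects of the constructors have already happened by then.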
a possible fix is to introduce a flag:
diff --git a/configs/multi_gpu_train.json b/configs/multi_gpu_train.json
index ea41b9f..f323b02 100644
--- a/configs/multi_gpu_train.json
+++ b/configs/multi_gpu_train.json
@@ -1,5 +1,6 @@
{
"device": "$torch.device('cuda:' + os.environ['LOCAL_RANK'])",
+ "use_tensorboard": "$dist.get_rank() == 0",
"network": {
"_target_": "torch.nn.parallel.DistributedDataParallel",
"module": "$@network_def.to(@device)",
diff --git a/configs/train.json b/configs/train.json
index 7c866fe..80f15d3 100644
--- a/configs/train.json
+++ b/configs/train.json
@@ -10,6 +10,7 @@
"output_dir": "$@bundle_root + '/eval'",
"data_list_file_path": "$@bundle_root + '/msd_task09_spleen_folds.json'",
"dataset_dir": "/data/Task09_Spleen",
+ "use_tensorboard": true,
"finetune": false,
"finetune_model_path": "$@bundle_root + '/models/model.pt'",
"early_stop": false,
@@ -191,6 +192,7 @@
},
{
"_target_": "TensorBoardStatsHandler",
+ "_disabled_": "$not @use_tensorboard",
"log_dir": "@output_dir",
"tag_name": "train_loss",
"output_transform": "$monai.handlers.from_engine(['loss'], first=True)"
@@ -279,6 +281,7 @@
},
{
"_target_": "TensorBoardStatsHandler",
+ "_disabled_": "$not @use_tensorboard",
"log_dir": "@output_dir",
"iteration_log": false
},
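the idea behind the flag can be sketched the same way in plain python. this is not the MONAI config parser: `FakeTensorBoardHandler`, `build_handlers` and `event_files_for` are hypothetical names, and the `use_tensorboard` variable only plays the role of the config flag from the diff above.

```python
import tempfile
from pathlib import Path


class FakeTensorBoardHandler:
    """Hypothetical stand-in for a handler whose constructor creates an event file."""

    def __init__(self, log_dir: Path, name: str) -> None:
        self.path = log_dir / f"events.{name}"
        self.path.touch()


def build_handlers(log_dir: Path, rank: int) -> list:
    # Plays the role of "use_tensorboard": "$dist.get_rank() == 0".
    use_tensorboard = rank == 0
    handlers = ["stats-like handler"]
    if use_tensorboard:
        # The constructor is only reached when the flag is set,
        # so disabled ranks never create an event file.
        handlers.append(FakeTensorBoardHandler(log_dir, f"train.rank{rank}"))
    return handlers


def event_files_for(rank: int) -> list:
    # Build the handlers for one rank in a fresh directory and list what was written.
    with tempfile.TemporaryDirectory() as tmp:
        build_handlers(Path(tmp), rank)
        return sorted(p.name for p in Path(tmp).iterdir())


rank0_files = event_files_for(0)
rank1_files = event_files_for(1)
```

the difference from the slicing approach is that the decision is made before construction rather than after it.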
Thanks @wyli. I will take a look at this issue and your suggestion. Or @KumoLiu, if you have time, could you please help address it? We can check with the deepedit bundle first.
cc @Nic-Ma