deepcam dummy wireup error
It's probably not a common use case, but the "dummy" wireup method for deepcam doesn't seem to work.
Here's an example script at NERSC:
#!/bin/bash
#SBATCH -A nstaff_g
#SBATCH -q early_science
#SBATCH -C gpu
#SBATCH -J mlperf-deepcam
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node 1
#SBATCH --cpus-per-task=32
#SBATCH --time 30
#SBATCH --image sfarrell/deepcam:ref-21.12
# Configuration
local_batch_size=2
batchnorm_group_size=1
data_dir="/global/cfs/cdirs/mpccc/gsharing/sfarrell/climate-data/All-Hist"
output_dir="$SCRATCH/deepcam/results"
run_tag="test_dummy_${SLURM_JOB_ID}"
srun --mpi=pmi2 shifter --module gpu \
    python ./train.py \
        --wireup_method "dummy" \
        --run_tag ${run_tag} \
        --data_dir_prefix ${data_dir} \
        --output_dir ${output_dir} \
        --model_prefix "segmentation" \
        --optimizer "LAMB" \
        --start_lr 0.0055 \
        --lr_schedule type="multistep",milestones="800",decay_rate="0.1" \
        --lr_warmup_steps 400 \
        --lr_warmup_factor 1. \
        --weight_decay 1e-2 \
        --logging_frequency 10 \
        --save_frequency 0 \
        --max_epochs 1 \
        --max_inter_threads 4 \
        --seed $(date +%s) \
        --batchnorm_group_size ${batchnorm_group_size} \
        --local_batch_size ${local_batch_size}
This gives a runtime error when constructing the DDP wrapper:
Traceback (most recent call last):
  File "./train.py", line 256, in <module>
    main(pargs)
  File "./train.py", line 167, in main
    ddp_net = DDP(net, device_ids=[device.index],
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 551, in __init__
    self.process_group = _get_default_group()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 412, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
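For what it's worth, a minimal sketch of a workaround, assuming the "dummy" wireup path is simply meant to run single-process and currently skips torch.distributed initialization entirely: initialize a one-rank gloo process group before the DDP wrapper is constructed. The helper name and where it would be called from are hypothetical, not the actual reference code.

import os
import torch.distributed as dist

def init_dummy_process_group():
    # Hypothetical helper: set up a single-process "cluster" (rank 0 of world
    # size 1) so that DDP's _get_default_group() call succeeds.
    # env:// rendezvous needs MASTER_ADDR/MASTER_PORT even for one rank.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    if not dist.is_initialized():
        dist.init_process_group(backend="gloo", rank=0, world_size=1)

Alternatively, train.py could skip the DDP wrapper altogether when wireup_method is "dummy", since there is only one process anyway.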