
deepcam dummy wireup error

sparticlesteve opened this issue 3 years ago · 0 comments

It's probably not a common use case, but the "dummy" wireup method for deepcam doesn't seem to work.

Here's an example script at NERSC:

#!/bin/bash
#SBATCH -A nstaff_g
#SBATCH -q early_science
#SBATCH -C gpu
#SBATCH -J mlperf-deepcam
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node 1
#SBATCH --cpus-per-task=32
#SBATCH --time 30
#SBATCH --image sfarrell/deepcam:ref-21.12

# Configuration
local_batch_size=2
batchnorm_group_size=1
data_dir="/global/cfs/cdirs/mpccc/gsharing/sfarrell/climate-data/All-Hist"
output_dir="$SCRATCH/deepcam/results"
run_tag="test_dummy_${SLURM_JOB_ID}"

srun --mpi=pmi2 shifter --module gpu \
       python ./train.py \
       --wireup_method "dummy" \
       --run_tag ${run_tag} \
       --data_dir_prefix ${data_dir} \
       --output_dir ${output_dir} \
       --model_prefix "segmentation" \
       --optimizer "LAMB" \
       --start_lr 0.0055 \
       --lr_schedule type="multistep",milestones="800",decay_rate="0.1" \
       --lr_warmup_steps 400 \
       --lr_warmup_factor 1. \
       --weight_decay 1e-2 \
       --logging_frequency 10 \
       --save_frequency 0 \
       --max_epochs 1 \
       --max_inter_threads 4 \
       --seed $(date +%s) \
       --batchnorm_group_size ${batchnorm_group_size} \
       --local_batch_size ${local_batch_size}

This gives a runtime error when constructing the DDP wrapper:

Traceback (most recent call last):
  File "./train.py", line 256, in <module>
    main(pargs)
  File "./train.py", line 167, in main
    ddp_net = DDP(net, device_ids=[device.index],
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 551, in __init__
    self.process_group = _get_default_group()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 412, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
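
For reference, the failure is just DDP requiring a default process group; the "dummy" path apparently never calls torch.distributed.init_process_group before the model is wrapped. Below is a minimal sketch of one possible workaround, assuming it's acceptable to stand up a single-rank gloo group first. The helper name and the stand-in model are illustrative only, not the benchmark's actual code:

# Sketch of a possible workaround (illustrative, not the benchmark's code):
# initialize a single-rank process group so _get_default_group() succeeds
# even for the "dummy" wireup method.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def init_dummy_process_group():
    # Single-process "cluster": rank 0 of world size 1, rendezvous over localhost.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

init_dummy_process_group()

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net = torch.nn.Linear(8, 2).to(device)  # stand-in for the segmentation model

# With a default process group in place, the DDP constructor no longer raises.
ddp_net = DDP(net, device_ids=[device.index] if device.type == "cuda" else None)

An alternative would be to skip the DDP wrapper entirely when the wireup method is "dummy" and the world size is 1; either way, the traceback points at the missing init_process_group call.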

sparticlesteve · Oct 06 '22 21:10