Deadlock when running imagenet training in distributed mode


When running the ImageNet training script (src/sparseml/pytorch/image_classification/train.py) in distributed mode, I hit a deadlock at a seemingly random point mid-epoch, usually somewhere between epochs 20 and 30. The failing rank reports the following traceback:

    File "/nfs/scistore14/alistgrp/dkuznede/miniconda3/envs/pysparse/lib/python3.10/site-packages/sparseml_nightly-1.2.0.20221005-py3.10.egg/sparseml/pytorch/utils/module.py", line 810, in run_epoch
      return self.run(
    File "/nfs/scistore14/alistgrp/dkuznede/miniconda3/envs/pysparse/lib/python3.10/site-packages/sparseml_nightly-1.2.0.20221005-py3.10.egg/sparseml/pytorch/utils/module.py", line 712, in run
      batch_results = self._runner_batch(
    File "/nfs/scistore14/alistgrp/dkuznede/miniconda3/envs/pysparse/lib/python3.10/site-packages/sparseml_nightly-1.2.0.20221005-py3.10.egg/sparseml/pytorch/utils/module.py", line 989, in _runner_batch
      self._run_funcs.model_backward(losses, self._module, scaler=self._scaler)
    File "/nfs/scistore14/alistgrp/dkuznede/miniconda3/envs/pysparse/lib/python3.10/site-packages/sparseml_nightly-1.2.0.20221005-py3.10.egg/sparseml/pytorch/utils/module.py", line 82, in def_model_backward
      loss.backward()
    File "/nfs/scistore14/alistgrp/dkuznede/miniconda3/envs/pysparse/lib/python3.10/site-packages/torch/_tensor.py", line 396, in backward
      torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    File "/nfs/scistore14/alistgrp/dkuznede/miniconda3/envs/pysparse/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
      Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  RuntimeError: NCCL communicator was aborted on rank 1.  Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=735844, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1806003 milliseconds before timing out.
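
Since the watchdog reports a BROADCAST collective that one rank never joined within the 30-minute timeout, one thing I could try for further debugging is to enable verbose NCCL/distributed logging and raise the timeout where the process group is created. The sketch below is only illustrative (the init_process_group call actually lives inside the training script, so the exact placement is an assumption):

import datetime
import os

import torch.distributed as dist

# Fail fast with an error (and log collective details) instead of hanging silently.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

# Give long steps (e.g. pruning mask updates) more headroom than the default
# 30-minute watchdog before the communicator is aborted.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=2))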

The command to launch the script is the following (training on 2 GPUs):

python -m torch.distributed.launch --nproc_per_node 2 --master_port 29501 train.py \
    --dataset-path ${DATA_DIR} \
    --recipe-path recipes/gm_resnet.yaml \
    --pretrained True \
    --arch-key resnet50 \
    --dataset imagenet \
    --train-batch-size ${BATCH_SIZE_PER_GPU} \
    --test-batch-size 256 \
    --loader-num-workers 16 \
    --save-dir runs/resnet50 \
    --logs-dir runs/resnet50 \
    --model-tag resnet50-imagenet-pruned

The SparseML recipe defines gradual pruning of ResNet50 with a custom learning rate schedule over 100 epochs:

training_modifiers:

  - !EpochRangeModifier
    start_epoch: 0
    end_epoch: 100

  - !LearningRateModifier
    start_epoch: 40.0
    end_epoch: 90.0
    lr_class: StepLR
    lr_kwargs:
      step_size: 6
      gamma: 0.6
    init_lr: 0.005

pruning_modifiers:

  - !GlobalMagnitudePruningModifier
    params: __ALL_PRUNABLE__
    init_sparsity: 0.05
    final_sparsity: 0.95
    start_epoch: 0
    end_epoch: 40
    update_frequency: 5
    inter_func: linear
    mask_type: unstructured

  - !GlobalMagnitudePruningModifier
    params: __ALL_PRUNABLE__
    init_sparsity: 0.95
    final_sparsity: 0.95
    start_epoch: 40
    end_epoch: 100
    update_frequency: 60
    inter_func: linear
    mask_type: unstructured
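
For context, the recipe is consumed through SparseML's standard ScheduledModifierManager path; roughly, the wiring looks like the sketch below (the image-classification script handles this internally, and the model / optimizer / train_loader names here are placeholders):

from sparseml.pytorch.optim import ScheduledModifierManager

# Load the recipe and wrap the optimizer so the pruning and LR modifiers
# fire at their scheduled epochs during the normal training loop.
manager = ScheduledModifierManager.from_yaml("recipes/gm_resnet.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch=len(train_loader))

# ... run the training loop for manager.max_epochs epochs ...

manager.finalize(model)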

System specifications

Operating system: Debian GNU/Linux 11
Hardware: 2x NVIDIA A100
SparseML version: sparseml-nightly 1.2.0.20221005
CUDA toolkit version: 11.3

What could be causing this?

Godofnothing — Oct 06 '22 14:10