Deadlock when running imagenet training in distributed mode
When running the ImageNet training script adapted from torchvision (i.e. `src/sparseml/pytorch/image_classification/train.py`), I hit a deadlock at a seemingly random point during training (mid-epoch, usually somewhere between epochs 20 and 30). The traceback on the failing rank is:
File "/nfs/scistore14/alistgrp/dkuznede/miniconda3/envs/pysparse/lib/python3.10/site-packages/sparseml_nightly-1.2.0.20221005-py3.10.egg/sparseml/pytorch/utils/module.py", line 810, in run_epoch
return self.run(
File "/nfs/scistore14/alistgrp/dkuznede/miniconda3/envs/pysparse/lib/python3.10/site-packages/sparseml_nightly-1.2.0.20221005-py3.10.egg/sparseml/pytorch/utils/module.py", line 712, in run
batch_results = self._runner_batch(
File "/nfs/scistore14/alistgrp/dkuznede/miniconda3/envs/pysparse/lib/python3.10/site-packages/sparseml_nightly-1.2.0.20221005-py3.10.egg/sparseml/pytorch/utils/module.py", line 989, in _runner_batch
self._run_funcs.model_backward(losses, self._module, scaler=self._scaler)
File "/nfs/scistore14/alistgrp/dkuznede/miniconda3/envs/pysparse/lib/python3.10/site-packages/sparseml_nightly-1.2.0.20221005-py3.10.egg/sparseml/pytorch/utils/module.py", line 82, in def_model_backward
loss.backward()
File "/nfs/scistore14/alistgrp/dkuznede/miniconda3/envs/pysparse/lib/python3.10/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/nfs/scistore14/alistgrp/dkuznede/miniconda3/envs/pysparse/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: NCCL communicator was aborted on rank 1. Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=735844, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1806003 milliseconds before timing out.
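The watchdog timing out on a BROADCAST suggests the ranks have desynchronized rather than the broadcast itself failing. As a diagnostic (not something the script does today, as far as I can tell), the NCCL process-group timeout can be raised and distributed debug logging enabled; where exactly to pass `timeout` depends on where `train.py` calls `init_process_group`:

```python
import datetime
import os

import torch.distributed as dist

# Diagnostic sketch: enable verbose NCCL / torch.distributed logging and raise
# the 30-minute watchdog timeout so the hang can be inspected instead of aborted.
# Both environment variables must be set before the process group is created.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),  # NCCL default is 30 minutes
)
```

Raising the timeout only delays the abort, but the extra logging should help show which collective the two ranks disagree on.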
The command used to launch the script (training on 2 GPUs) is:

```bash
python -m torch.distributed.launch --nproc_per_node 2 --master_port 29501 train.py \
    --dataset-path ${DATA_DIR} \
    --recipe-path recipes/gm_resnet.yaml \
    --pretrained True \
    --arch-key resnet50 \
    --dataset imagenet \
    --train-batch-size ${BATCH_SIZE_PER_GPU} \
    --test-batch-size 256 \
    --loader-num-workers 16 \
    --save-dir runs/resnet50 \
    --logs-dir runs/resnet50 \
    --model-tag resnet50-imagenet-pruned
```
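Since the hang only appears after tens of epochs, one suspicion is that the two ranks fall out of step (for example, unequal numbers of batches per rank), leaving one rank stuck waiting in a collective. A small, hypothetical per-epoch check along these lines could rule that out (the helper name is mine, not part of `train.py`):

```python
import torch.distributed as dist


def assert_same_num_batches(train_loader) -> None:
    """Hypothetical sanity check: every rank must see the same number of
    batches, otherwise the ranks desynchronize and a collective times out.
    Assumes the default process group is initialized and the CUDA device
    for this rank has already been set."""
    counts = [None] * dist.get_world_size()
    dist.all_gather_object(counts, len(train_loader))
    if len(set(counts)) != 1:
        raise RuntimeError(f"Uneven batch counts across ranks: {counts}")
```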
The SparseML recipe defines gradual pruning of ResNet50 with a custom learning rate schedule over 100 epochs:
```yaml
training_modifiers:
  - !EpochRangeModifier
    start_epoch: 0
    end_epoch: 100

  - !LearningRateModifier
    start_epoch: 40.0
    end_epoch: 90.0
    lr_class: StepLR
    lr_kwargs:
      step_size: 6
      gamma: 0.6
    init_lr: 0.005

pruning_modifiers:
  - !GlobalMagnitudePruningModifier
    params: __ALL_PRUNABLE__
    init_sparsity: 0.05
    final_sparsity: 0.95
    start_epoch: 0
    end_epoch: 40
    update_frequency: 5
    inter_func: linear
    mask_type: unstructured

  - !GlobalMagnitudePruningModifier
    params: __ALL_PRUNABLE__
    init_sparsity: 0.95
    final_sparsity: 0.95
    start_epoch: 40
    end_epoch: 100
    update_frequency: 60
    inter_func: linear
    mask_type: unstructured
```
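For completeness, the recipe can also be loaded standalone to confirm it parses into the expected modifiers (a minimal sketch, assuming the recipe file is saved as `recipes/gm_resnet.yaml`):

```python
from sparseml.pytorch.optim import ScheduledModifierManager

# Load the recipe outside of train.py to confirm it parses and to inspect
# the resulting modifiers and their epoch ranges.
manager = ScheduledModifierManager.from_yaml("recipes/gm_resnet.yaml")
print(manager.modifiers)  # parsed epoch-range, LR, and pruning modifiers
```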
System specifications:
- Operating system: Debian GNU/Linux 11
- Hardware: 2x NVIDIA A100
- SparseML version: sparseml-nightly 1.2.0.20221005
- CUDA toolkit version: 11.3
What could be the cause of this deadlock?