super-gradients
Multiple GPUs on training
💡 Your Question
I have two GPUs available for training my model, but I have had no luck finding the right training parameter to enable them.
train_params = {
    # ENABLING SILENT MODE
    "MultiGPUMode": "DP",
    # "resume": True,
    "average_best_models": True,
    "warmup_mode": "linear_epoch_step",
    "warmup_initial_lr": 1e-6,
    "lr_warmup_epochs": 3,
    "initial_lr": 5e-4,
    "lr_mode": "cosine",
    "cosine_final_lr_ratio": 0.1,
    "optimizer": "AdamW",
    "optimizer_params": {"weight_decay": 0.0001},
    "zero_weight_decay_on_bias_and_bn": True,
    "ema": True,
    "ema_params": {"decay": 0.9, "decay_type": "threshold"},
    "max_epochs": 20,
    "mixed_precision": True,
    "loss": PPYoloELoss(
        use_static_assigner=False,
        num_classes=config.NUM_CLASSES,
        reg_max=16
    ),
    "valid_metrics_list": [
        DetectionMetrics_050(
            score_thres=0.1,
            top_k_predictions=300,
            num_cls=config.NUM_CLASSES,
            normalize_targets=True,
            post_prediction_callback=PPYoloEPostPredictionCallback(
                score_threshold=0.01,
                nms_top_k=1000,
                max_predictions=300,
                nms_threshold=0.7
            )
        )
    ],
    "metric_to_watch": 'mAP@0.50'
}
These are the training parameters I use. I've tried "MultiGPUMode": "DP", "MultiGPUMode": "AUTO", and also "multi_gpu": "AUTO", but after I start training I get the same message:
[2023-06-16] INFO - sg_trainer_utils.py - TRAINING PARAMETERS:
- Mode: Single GPU
- Number of GPUs: 1 (2 available on the machine)
- Dataset size: 9245 (len(train_set))
- Batch size per GPU: 8 (batch_size)
- Batch Accumulate: 1 (batch_accumulate)
- Total batch size: 8 (num_gpus * batch_size)
- Effective Batch size: 8 (num_gpus * batch_size * batch_accumulate)
- Iterations per epoch: 1155 (len(train_loader))
- Gradient updates per epoch: 1155 (len(train_loader) / batch_accumulate)
[2023-06-16] INFO - sg_trainer.py - Started training for 20 epochs (0/19)
There is a link in the documentation on how to use multiple GPUs, but it is empty.
What parameter should I enter for multiple GPU support?
Versions
No response
You can refer to documentation to set up your config accordingly: https://docs.deci.ai/super-gradients/documentation/source/device.html#4-ddp-distributed-data-parallel
Please note that running DDP from Colab/Jupyter is theoretically possible but tricky, and it is not covered in the documentation. For DDP we suggest using regular Python scripts and recipes (you can of course have your own) and running them in one of the following ways:
- Launch one of the existing recipes, e.g.: https://github.com/Deci-AI/super-gradients/blob/master/src/super_gradients/recipes/roboflow_yolo_nas_s.yaml#L7
- Write your own train.py file similar to this one: https://github.com/Deci-AI/super-gradients/blob/master/src/super_gradients/train_from_recipe.py
Hope this answers your question. Let us know whether this resolves your issue.
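For reference, here is a minimal sketch of what such a standalone train.py could look like. The import path for setup_device, the experiment name, and the checkpoint directory below are assumptions; the model, dataloaders, and train_params would be built exactly as in the snippets above.

# train.py -- minimal DDP sketch (import path and names are assumptions); run with: python train.py
from super_gradients import Trainer
from super_gradients.training.utils.distributed_training_utils import setup_device

def main():
    # Request DDP over both GPUs BEFORE creating the Trainer; this is where the
    # multi-GPU setting lives, not inside train_params.
    setup_device(multi_gpu="DDP", num_gpus=2)

    trainer = Trainer(experiment_name="yolo_nas_ddp_example", ckpt_root_dir="checkpoints")

    # Build model, train_loader, valid_loader and train_params here as in the
    # snippets above (omitted for brevity), then:
    # trainer.train(model=model,
    #               training_params=train_params,
    #               train_loader=train_loader,
    #               valid_loader=valid_loader)

if __name__ == "__main__":
    main()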
Thank you for the fast reply,
I will surely look into it.
@ferdzo I have the same problem too. If you find a solution, please let me know. I will also update here if I find one.
Waiting for a simple answer. Please paste some code that works.
Thank you
I have the same issue... I'm working with super-gradients version 3.6.0.
I set up the trainer with the magic function setup_device(multi_gpu='DP', num_gpus=2)
and use the following train params:
train_params = {
    # ENABLING SILENT MODE
    "average_best_models": True,
    "warmup_mode": "LinearBatchLRWarmup",
    "warmup_initial_lr": 1e-4,
    "lr_warmup_steps": 100,
    "lr_warmup_epochs": 3,
    "initial_lr": 33e-3,
    "lr_mode": "cosine",
    "cosine_final_lr_ratio": 0.1,
    "optimizer": "SGD",
    "optimizer_params": {"weight_decay": 0.0005},
    "zero_weight_decay_on_bias_and_bn": True,
    "ema": True,
    "ema_params": {"decay": 0.937, "decay_type": "threshold"},
    "enable_qat": True,
    "cache_annotations": True,
    # "dataset_statistics": True,
    "max_epochs": 30,
    "mixed_precision": True,
    "loss": PPYoloELoss(
        use_static_assigner=False,
        num_classes=config.NUM_CLASSES,
        # reg_max=config.BATCH_SIZE
    ),
    "valid_metrics_list": [
        DetectionMetrics_050(
            score_thres=0.1,
            top_k_predictions=300,
            num_cls=config.NUM_CLASSES,
            normalize_targets=True,
            post_prediction_callback=PPYoloEPostPredictionCallback(
                score_threshold=0.01,
                nms_top_k=1000,
                max_predictions=300,
                nms_threshold=0.7
            )
        )
    ],
    "metric_to_watch": 'mAP@0.50',
    "greater_metric_to_watch_is_better": True,
}
but it seems like it still isn't recognizing all the GPUs:
- Mode: DATA_PARALLEL
- Number of GPUs: 1 (2 available on the machine)
- Full dataset size: 19970 (len(train_set))
- Batch size per GPU: 48 (batch_size)
- Batch Accumulate: 1 (batch_accumulate)
- Total batch size: 48 (num_gpus * batch_size)
- Effective Batch size: 48 (num_gpus * batch_size * batch_accumulate)
- Iterations per epoch: 416 (len(train_loader))
- Gradient updates per epoch: 416 (len(train_loader) / batch_accumulate)
- Model: YoloNAS_L (66.91M parameters, 66.91M optimized)
- Learning Rates and Weight Decays:
- default: (66.91M parameters). LR: 0.033 (66.91M parameters) WD: 0.0, (84.69K parameters), WD: 0.0005, (66.82M parameters)
and getting this error:
Traceback (most recent call last):
File "/home/yolonas/yolonas_project_template.py", line 153, in <module>
trainer.train(model=model,
File "/usr/local/lib/python3.10/dist-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 1521, in train
train_metrics_tuple = self._train_epoch(context=context, silent_mode=silent_mode)
File "/usr/local/lib/python3.10/dist-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 500, in _train_epoch
outputs = self.net(inputs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 186, in forward
return self.gather(outputs, self.output_device)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 203, in gather
return gather(outputs, output_device, dim=self.dim)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/scatter_gather.py", line 104, in gather
res = gather_map(outputs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/scatter_gather.py", line 99, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/scatter_gather.py", line 99, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/scatter_gather.py", line 99, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
[Previous line repeated 1 more time]
TypeError: 'int' object is not iterable
any idea?
@motidil I'm also having the same problem, did you find a fix?
@icaroryan unfortunately not yet...
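A note on the traceback above, as a hedged reading rather than a confirmed diagnosis: torch.nn.DataParallel gathers the per-GPU outputs back onto one device after every forward pass, and judging by the traceback its gather_map helper is choking on an output element that is a plain Python int, which it cannot gather across replicas. The DDP setup suggested earlier in this thread avoids that code path entirely, since DDP runs one process per GPU and never gathers model outputs. Under that assumption, the minimal change to the setup shown above would be:

# Hedged sketch: switch from DP to DDP before building the Trainer (the import
# path is an assumption; everything else stays as in the setup shown above).
from super_gradients.training.utils.distributed_training_utils import setup_device

setup_device(multi_gpu="DDP", num_gpus=2)  # instead of multi_gpu='DP'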