
Multiple GPUs on training

Open ferdzo opened this issue 1 year ago • 3 comments

💡 Your Question

I have two available GPUs to try and train my model, but I have no luck finding the right training parameter to enable it.

train_params = {
    # ENABLING SILENT MODE
    "MultiGPUMode": "DP",
   # "resume":True,
    "average_best_models":True,
    "warmup_mode": "linear_epoch_step",
    "warmup_initial_lr": 1e-6,
    "lr_warmup_epochs": 3,
    "initial_lr": 5e-4,
    "lr_mode": "cosine",
    "cosine_final_lr_ratio": 0.1,
    "optimizer": "AdamW",
    "optimizer_params": {"weight_decay": 0.0001},
    "zero_weight_decay_on_bias_and_bn": True,
        "ema": True,
    "ema_params": {"decay": 0.9, "decay_type": "threshold"},
    "max_epochs": 20,
    "mixed_precision": True,
    "loss": PPYoloELoss(
        use_static_assigner=False,
        num_classes=config.NUM_CLASSES,
        reg_max=16
    ),
    "valid_metrics_list": [
        DetectionMetrics_050(
            score_thres=0.1,
            top_k_predictions=300,
            num_cls=config.NUM_CLASSES,
            normalize_targets=True,
            post_prediction_callback=PPYoloEPostPredictionCallback(
                score_threshold=0.01,
                nms_top_k=1000,
                max_predictions=300,
                nms_threshold=0.7
            )
        )
    ],
    "metric_to_watch": '[email protected]'
}

These are the training parameters I use. I've tried "MultiGPUMode": "DP", "MultiGPUMode": "AUTO", and also "multi_gpu": "AUTO", but after I start training I get the same message:

[2023-06-16] INFO - sg_trainer_utils.py - TRAINING PARAMETERS:
    - Mode:                         Single GPU
    - Number of GPUs:               1          (2 available on the machine)
    - Dataset size:                 9245       (len(train_set))
    - Batch size per GPU:           8          (batch_size)
    - Batch Accumulate:             1          (batch_accumulate)
    - Total batch size:             8          (num_gpus * batch_size)
    - Effective Batch size:         8          (num_gpus * batch_size * batch_accumulate)
    - Iterations per epoch:         1155       (len(train_loader))
    - Gradient updates per epoch:   1155       (len(train_loader) / batch_accumulate)

[2023-06-16] INFO - sg_trainer.py - Started training for 20 epochs (0/19)

There is a link in the documentation on how to use multiple GPUs, but it is empty.

What parameter should I enter for multiple GPU support?

Versions

No response

ferdzo avatar Jun 16 '23 17:06 ferdzo

You can refer to documentation to set up your config accordingly: https://docs.deci.ai/super-gradients/documentation/source/device.html#4-ddp-distributed-data-parallel

Please note that running DDP from Colab/Jupyter is theoretically possible but tricky, and it is not covered in the documentation. For DDP we suggest using regular Python scripts and recipes (you can of course have your own) and launching them in one of the following ways (see the sketch below the list):

  • Launch one of the bundled recipes, as shown here: https://github.com/Deci-AI/super-gradients/blob/master/src/super_gradients/recipes/roboflow_yolo_nas_s.yaml#L7
  • Have your own train.py file, similar to this one: https://github.com/Deci-AI/super-gradients/blob/master/src/super_gradients/train_from_recipe.py
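
A minimal sketch of such a standalone script, assuming a single node with two GPUs (the setup_device import path follows the device-setup docs linked above; model, dataloaders and train_params stand in for your own objects):

# my_train.py -- launched as a plain script: python my_train.py
from super_gradients import init_trainer
from super_gradients.training import Trainer, models
from super_gradients.training.utils.distributed_training_utils import setup_device

def main():
    init_trainer()
    # Request DDP over both GPUs *before* creating the Trainer and the dataloaders.
    setup_device(multi_gpu="DDP", num_gpus=2)

    trainer = Trainer(experiment_name="yolo_nas_ddp", ckpt_root_dir="checkpoints")
    model = models.get("yolo_nas_s", num_classes=config.NUM_CLASSES, pretrained_weights="coco")

    trainer.train(
        model=model,
        training_params=train_params,   # e.g. the dict from the question above
        train_loader=train_dataloader,  # your own dataloaders
        valid_loader=valid_dataloader,
    )

if __name__ == "__main__":
    main()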

Hope this answers your question. Let us know whether this resolves your issue.

BloodAxe avatar Jun 16 '23 18:06 BloodAxe

Thank you for the fast reply,

I will surely look into it.

ferdzo avatar Jun 16 '23 18:06 ferdzo

@ferdzo I have the same problem too. If you find a solution, please let me know. I will also update here if I find one.

harivinod3 avatar Jun 19 '23 13:06 harivinod3

Waiting for a simple answer. Please paste some code that works.

Thank you

helloansuman avatar Jun 22 '23 06:06 helloansuman

I have the same issue. Working with super-gradients version 3.6.0, I set up the trainer with the magic function setup_device(multi_gpu='DP', num_gpus=2) (roughly as in the sketch after the params below) and used the following train params:

train_params = {
    # ENABLING SILENT MODE
    "average_best_models": True,
    "warmup_mode": "LinearBatchLRWarmup",
    "warmup_initial_lr": 1e-4,
    "lr_warmup_steps": 100,
    "lr_warmup_epochs": 3,
    "initial_lr": 33e-3,
    "lr_mode": "cosine",
    "cosine_final_lr_ratio": 0.1,
    "optimizer": "SGD",
    "optimizer_params": {"weight_decay": 0.0005},
    "zero_weight_decay_on_bias_and_bn": True,
    "ema": True,
    "ema_params": {"decay": 0.937, "decay_type": "threshold"},
    "enable_qat": True,
    "cache_annotations": True,  
    # "dataset_statistics": True,
    "max_epochs": 30,
    "mixed_precision": True,
    "loss": PPYoloELoss(
        use_static_assigner=False,
        num_classes=config.NUM_CLASSES,
        # reg_max=config.BATCH_SIZE
    ),
    "valid_metrics_list": [
        DetectionMetrics_050(
            score_thres=0.1,
            top_k_predictions=300,
            num_cls=config.NUM_CLASSES,
            normalize_targets=True,
            post_prediction_callback=PPYoloEPostPredictionCallback(
                score_threshold=0.01,
                nms_top_k=1000,
                max_predictions=300,
                nms_threshold=0.7
            )
        )
    ],
    "metric_to_watch": '[email protected]',
    "greater_metric_to_watch_is_better": True,

}
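
For context, the device setup happens before the Trainer is created, roughly like this (a sketch; everything apart from setup_device and trainer.train uses placeholder names from my own project):

from super_gradients.training import Trainer
from super_gradients.training.utils.distributed_training_utils import setup_device

# Request DataParallel over both GPUs before constructing the Trainer.
setup_device(multi_gpu="DP", num_gpus=2)

trainer = Trainer(experiment_name="yolonas_l_dp", ckpt_root_dir="checkpoints")
trainer.train(
    model=model,                   # YoloNAS_L instance built earlier in the script
    training_params=train_params,  # the dict above
    train_loader=train_loader,
    valid_loader=valid_loader,
)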

but it seems like it is still not recognizing all the GPUs...

    - Mode:                         DATA_PARALLEL
    - Number of GPUs:               1          (2 available on the machine)
    - Full dataset size:            19970      (len(train_set))
    - Batch size per GPU:           48         (batch_size)
    - Batch Accumulate:             1          (batch_accumulate)
    - Total batch size:             48         (num_gpus * batch_size)
    - Effective Batch size:         48         (num_gpus * batch_size * batch_accumulate)
    - Iterations per epoch:         416        (len(train_loader))
    - Gradient updates per epoch:   416        (len(train_loader) / batch_accumulate)
    - Model: YoloNAS_L  (66.91M parameters, 66.91M optimized)
    - Learning Rates and Weight Decays:
      - default: (66.91M parameters). LR: 0.033 (66.91M parameters) WD: 0.0, (84.69K parameters), WD: 0.0005, (66.82M parameters)

and getting this error:

Traceback (most recent call last):
  File "/home/yolonas/yolonas_project_template.py", line 153, in <module>
    trainer.train(model=model,
  File "/usr/local/lib/python3.10/dist-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 1521, in train
    train_metrics_tuple = self._train_epoch(context=context, silent_mode=silent_mode)
  File "/usr/local/lib/python3.10/dist-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 500, in _train_epoch
    outputs = self.net(inputs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 186, in forward
    return self.gather(outputs, self.output_device)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 203, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/scatter_gather.py", line 104, in gather
    res = gather_map(outputs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/scatter_gather.py", line 99, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/scatter_gather.py", line 99, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/scatter_gather.py", line 99, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  [Previous line repeated 1 more time]
TypeError: 'int' object is not iterable

Any ideas?

motidil avatar Feb 21 '24 16:02 motidil

@motidil I'm also having the same problem, did you find a fix?

icaroryan avatar Mar 02 '24 04:03 icaroryan

@icaroryan unfortunately not yet...

motidil avatar Mar 03 '24 09:03 motidil