[BUG] Issue using optimize_hyperparameters with PyTorch DDP
Hi community,
I have been stuck on this issue for some time now and would greatly appreciate any help! I am trying to run the optimize_hyperparameters function across 2 A100 GPUs using the PyTorch DDP strategy.
When I run this I get the following error: RuntimeError: DDP expects same model across all ranks, but Rank 0 has 160 params, while rank 1 has inconsistent 137 params.
I have tried setting the seed across ranks, but no luck. Has anyone experienced this issue, or does anyone have an example of using this function to train a TFT with DDP?
I am using the latest package versions and training on an Azure VM. Everything below runs once I trigger the train_model function.
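For reference, this is roughly the kind of seeding I mean; a minimal sketch assuming pytorch_lightning's seed_everything, not my exact code:

import pytorch_lightning as pl

# Seed Python, NumPy and torch on every rank (and dataloader workers too),
# so random initialisation should not differ between processes.
pl.seed_everything(42, workers=True)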
import time

import torch
from pytorch_forecasting.models.temporal_fusion_transformer.tuning import optimize_hyperparameters
from pytorch_lightning.strategies import DDPStrategy

# constants and logger come from my own project modules


def prepare_data(data_prep_folder):
    # Load in training and validation datasets
    training = torch.load(f"{data_prep_folder}/{constants.TRAIN_DATASET_FILE_NAME}")
    validation = torch.load(f"{data_prep_folder}/{constants.VALIDATION_DATASET_FILE_NAME}")
    logger.info(f"Training set loaded with length {len(training)}.")
    logger.info(f"Validation set loaded with length {len(validation)}.")

    # Create dataloaders
    train_dataloader = training.to_dataloader(
        train=True,
        batch_size=128,
        num_workers=47,
        pin_memory=True,
    )
    val_dataloader = validation.to_dataloader(
        train=False,
        batch_size=128,
        num_workers=47,
        pin_memory=True,
    )
    logger.info("Dataloaders created with batch size 128 and 47 workers.")
    return train_dataloader, val_dataloader
def hyperparameter_tuner(train_dataloader, val_dataloader, model_train_folder):
    # Start time
    start_time = time.time()
    logger.info("Starting hyperparameter tuning...")

    # Create study
    study = optimize_hyperparameters(
        train_dataloader,
        val_dataloader,
        model_path=model_train_folder,
        n_trials=2,
        max_epochs=30,
        gradient_clip_val_range=(0.01, 1.0),
        hidden_size_range=(8, 128),
        hidden_continuous_size_range=(8, 128),
        attention_head_size_range=(1, 4),
        learning_rate_range=(0.001, 0.1),
        dropout_range=(0.1, 0.3),
        trainer_kwargs=dict(
            accelerator="gpu",
            strategy=DDPStrategy(),
            devices="auto",
            limit_train_batches=10,
        ),
        reduce_on_plateau_patience=4,
        use_learning_rate_finder=False,
    )
    logger.info("Hyperparameter tuning finished.")

    # Get best parameters
    best_params = study.best_trial.params
    logger.info(f"Best trial parameters: {best_params}")

    training_time = time.time() - start_time
    hours, remainder = divmod(training_time, 3600)
    minutes, seconds = divmod(remainder, 60)
    logger.info(f"Tuning took {int(hours)} hours, {int(minutes)} minutes, and {int(seconds)} seconds.")
    return best_params
Can anyone help here? How can I use DDP with the optimize_hyperparameters function?
Potentially related to the Windows failures reported here: https://github.com/jdb78/pytorch-forecasting/issues/1623
Can you kindly paste the full output of pip list from your Python environment, and also let us know your operating system and Python version?
Same issue here; I think it's caused by using more than one device. Might be related: https://discuss.pytorch.org/t/cant-distribute-data-to-all-gpus-with-ddp/184749/4
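One workaround to try (just a sketch of restricting the original call to a single GPU, so each Optuna trial builds only one model and there is no rank mismatch; other arguments as in the original post):

from pytorch_forecasting.models.temporal_fusion_transformer.tuning import optimize_hyperparameters

# Sketch: same optimize_hyperparameters call as above, but with the trainer
# limited to one device instead of strategy=DDPStrategy(), devices="auto".
study = optimize_hyperparameters(
    train_dataloader,
    val_dataloader,
    model_path=model_train_folder,
    n_trials=2,
    max_epochs=30,
    trainer_kwargs=dict(
        accelerator="gpu",
        devices=1,  # single GPU per trial avoids the DDP parameter-count mismatch
        limit_train_batches=10,
    ),
    reduce_on_plateau_patience=4,
    use_learning_rate_finder=False,
)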
Try adding "NCCL_P2P_DISABLE=1" to your command. I added it to mine, and "CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NCCL_P2P_DISABLE=1 torchrun --nnode=1 --nproc_per_node=8 --master_port=12345 run.py" works.
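If you launch from Python rather than the shell, a sketch of the equivalent (the variable must be set before torch initialises NCCL):

import os

# Disable NCCL peer-to-peer transfers for this process and its children.
os.environ["NCCL_P2P_DISABLE"] = "1"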