Issue passing model into accelerator.prepare() --> Cannot convert to int without overflow
Info
Hi!
I am getting an error (shown below) when passing my model into accelerator.prepare() after I have already trained it on one participant's worth of data. I've attached the start of the code where the error occurs.
The train() method (called from main() below) returns the model once it is done with the current participant's data. I am running multi-node, multi-GPU training on SLURM.
I tried a different approach where I only passed the model into accelerator.prepare() once, when it was first loaded; however, I then got an NCCL timeout error when starting the next participant.
I'm open to suggestions for reconfiguring my training approach (each participant is roughly 4 GB of data). Essentially, my objective is to train the model on one participant's dataset at a time: each new dataset still needs to be distributed across the GPUs, while the same model keeps training throughout.
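For context, here is a sketch of one restructuring I've been considering (untested): prepare the model and optimizer once, and only pass each participant's new DataLoader through accelerator.prepare() inside the loop. The helper names (Config, load_models, load_and_preprocess_eeg_data, create_EEG_dataset, train) are the same placeholders used in the reproduction below.

```python
from accelerate import Accelerator
from accelerate.utils import set_seed

def main():
    config = Config()
    accelerator = Accelerator()
    set_seed(42)

    # Load and wrap the persistent objects exactly once.
    model, noise_scheduler, optimizer = load_models(config, accelerator)
    model, optimizer = accelerator.prepare(model, optimizer)

    for participant in config.participants:
        train_dataset, test_dataset, tmp_path = load_and_preprocess_eeg_data(participant)
        train_loader, test_loader = create_EEG_dataset(config.train_batch_size, train_dataset, test_dataset)

        # Only the per-participant data loaders (and, if needed, a fresh lr scheduler)
        # go through prepare() on each iteration; the model stays wrapped.
        train_loader, test_loader = accelerator.prepare(train_loader, test_loader)

        model = train(train_loader, accelerator, model, optimizer, ...)  # remaining arguments as in the full script
```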
Here is the error (each rank prints the same traceback):
```
    main()
  File "/project/6037638/eobrie22/v2/ldm_train_v4.py", line 77, in main
    train_loader,progress_bar,global_step,lr_scheduler,max_train_steps,test_loader, model = prepare_training(config,train_loader,accelerator,test_loader,optimizer,model)
  File "/project/6037638/eobrie22/v2/ldm_train_v4.py", line 112, in prepare_training
    train_loader, lr_scheduler,test_loader,optimizer,model = accelerator.prepare(train_loader, lr_scheduler,test_loader,optimizer,model)
  File "/project/6037638/eobrie22/v2/myenv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1213, in prepare
    result = tuple(
  File "/project/6037638/eobrie22/v2/myenv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1214, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/project/6037638/eobrie22/v2/myenv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1094, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/project/6037638/eobrie22/v2/myenv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1349, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/project/6037638/eobrie22/v2/myenv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 798, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/project/6037638/eobrie22/v2/myenv/lib/python3.10/site-packages/torch/distributed/utils.py", line 263, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: value cannot be converted to type int without overflow
[2024-02-27 17:59:57,054] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 126965 closing signal SIGTERM
```
Reproduction
```python
def main():
    config = Config()
    # Set up the accelerator
    accelerator = Accelerator()
    set_seed(42)
    logger.info(accelerator.state)
    # Load the models
    model, noise_scheduler, optimizer = load_models(config, accelerator)
    for participant in config.participants:
        # Set up the EEG dataset
        with accelerator.main_process_first():
            if accelerator.is_main_process:
                logger.info(f"Training participant {participant}")
            train_dataset, test_dataset, tmp_path = load_and_preprocess_eeg_data(participant)
            train_loader, test_loader = create_EEG_dataset(config.train_batch_size, train_dataset, test_dataset)
        if accelerator.is_main_process:
            logger.info(f"Data is {len(train_loader)} batches long")
        # Prepare the training
        train_loader, progress_bar, global_step, lr_scheduler, max_train_steps, test_loader, model = prepare_training(
            config, train_loader, accelerator, test_loader, optimizer, model
        )
        # Train the model
        logger.info(f"Rank: {accelerator.state.process_index}; Training started with {len(train_loader)} batches")
        model = train(
            train_loader, accelerator, model, optimizer, lr_scheduler, config, global_step,
            progress_bar, participant, test_loader, noise_scheduler, max_train_steps
        )


def prepare_training(config, train_loader, accelerator, test_loader, optimizer, model):
    num_update_steps_per_epoch = math.ceil(len(train_loader) / config.gradient_accumulation_steps)
    max_train_steps = config.num_train_epochs * num_update_steps_per_epoch
    num_warmup_steps = 0.1 * max_train_steps  # 10% of max_train_steps as an example
    num_training_steps = max_train_steps
    global_step = 0
    progress_bar = tqdm(range(0, max_train_steps), initial=global_step, desc="Steps", disable=not accelerator.is_local_main_process)
    # Set up the learning rate scheduler
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=num_warmup_steps * accelerator.num_processes,
        num_training_steps=num_training_steps * accelerator.num_processes,
    )
    # Prepare everything for the accelerator so that it can be used for distributed training
    train_loader, lr_scheduler, test_loader, optimizer, model = accelerator.prepare(train_loader, lr_scheduler, test_loader, optimizer, model)
    # weight_dtype = torch.float32
    # Move the models to the accelerator device
    logger.info(f"Model and Data prepared on {accelerator.state.process_index}")
    return train_loader, progress_bar, global_step, lr_scheduler, max_train_steps, test_loader, model
```
Expected behavior
I expected accelerator.prepare() to take the model that is currently being trained and prepare it again with the new data. The issue is potentially that the model is already wrapped by the accelerator. My model is fairly complex (a UNet, a VAE, an encoder, and a few extra layers built into it), and when I tried FSDP I got an error that the Accelerator could not find the transformer layer to wrap.
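To be concrete, this is roughly what I mean by unwrapping first (a minimal sketch, not my exact code; I'm assuming accelerator.unwrap_model() is the right way to strip the DDP wrapper that a previous prepare() call added):

```python
# Sketch only: strip the DistributedDataParallel wrapper left by the previous
# prepare() call before handing the model back to prepare() for the new dataset.
unwrapped_model = accelerator.unwrap_model(model)
train_loader, lr_scheduler, test_loader, optimizer, model = accelerator.prepare(
    train_loader, lr_scheduler, test_loader, optimizer, unwrapped_model
)
```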
RE: This is the error I get when I unwrap the model and then try to prepare it again with a new dataset:
```
  return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: value cannot be converted to type int without overflow
```
Any help would be greatly appreciated