CUDA Out of Memory Error on a 32GB GPU when Running trainer.py for Fine-Tuning
Description:
Hello, I encountered a torch.cuda.OutOfMemoryError while fine-tuning a model using trainer.py. My setup includes only a single GPU with 32GB of memory, and the error occurs even at the beginning of training.
My modified trainer.py:
import argparse, os, sys, datetime
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # note: hard-codes a single visible GPU (GPU 0)
from omegaconf import OmegaConf
from transformers import logging as transf_logging
import pytorch_lightning as pl
from pytorch_lightning import seed_everything
from pytorch_lightning.trainer import Trainer
import torch
sys.path.insert(1, os.path.join(sys.path[0], '..'))
from utils.utils import instantiate_from_config
from utils_train import get_trainer_callbacks, get_trainer_logger, get_trainer_strategy
from utils_train import set_logger, init_workspace, load_checkpoints
def get_parser(**parser_kwargs):
    parser = argparse.ArgumentParser(**parser_kwargs)
    parser.add_argument("--seed", "-s", type=int, default=20230211, help="seed for seed_everything")
    parser.add_argument("--name", "-n", type=str, default="training_1024_v1.0", help="experiment name, as saving folder")
    # parser.add_argument("--base", "-b", nargs="*", metavar="base_config.yaml", help="paths to base configs. Loaded from left-to-right. "
    #                     "Parameters can be overwritten or added with command-line options of the form `--key value`.", default=list())
    parser.add_argument(
        "--base",
        "-b",
        nargs="*",
        metavar="base_config.yaml",
        help=(
            "Paths to base configs. Loaded from left-to-right. "
            "Parameters can be overwritten or added with command-line options of the form `--key value`."
        ),
        default=["/home/cherry2025/DynamiCrafter/configs/training_1024_v1.0/config.yaml"]
    )
    parser.add_argument("--train", "-t", action='store_true', default=True, help='train')
    parser.add_argument("--val", "-v", action='store_true', default=False, help='val')
    parser.add_argument("--test", action='store_true', default=False, help='test')
    parser.add_argument("--logdir", "-l", type=str, default="/home/cherry2025/DynamiCrafter/train_check", help="directory for logging data")
    parser.add_argument("--auto_resume", action='store_true', default=False, help="resume from full-info checkpoint")
    parser.add_argument("--auto_resume_weight_only", action='store_true', default=False, help="resume from weight-only checkpoint")
    parser.add_argument("--debug", "-d", action='store_true', default=False, help="enable post-mortem debugging")
    return parser
def get_nondefault_trainer_args(args):
    parser = argparse.ArgumentParser()
    parser = Trainer.add_argparse_args(parser)
    default_trainer_args = parser.parse_args([])
    return sorted(k for k in vars(default_trainer_args) if getattr(args, k) != getattr(default_trainer_args, k))
if __name__ == "__main__":
    now = datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S")

    # add
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    global_rank = int(os.environ.get('RANK', 0))
    num_rank = int(os.environ.get('WORLD_SIZE', 0))
    # end
    # local_rank = int(os.environ.get('LOCAL_RANK'))
    # global_rank = int(os.environ.get('RANK'))
    # num_rank = int(os.environ.get('WORLD_SIZE'))

    parser = get_parser()
    ## Extends existing argparse by default Trainer attributes
    parser = Trainer.add_argparse_args(parser)
    args, unknown = parser.parse_known_args()
    ## disable transformer warning
    transf_logging.set_verbosity_error()
    seed_everything(args.seed)

    ## yaml configs: "model" | "data" | "lightning"
    configs = [OmegaConf.load(cfg) for cfg in args.base]
    # # add
    # configs = []
    # for cfg in args.base:
    #     config = OmegaConf.load(cfg)
    #     configs.append(config)
    cli = OmegaConf.from_dotlist(unknown)
    config = OmegaConf.merge(*configs, cli)
    lightning_config = config.pop("lightning", OmegaConf.create())
    trainer_config = lightning_config.get("trainer", OmegaConf.create())

    ## setup workspace directories
    workdir, ckptdir, cfgdir, loginfo = init_workspace(args.name, args.logdir, config, lightning_config, global_rank)
    logger = set_logger(logfile=os.path.join(loginfo, 'log_%d:%s.txt'%(global_rank, now)))
    logger.info("@lightning version: %s [>=1.8 required]"%(pl.__version__))

    ## MODEL CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Configing Model *****")
    config.model.params.logdir = workdir
    model = instantiate_from_config(config.model)
    ## load checkpoints
    model = load_checkpoints(model, config.model)

    ## register_schedule again to make ZTSNR work
    if model.rescale_betas_zero_snr:
        model.register_schedule(given_betas=model.given_betas, beta_schedule=model.beta_schedule, timesteps=model.timesteps,
                                linear_start=model.linear_start, linear_end=model.linear_end, cosine_s=model.cosine_s)

    ## update trainer config
    for k in get_nondefault_trainer_args(args):
        trainer_config[k] = getattr(args, k)

    # num_nodes = trainer_config.num_nodes
    # ngpu_per_node = trainer_config.devices
    # add
    num_nodes = 1
    ngpu_per_node = 1
    logger.info(f"Running on {num_rank}={num_nodes}x{ngpu_per_node} GPUs")

    ## setup learning rate
    base_lr = config.model.base_learning_rate
    bs = config.data.params.batch_size
    if getattr(config.model, 'scale_lr', True):
        model.learning_rate = num_rank * bs * base_lr
    else:
        model.learning_rate = base_lr

    ## DATA CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Configing Data *****")
    data = instantiate_from_config(config.data)
    data.setup()
    for k in data.datasets:
        logger.info(f"{k}, {data.datasets[k].__class__.__name__}, {len(data.datasets[k])}")

    ## TRAINER CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Configing Trainer *****")
    if "accelerator" not in trainer_config:
        trainer_config["accelerator"] = "gpu"

    ## setup trainer args: pl-logger and callbacks
    trainer_kwargs = dict()
    trainer_kwargs["num_sanity_val_steps"] = 0
    logger_cfg = get_trainer_logger(lightning_config, workdir, args.debug)
    trainer_kwargs["logger"] = instantiate_from_config(logger_cfg)

    ## setup callbacks
    callbacks_cfg = get_trainer_callbacks(lightning_config, config, workdir, ckptdir, logger)
    trainer_kwargs["callbacks"] = [instantiate_from_config(callbacks_cfg[k]) for k in callbacks_cfg]
    strategy_cfg = get_trainer_strategy(lightning_config)
    trainer_kwargs["strategy"] = strategy_cfg if type(strategy_cfg) == str else instantiate_from_config(strategy_cfg)
    trainer_kwargs['precision'] = lightning_config.get('precision', 32)
    trainer_kwargs["sync_batchnorm"] = False

    ## trainer config: others
    trainer_args = argparse.Namespace(**trainer_config)
    trainer = Trainer.from_argparse_args(trainer_args, **trainer_kwargs)

    ## allow checkpointing via USR1
    def melk(*args, **kwargs):
        ## run all checkpoint hooks
        if trainer.global_rank == 0:
            print("Summoning checkpoint.")
            ckpt_path = os.path.join(ckptdir, "last_summoning.ckpt")
            trainer.save_checkpoint(ckpt_path)

    def divein(*args, **kwargs):
        if trainer.global_rank == 0:
            import pudb
            pudb.set_trace()

    import signal
    signal.signal(signal.SIGUSR1, melk)
    signal.signal(signal.SIGUSR2, divein)

    ## Running LOOP >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Running the Loop *****")
    if args.train:
        try:
            if "strategy" in lightning_config and lightning_config['strategy'].startswith('deepspeed'):
                logger.info("<Training in DeepSpeed Mode>")
                ## deepspeed
                if trainer_kwargs['precision'] == 16:
                    with torch.cuda.amp.autocast():
                        trainer.fit(model, data)
                else:
                    trainer.fit(model, data)
            else:
                logger.info("<Training in DDPSharded Mode>")  ## this is default
                ## ddpsharded
                trainer.fit(model, data)
        except Exception:
            # melk()
            raise

    # if args.val:
    #     trainer.validate(model, data)
    # if args.test or not trainer.interrupted:
    #     trainer.test(model, data)
Error Message:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 360.00 MiB (GPU 0; 31.73 GiB total capacity; 29.84 GiB already allocated; 80.19 MiB free; 30.33 GiB reserved in total by PyTorch).
If reserved memory is >> allocated memory, try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
This error occurred at:
Epoch 0: 0%| | 1/80000 [00:53<1195:43:13, 53.81s/it, loss=0.347, v_num=6, train/loss_simple_step=0.347, train/loss_vlb_step=0.347, train/loss_step=0.347]
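For reference, the allocator's hint about max_split_size_mb is applied through the PYTORCH_CUDA_ALLOC_CONF environment variable, which has to be set before the first CUDA allocation. A minimal sketch (the 128 MiB value is only an illustration, not a recommendation from the DynamiCrafter authors):

# Set before torch touches the GPU, e.g. exported in the shell or at the very top of trainer.py.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # illustrative value, must be set before the first CUDA allocation
import torch  # import torch only after the variable is in place

This only mitigates fragmentation, though; in the trace above reserved (30.33 GiB) is barely larger than allocated (29.84 GiB), so the model itself is simply close to the 32GB limit.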
Steps to Reproduce:
- Run trainer.py with a single 32GB GPU.
- Start the fine-tuning process.
- The error occurs shortly after the first epoch begins.
My Setup:
- GPU: Single 32GB GPU
Solutions Tried:
- Reduced the batch size to minimize memory usage.
Thank you for your assistance!
Hi What is your bs and your config.yaml? And can you try with more GPUs?
Hello, thank you for your response.
- Batch Size: My batch size is set to 1.
- Config File (config.yaml): Below is the content of my config file:

model:
  pretrained_checkpoint: /home/cherry2025/DynamiCrafter/checkpoints/dynamicrafter_1024_v1/model.ckpt
  base_learning_rate: 1.0e-05
  scale_lr: False
  target: lvdm.models.ddpm3d.LatentVisualDiffusion
  params:
    rescale_betas_zero_snr: True
    parameterization: "v"
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: video
    cond_stage_key: caption
    cond_stage_trainable: False
    image_proj_model_trainable: True
    conditioning_key: hybrid
    image_size: [72, 128]
    channels: 4
    scale_by_std: False
    scale_factor: 0.18215
    use_ema: False
    uncond_prob: 0.05
    uncond_type: 'empty_seq'
    rand_cond_frame: true
    use_dynamic_rescale: true
    base_scale: 0.3
    fps_condition_type: 'fps'
    perframe_ae: True

    unet_config:
      target: lvdm.modules.networks.openaimodel3d.UNetModel
      params:
        in_channels: 8
        out_channels: 4
        model_channels: 320
        attention_resolutions: [4, 2, 1]
        num_res_blocks: 2
        channel_mult: [1, 2, 4, 4]
        dropout: 0.1
        num_head_channels: 64
        transformer_depth: 1
        context_dim: 1024
        use_linear: true
        use_checkpoint: True
        temporal_conv: True
        temporal_attention: True
        temporal_selfatt_only: true
        use_relative_position: false
        use_causal_attention: False
        temporal_length: 16
        addition_attention: true
        image_cross_attention: true
        default_fs: 10
        fs_condition: true

    first_stage_config:
      target: lvdm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: True
          z_channels: 4
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [1, 2, 4, 4]
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity

    cond_stage_config:
      target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
      params:
        freeze: true
        layer: "penultimate"

    img_cond_stage_config:
      target: lvdm.modules.encoders.condition.FrozenOpenCLIPImageEmbedderV2
      params:
        freeze: true

    image_proj_stage_config:
      target: lvdm.modules.encoders.resampler.Resampler
      params:
        dim: 1024
        depth: 4
        dim_head: 64
        heads: 12
        num_queries: 16
        embedding_dim: 1280
        output_dim: 1024
        ff_mult: 4
        video_length: 16

data:
  target: utils_data.DataModuleFromConfig
  params:
    batch_size: 1
    num_workers: 2
    wrap: false
    train:
      target: lvdm.data.webvid.WebVid
      params:
        data_dir: "/home/cherry2025/DynamiCrafter/train_data/img1"
        meta_path: "/home/cherry2025/DynamiCrafter/train_data/webvid10m_mini_80k.csv"
        video_length: 16
        frame_stride: 6
        load_raw_resolution: true
        resolution: [1024]
        spatial_transform: resize_center_crop
        random_fs: true

lightning:
  precision: 16
  trainer:
    benchmark: True
    accumulate_grad_batches: 2
    max_steps: 100000
    log_every_n_steps: 50
    val_check_interval: 0.5
    gradient_clip_algorithm: 'norm'
    gradient_clip_val: 0.5
  callbacks:
    model_checkpoint:
      target: pytorch_lightning.callbacks.ModelCheckpoint
      params:
        every_n_train_steps: 9000
        filename: "{epoch}-{step}"
        save_weights_only: True
    metrics_over_trainsteps_checkpoint:
      target: pytorch_lightning.callbacks.ModelCheckpoint
      params:
        filename: '{epoch}-{step}'
        save_weights_only: True
        every_n_train_steps: 10000
    batch_logger:
      target: callbacks.ImageLogger
      params:
        batch_frequency: 500
        to_local: False
        max_images: 8
        log_images_kwargs:
          ddim_steps: 50
          unconditional_guidance_scale: 7.5
          timestep_spacing: uniform_trailing
          guidance_rescale: 0.7

- Dataset: My training dataset consists of 80,000 images.
- GPU Setup: Currently, I have access to a server with three 24GB GPUs.
Would this configuration be sufficient for the training process with this dataset size?
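One rough way to gauge this is to print the parameter counts right after the model is built in trainer.py (a sketch; model here is the object returned by instantiate_from_config(config.model)), keeping in mind that Adam-style optimizers add roughly 8 bytes of optimizer state per trainable fp32 parameter on top of weights, gradients, and activations:

# Sketch: run right after `model = instantiate_from_config(config.model)` in trainer.py.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable / 1e9:.2f}B of {total / 1e9:.2f}B total")
# With Adam, the optimizer states alone take roughly trainable * 8 bytes in fp32,
# before counting the weights, gradients, and activations themselves.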
Thank you for your assistance!
Hi. I forget the exact hardware requirement for fine-tuning DynamiCrafter-1024 model. It may be hard for a single 32GB GPU... Multiple 32GB GPUs would be possible.
Thank you!
Hi,
I'm encountering a CUDA out of memory error while fine-tuning my model, even though I'm using 4 GPUs, each with 32GB of memory. Here’s the error message:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 360.00 MiB (GPU 1; 31.73 GiB total capacity; 29.01 GiB already allocated; 102.19 MiB free; 31.14 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "./main/trainer.py", line 396, in <module>
trainer.fit(model, data)
Configuration:
- Using 4 GPUs (each with 32GB memory).
- Running a fine-tuning task with PyTorch Lightning.
What I’ve Tried:
- Reducing batch_size: lowering the batch size to reduce memory usage; batch_size is 1.
Despite these efforts, the error persists. Any insights into why this might be happening, or additional suggestions to troubleshoot?
Hi Did you try to fine-tune the DynamiCrafter-512 model and check the memory usage?
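To check the per-GPU memory usage during training, a minimal, hypothetical Lightning callback like the one below (PeakMemoryLogger is not part of the repo) could be appended to trainer_kwargs["callbacks"] in trainer.py. Note that with plain DDP every GPU holds a full model replica, so adding GPUs does not by itself reduce per-device memory:

import torch
import pytorch_lightning as pl

class PeakMemoryLogger(pl.Callback):
    # Hypothetical helper: logs the peak CUDA memory (GiB) seen so far on each rank.
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if torch.cuda.is_available():
            peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
            pl_module.log("peak_mem_gb", peak_gb, prog_bar=True, sync_dist=False)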
Hi, I try to
Hello. The issue persists; what is going on? Thank you!