CUDA Out of Memory Error on a 32GB GPU when Running trainer.py for Fine-Tuning
Description:
Hello, I encountered a torch.cuda.OutOfMemoryError while fine-tuning a model using trainer.py. My setup includes only a single GPU with 32GB of memory, and the error occurs even at the beginning of training.
My modified trainer.py:
import argparse, os, sys, datetime
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # note: hard-codes a single visible GPU (GPU 0)
from omegaconf import OmegaConf
from transformers import logging as transf_logging
import pytorch_lightning as pl
from pytorch_lightning import seed_everything
from pytorch_lightning.trainer import Trainer
import torch
sys.path.insert(1, os.path.join(sys.path[0], '..'))
from utils.utils import instantiate_from_config
from utils_train import get_trainer_callbacks, get_trainer_logger, get_trainer_strategy
from utils_train import set_logger, init_workspace, load_checkpoints
def get_parser(**parser_kwargs):
    parser = argparse.ArgumentParser(**parser_kwargs)
    parser.add_argument("--seed", "-s", type=int, default=20230211, help="seed for seed_everything")
    parser.add_argument("--name", "-n", type=str, default="training_1024_v1.0", help="experiment name, as saving folder")
    # parser.add_argument("--base", "-b", nargs="*", metavar="base_config.yaml", help="paths to base configs. Loaded from left-to-right. "
    #                     "Parameters can be overwritten or added with command-line options of the form `--key value`.", default=list())
    parser.add_argument(
        "--base",
        "-b",
        nargs="*",
        metavar="base_config.yaml",
        help=(
            "Paths to base configs. Loaded from left-to-right. "
            "Parameters can be overwritten or added with command-line options of the form `--key value`."
        ),
        default=["/home/cherry2025/DynamiCrafter/configs/training_1024_v1.0/config.yaml"]
    )
    parser.add_argument("--train", "-t", action='store_true', default=True, help='train')
    parser.add_argument("--val", "-v", action='store_true', default=False, help='val')
    parser.add_argument("--test", action='store_true', default=False, help='test')
    parser.add_argument("--logdir", "-l", type=str, default="/home/cherry2025/DynamiCrafter/train_check", help="directory for logging data")
    parser.add_argument("--auto_resume", action='store_true', default=False, help="resume from full-info checkpoint")
    parser.add_argument("--auto_resume_weight_only", action='store_true', default=False, help="resume from weight-only checkpoint")
    parser.add_argument("--debug", "-d", action='store_true', default=False, help="enable post-mortem debugging")
    return parser
def get_nondefault_trainer_args(args):
    parser = argparse.ArgumentParser()
    parser = Trainer.add_argparse_args(parser)
    default_trainer_args = parser.parse_args([])
    return sorted(k for k in vars(default_trainer_args) if getattr(args, k) != getattr(default_trainer_args, k))
if __name__ == "__main__":
    now = datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S")

    # add
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    global_rank = int(os.environ.get('RANK', 0))
    num_rank = int(os.environ.get('WORLD_SIZE', 0))
    # end
    # local_rank = int(os.environ.get('LOCAL_RANK'))
    # global_rank = int(os.environ.get('RANK'))
    # num_rank = int(os.environ.get('WORLD_SIZE'))

    parser = get_parser()
    ## Extends existing argparse by default Trainer attributes
    parser = Trainer.add_argparse_args(parser)
    args, unknown = parser.parse_known_args()
    ## disable transformer warning
    transf_logging.set_verbosity_error()
    seed_everything(args.seed)

    ## yaml configs: "model" | "data" | "lightning"
    configs = [OmegaConf.load(cfg) for cfg in args.base]
    # # add
    # configs = []
    # for cfg in args.base:
    #     config = OmegaConf.load(cfg)
    #     configs.append(config)
    cli = OmegaConf.from_dotlist(unknown)
    config = OmegaConf.merge(*configs, cli)
    lightning_config = config.pop("lightning", OmegaConf.create())
    trainer_config = lightning_config.get("trainer", OmegaConf.create())

    ## setup workspace directories
    workdir, ckptdir, cfgdir, loginfo = init_workspace(args.name, args.logdir, config, lightning_config, global_rank)
    logger = set_logger(logfile=os.path.join(loginfo, 'log_%d:%s.txt'%(global_rank, now)))
    logger.info("@lightning version: %s [>=1.8 required]"%(pl.__version__))

    ## MODEL CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Configing Model *****")
    config.model.params.logdir = workdir
    model = instantiate_from_config(config.model)
    ## load checkpoints
    model = load_checkpoints(model, config.model)

    ## register_schedule again to make ZTSNR work
    if model.rescale_betas_zero_snr:
        model.register_schedule(given_betas=model.given_betas, beta_schedule=model.beta_schedule, timesteps=model.timesteps,
                                linear_start=model.linear_start, linear_end=model.linear_end, cosine_s=model.cosine_s)

    ## update trainer config
    for k in get_nondefault_trainer_args(args):
        trainer_config[k] = getattr(args, k)

    # num_nodes = trainer_config.num_nodes
    # ngpu_per_node = trainer_config.devices
    # add
    num_nodes = 1
    ngpu_per_node = 1
    logger.info(f"Running on {num_rank}={num_nodes}x{ngpu_per_node} GPUs")

    ## setup learning rate
    base_lr = config.model.base_learning_rate
    bs = config.data.params.batch_size
    if getattr(config.model, 'scale_lr', True):
        model.learning_rate = num_rank * bs * base_lr
    else:
        model.learning_rate = base_lr

    ## DATA CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Configing Data *****")
    data = instantiate_from_config(config.data)
    data.setup()
    for k in data.datasets:
        logger.info(f"{k}, {data.datasets[k].__class__.__name__}, {len(data.datasets[k])}")

    ## TRAINER CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Configing Trainer *****")
    if "accelerator" not in trainer_config:
        trainer_config["accelerator"] = "gpu"

    ## setup trainer args: pl-logger and callbacks
    trainer_kwargs = dict()
    trainer_kwargs["num_sanity_val_steps"] = 0
    logger_cfg = get_trainer_logger(lightning_config, workdir, args.debug)
    trainer_kwargs["logger"] = instantiate_from_config(logger_cfg)

    ## setup callbacks
    callbacks_cfg = get_trainer_callbacks(lightning_config, config, workdir, ckptdir, logger)
    trainer_kwargs["callbacks"] = [instantiate_from_config(callbacks_cfg[k]) for k in callbacks_cfg]
    strategy_cfg = get_trainer_strategy(lightning_config)
    trainer_kwargs["strategy"] = strategy_cfg if type(strategy_cfg) == str else instantiate_from_config(strategy_cfg)
    trainer_kwargs['precision'] = lightning_config.get('precision', 32)
    trainer_kwargs["sync_batchnorm"] = False

    ## trainer config: others
    trainer_args = argparse.Namespace(**trainer_config)
    trainer = Trainer.from_argparse_args(trainer_args, **trainer_kwargs)

    ## allow checkpointing via USR1
    def melk(*args, **kwargs):
        ## run all checkpoint hooks
        if trainer.global_rank == 0:
            print("Summoning checkpoint.")
            ckpt_path = os.path.join(ckptdir, "last_summoning.ckpt")
            trainer.save_checkpoint(ckpt_path)

    def divein(*args, **kwargs):
        if trainer.global_rank == 0:
            import pudb
            pudb.set_trace()

    import signal
    signal.signal(signal.SIGUSR1, melk)
    signal.signal(signal.SIGUSR2, divein)

    ## Running LOOP >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Running the Loop *****")
    if args.train:
        try:
            if "strategy" in lightning_config and lightning_config['strategy'].startswith('deepspeed'):
                logger.info("<Training in DeepSpeed Mode>")
                ## deepspeed
                if trainer_kwargs['precision'] == 16:
                    with torch.cuda.amp.autocast():
                        trainer.fit(model, data)
                else:
                    trainer.fit(model, data)
            else:
                logger.info("<Training in DDPSharded Mode>")  ## this is default
                ## ddpsharded
                trainer.fit(model, data)
        except Exception:
            # melk()
            raise

    # if args.val:
    #     trainer.validate(model, data)
    # if args.test or not trainer.interrupted:
    #     trainer.test(model, data)
Error Message:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 360.00 MiB (GPU 0; 31.73 GiB total capacity; 29.84 GiB already allocated; 80.19 MiB free; 30.33 GiB reserved in total by PyTorch).
If reserved memory is >> allocated memory, try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
This error occurred at:
Epoch 0: 0%| | 1/80000 [00:53<1195:43:13, 53.81s/it, loss=0.347, v_num=6, train/loss_simple_step=0.347, train/loss_vlb_step=0.347, train/loss_step=0.347]
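For reference, the allocator's hint about max_split_size_mb is applied through the PYTORCH_CUDA_ALLOC_CONF environment variable, which has to be set before the first CUDA allocation. A minimal sketch (the 128 MiB value is only an illustration, not a recommendation from the DynamiCrafter authors):

# Set before torch touches the GPU, e.g. exported in the shell or at the very top of trainer.py.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # illustrative value, must be set before the first CUDA allocation
import torch  # import torch only after the variable is in place

This only mitigates fragmentation, though; in the trace above reserved (30.33 GiB) is barely larger than allocated (29.84 GiB), so the model itself is simply close to the 32GB limit.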
Steps to Reproduce:
- Run trainer.py with a single 32GB GPU.
- Start the fine-tuning process.
- The error occurs shortly after the first epoch begins.
My Setup:
- GPU: Single 32GB GPU
Solutions Tried:
- Reduced the batch size to minimize memory usage.
Thank you for your assistance!
Hi What is your bs and your config.yaml? And can you try with more GPUs?
Hello, thank you for your response.
- Batch Size: My batch size is set to 1.
- Config File (config.yaml): Below is the content of my config file:

model:
  pretrained_checkpoint: /home/cherry2025/DynamiCrafter/checkpoints/dynamicrafter_1024_v1/model.ckpt
  base_learning_rate: 1.0e-05
  scale_lr: False
  target: lvdm.models.ddpm3d.LatentVisualDiffusion
  params:
    rescale_betas_zero_snr: True
    parameterization: "v"
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: video
    cond_stage_key: caption
    cond_stage_trainable: False
    image_proj_model_trainable: True
    conditioning_key: hybrid
    image_size: [72, 128]
    channels: 4
    scale_by_std: False
    scale_factor: 0.18215
    use_ema: False
    uncond_prob: 0.05
    uncond_type: 'empty_seq'
    rand_cond_frame: true
    use_dynamic_rescale: true
    base_scale: 0.3
    fps_condition_type: 'fps'
    perframe_ae: True

    unet_config:
      target: lvdm.modules.networks.openaimodel3d.UNetModel
      params:
        in_channels: 8
        out_channels: 4
        model_channels: 320
        attention_resolutions: [4, 2, 1]
        num_res_blocks: 2
        channel_mult: [1, 2, 4, 4]
        dropout: 0.1
        num_head_channels: 64
        transformer_depth: 1
        context_dim: 1024
        use_linear: true
        use_checkpoint: True
        temporal_conv: True
        temporal_attention: True
        temporal_selfatt_only: true
        use_relative_position: false
        use_causal_attention: False
        temporal_length: 16
        addition_attention: true
        image_cross_attention: true
        default_fs: 10
        fs_condition: true

    first_stage_config:
      target: lvdm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: True
          z_channels: 4
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [1, 2, 4, 4]
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity

    cond_stage_config:
      target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
      params:
        freeze: true
        layer: "penultimate"

    img_cond_stage_config:
      target: lvdm.modules.encoders.condition.FrozenOpenCLIPImageEmbedderV2
      params:
        freeze: true

    image_proj_stage_config:
      target: lvdm.modules.encoders.resampler.Resampler
      params:
        dim: 1024
        depth: 4
        dim_head: 64
        heads: 12
        num_queries: 16
        embedding_dim: 1280
        output_dim: 1024
        ff_mult: 4
        video_length: 16

data:
  target: utils_data.DataModuleFromConfig
  params:
    batch_size: 1
    num_workers: 2
    wrap: false
    train:
      target: lvdm.data.webvid.WebVid
      params:
        data_dir: "/home/cherry2025/DynamiCrafter/train_data/img1"
        meta_path: "/home/cherry2025/DynamiCrafter/train_data/webvid10m_mini_80k.csv"
        video_length: 16
        frame_stride: 6
        load_raw_resolution: true
        resolution: [1024]
        spatial_transform: resize_center_crop
        random_fs: true

lightning:
  precision: 16
  trainer:
    benchmark: True
    accumulate_grad_batches: 2
    max_steps: 100000
    log_every_n_steps: 50
    val_check_interval: 0.5
    gradient_clip_algorithm: 'norm'
    gradient_clip_val: 0.5
  callbacks:
    model_checkpoint:
      target: pytorch_lightning.callbacks.ModelCheckpoint
      params:
        every_n_train_steps: 9000
        filename: "{epoch}-{step}"
        save_weights_only: True
    metrics_over_trainsteps_checkpoint:
      target: pytorch_lightning.callbacks.ModelCheckpoint
      params:
        filename: '{epoch}-{step}'
        save_weights_only: True
        every_n_train_steps: 10000
    batch_logger:
      target: callbacks.ImageLogger
      params:
        batch_frequency: 500
        to_local: False
        max_images: 8
        log_images_kwargs:
          ddim_steps: 50
          unconditional_guidance_scale: 7.5
          timestep_spacing: uniform_trailing
          guidance_rescale: 0.7

- Dataset: My training dataset consists of 80,000 images.
- GPU Setup: Currently, I have access to a server with three 24GB GPUs.
Would this configuration be sufficient for the training process with this dataset size?
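One rough way to gauge this is to print the parameter counts right after the model is built in trainer.py (a sketch; model here is the object returned by instantiate_from_config(config.model)), keeping in mind that Adam-style optimizers add roughly 8 bytes of optimizer state per trainable fp32 parameter on top of weights, gradients, and activations:

# Sketch: run right after `model = instantiate_from_config(config.model)` in trainer.py.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable / 1e9:.2f}B of {total / 1e9:.2f}B total")
# With Adam, the optimizer states alone take roughly trainable * 8 bytes in fp32,
# before counting the weights, gradients, and activations themselves.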
Thank you for your assistance!
Hi. I forget the exact hardware requirement for fine-tuning DynamiCrafter-1024 model. It may be hard for a single 32GB GPU... Multiple 32GB GPUs would be possible.
Thank you!
Hi,
I'm encountering a CUDA out of memory error while fine-tuning my model, even though I'm using 4 GPUs, each with 32GB of memory. Here’s the error message:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 360.00 MiB (GPU 1; 31.73 GiB total capacity; 29.01 GiB already allocated; 102.19 MiB free; 31.14 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "./main/trainer.py", line 396, in <module>
trainer.fit(model, data)
Configuration:
- Using 4 GPUs (each with 32GB memory).
- Running a fine-tuning task with PyTorch Lightning.
What I’ve Tried:
- Reducing batch_size: lowering the batch size to reduce memory usage; batch_size is 1.
Despite these efforts, the error persists. Any insights into why this might be happening, or additional suggestions to troubleshoot?
Hi Did you try to fine-tune the DynamiCrafter-512 model and check the memory usage?
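To check the per-GPU memory usage during training, a minimal, hypothetical Lightning callback like the one below (PeakMemoryLogger is not part of the repo) could be appended to trainer_kwargs["callbacks"] in trainer.py. Note that with plain DDP every GPU holds a full model replica, so adding GPUs does not by itself reduce per-device memory:

import torch
import pytorch_lightning as pl

class PeakMemoryLogger(pl.Callback):
    # Hypothetical helper: logs the peak CUDA memory (GiB) seen so far on each rank.
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if torch.cuda.is_available():
            peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
            pl_module.log("peak_mem_gb", peak_gb, prog_bar=True, sync_dist=False)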
Hi, I try to
Hello. The issue persists; what is going on? Thank you!