
CUDA Out of Memory Error on a 32GB GPU when Running trainer.py for Fine-Tuning

Status: Open · xlnn opened this issue 1 year ago · 8 comments

Description:
Hello, I encountered a torch.cuda.OutOfMemoryError while fine-tuning a model using trainer.py. My setup includes only a single GPU with 32GB of memory, and the error occurs even at the beginning of training.

My modified trainer.py:

import argparse, os, sys, datetime
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from omegaconf import OmegaConf
from transformers import logging as transf_logging
import pytorch_lightning as pl
from pytorch_lightning import seed_everything
from pytorch_lightning.trainer import Trainer
import torch
sys.path.insert(1, os.path.join(sys.path[0], '..'))
from utils.utils import instantiate_from_config
from utils_train import get_trainer_callbacks, get_trainer_logger, get_trainer_strategy
from utils_train import set_logger, init_workspace, load_checkpoints




def get_parser(**parser_kwargs):
    parser = argparse.ArgumentParser(**parser_kwargs)
    parser.add_argument("--seed", "-s", type=int, default=20230211, help="seed for seed_everything")
    parser.add_argument("--name", "-n", type=str, default="training_1024_v1.0", help="experiment name, as saving folder")


    parser.add_argument(
        "--base",
        "-b",
        nargs="*",
        metavar="base_config.yaml",
        help=(
            "Paths to base configs. Loaded from left-to-right. "
            "Parameters can be overwritten or added with command-line options of the form `--key value`."
        ),
        default=["/home/cherry2025/DynamiCrafter/configs/training_1024_v1.0/config.yaml"]
    )

    parser.add_argument("--train", "-t", action='store_true', default=True, help='train')
    parser.add_argument("--val", "-v", action='store_true', default=False, help='val')
    parser.add_argument("--test", action='store_true', default=False, help='test')

    parser.add_argument("--logdir", "-l", type=str, default="/home/cherry2025/DynamiCrafter/train_check", help="directory for logging dat shit")
    parser.add_argument("--auto_resume", action='store_true', default=False, help="resume from full-info checkpoint")
    parser.add_argument("--auto_resume_weight_only", action='store_true', default=False, help="resume from weight-only checkpoint")
    parser.add_argument("--debug", "-d", action='store_true', default=False, help="enable post-mortem debugging")

    return parser

def get_nondefault_trainer_args(args):
    parser = argparse.ArgumentParser()
    parser = Trainer.add_argparse_args(parser)
    default_trainer_args = parser.parse_args([])
    return sorted(k for k in vars(default_trainer_args) if getattr(args, k) != getattr(default_trainer_args, k))


if __name__ == "__main__":
    now = datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
    ## distributed-launch env vars, with fallbacks for single-process runs
    ## (os.environ.get without a default returns None, which would crash int())
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    global_rank = int(os.environ.get('RANK', 0))
    num_rank = int(os.environ.get('WORLD_SIZE', 1))  ## default 1, not 0: num_rank scales the learning rate below

    parser = get_parser()
    ## Extends existing argparse by default Trainer attributes
    parser = Trainer.add_argparse_args(parser)
    args, unknown = parser.parse_known_args()
    ## disable transformer warning
    transf_logging.set_verbosity_error()
    seed_everything(args.seed)

    ## yaml configs: "model" | "data" | "lightning"
    configs = [OmegaConf.load(cfg) for cfg in args.base]

    cli = OmegaConf.from_dotlist(unknown)
    config = OmegaConf.merge(*configs, cli)
    lightning_config = config.pop("lightning", OmegaConf.create())
    trainer_config = lightning_config.get("trainer", OmegaConf.create()) 

    ## setup workspace directories
    workdir, ckptdir, cfgdir, loginfo = init_workspace(args.name, args.logdir, config, lightning_config, global_rank)
    logger = set_logger(logfile=os.path.join(loginfo, 'log_%d:%s.txt'%(global_rank, now)))
    logger.info("@lightning version: %s [>=1.8 required]"%(pl.__version__))  

    ## MODEL CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Configing Model *****")
    config.model.params.logdir = workdir
    model = instantiate_from_config(config.model)

    ## load checkpoints
    model = load_checkpoints(model, config.model)

    ## register_schedule again to make ZTSNR work
    if model.rescale_betas_zero_snr:
        model.register_schedule(given_betas=model.given_betas, beta_schedule=model.beta_schedule, timesteps=model.timesteps,
                                linear_start=model.linear_start, linear_end=model.linear_end, cosine_s=model.cosine_s)

    ## update trainer config
    for k in get_nondefault_trainer_args(args):
        trainer_config[k] = getattr(args, k)
        
    ## single-node, single-GPU run: hard-coded instead of reading trainer_config
    num_nodes = 1
    ngpu_per_node = 1
    logger.info(f"Running on {num_rank}={num_nodes}x{ngpu_per_node} GPUs")

    ## setup learning rate
    base_lr = config.model.base_learning_rate
    bs = config.data.params.batch_size
    if getattr(config.model, 'scale_lr', True):
        model.learning_rate = num_rank * bs * base_lr
    else:
        model.learning_rate = base_lr
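    ## e.g. with scale_lr: True, WORLD_SIZE=4 and batch_size=1, the effective
    ## learning rate is 4 * 1 * base_lr; with scale_lr: False it stays at base_lr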


    ## DATA CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Configing Data *****")
    data = instantiate_from_config(config.data)
    data.setup()
    for k in data.datasets:
        logger.info(f"{k}, {data.datasets[k].__class__.__name__}, {len(data.datasets[k])}")


    ## TRAINER CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Configing Trainer *****")
    if "accelerator" not in trainer_config:
        trainer_config["accelerator"] = "gpu"

    ## setup trainer args: pl-logger and callbacks
    trainer_kwargs = dict()
    trainer_kwargs["num_sanity_val_steps"] = 0
    logger_cfg = get_trainer_logger(lightning_config, workdir, args.debug)
    trainer_kwargs["logger"] = instantiate_from_config(logger_cfg)
    
    ## setup callbacks
    callbacks_cfg = get_trainer_callbacks(lightning_config, config, workdir, ckptdir, logger)
    trainer_kwargs["callbacks"] = [instantiate_from_config(callbacks_cfg[k]) for k in callbacks_cfg]
    strategy_cfg = get_trainer_strategy(lightning_config)
    trainer_kwargs["strategy"] = strategy_cfg if type(strategy_cfg) == str else instantiate_from_config(strategy_cfg)
    trainer_kwargs['precision'] = lightning_config.get('precision', 32)
    trainer_kwargs["sync_batchnorm"] = False

    ## trainer config: others

    trainer_args = argparse.Namespace(**trainer_config)
    trainer = Trainer.from_argparse_args(trainer_args, **trainer_kwargs)

    ## allow checkpointing via USR1
    def melk(*args, **kwargs):
        ## run all checkpoint hooks
        if trainer.global_rank == 0:
            print("Summoning checkpoint.")
            ckpt_path = os.path.join(ckptdir, "last_summoning.ckpt")
            trainer.save_checkpoint(ckpt_path)

    def divein(*args, **kwargs):
        if trainer.global_rank == 0:
            import pudb
            pudb.set_trace()

    import signal
    signal.signal(signal.SIGUSR1, melk)
    signal.signal(signal.SIGUSR2, divein)
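    ## to request a mid-run checkpoint from another shell, send SIGUSR1 to this
    ## process (kill -USR1 <pid>); SIGUSR2 drops rank 0 into the pudb debugger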

    ## Running LOOP >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    logger.info("***** Running the Loop *****")
    if args.train:
        try:
            if "strategy" in lightning_config and lightning_config['strategy'].startswith('deepspeed'):
                logger.info("<Training in DeepSpeed Mode>")
                ## deepspeed
                if trainer_kwargs['precision'] == 16:
                    with torch.cuda.amp.autocast():
                        trainer.fit(model, data)
                else:
                    trainer.fit(model, data)
            else:
                logger.info("<Training in DDPSharded Mode>") ## this is default
                ## ddpsharded
                trainer.fit(model, data)
        except Exception:
            #melk()
            raise

    # if args.val:
    #     trainer.validate(model, data)
    # if args.test or not trainer.interrupted:
    #     trainer.test(model, data)

Error Message:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 360.00 MiB (GPU 0; 31.73 GiB total capacity; 29.84 GiB already allocated; 80.19 MiB free; 30.33 GiB reserved in total by PyTorch). 
If reserved memory is >> allocated memory, try setting max_split_size_mb to avoid fragmentation. 
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

This error occurred at:

Epoch 0:   0%|          | 1/80000 [00:53<1195:43:13, 53.81s/it, loss=0.347, v_num=6, train/loss_simple_step=0.347, train/loss_vlb_step=0.347, train/loss_step=0.347]
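
The allocator tweak the message suggests can be set before CUDA initializes; a minimal sketch (the max_split_size_mb value of 128 is illustrative, not a tuned recommendation):

import os
## must run before torch touches CUDA, e.g. at the very top of trainer.py
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  ## imported only after the allocator config is in place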

Steps to Reproduce:

  1. Run trainer.py with a single 32GB GPU.
  2. Start the fine-tuning process.
  3. The error occurs almost immediately, at the first training step of epoch 0.

My Setup:

  • GPU: Single 32GB GPU

Solutions Tried:

  1. Reduced the batch size to minimize memory usage (see the memory-check sketch below).
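
To see where the memory goes, a quick helper like this can be dropped in right before the failing step (a sketch; where to call it is a judgment call):

import torch

def report_cuda_memory(tag=""):
    ## rough per-GPU snapshot of allocated vs. reserved memory
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 2**30
        reserved = torch.cuda.memory_reserved(i) / 2**30
        print(f"[{tag}] GPU {i}: {alloc:.2f} GiB allocated, {reserved:.2f} GiB reserved")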

Thank you for your assistance!

xlnn (Oct 29 '24)

Hi, what is your batch size (bs), and what does your config.yaml look like? And can you try with more GPUs?

Doubiiu (Oct 29 '24)

> Hi, what is your batch size (bs), and what does your config.yaml look like? And can you try with more GPUs?

Hello, thank you for your response.

  • Batch Size: My batch size is set to 1.

  • Config File (config.yaml): Below is the content of my config file:

    model:
      pretrained_checkpoint: /home/cherry2025/DynamiCrafter/checkpoints/dynamicrafter_1024_v1/model.ckpt
      base_learning_rate: 1.0e-05
      scale_lr: False
      target: lvdm.models.ddpm3d.LatentVisualDiffusion
      params:
        rescale_betas_zero_snr: True
        parameterization: "v"
        linear_start: 0.00085
        linear_end: 0.012
        num_timesteps_cond: 1
        log_every_t: 200
        timesteps: 1000
        first_stage_key: video
        cond_stage_key: caption
        cond_stage_trainable: False
        image_proj_model_trainable: True
        conditioning_key: hybrid
        image_size: [72, 128]
        channels: 4
        scale_by_std: False
        scale_factor: 0.18215
        use_ema: False
        uncond_prob: 0.05
        uncond_type: 'empty_seq'
        rand_cond_frame: true
        use_dynamic_rescale: true
        base_scale: 0.3
        fps_condition_type: 'fps'
        perframe_ae: True
        unet_config:
          target: lvdm.modules.networks.openaimodel3d.UNetModel
          params:
            in_channels: 8
            out_channels: 4
            model_channels: 320
            attention_resolutions: [4, 2, 1]
            num_res_blocks: 2
            channel_mult: [1, 2, 4, 4]
            dropout: 0.1
            num_head_channels: 64
            transformer_depth: 1
            context_dim: 1024
            use_linear: true
            use_checkpoint: True
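            # use_checkpoint enables gradient (activation) checkpointing in the
            # UNet, trading extra compute for lower activation memory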
            temporal_conv: True
            temporal_attention: True
            temporal_selfatt_only: true
            use_relative_position: false
            use_causal_attention: False
            temporal_length: 16
            addition_attention: true
            image_cross_attention: true
            default_fs: 10
            fs_condition: true
        first_stage_config:
          target: lvdm.models.autoencoder.AutoencoderKL
          params:
            embed_dim: 4
            monitor: val/rec_loss
            ddconfig:
              double_z: True
              z_channels: 4
              resolution: 256
              in_channels: 3
              out_ch: 3
              ch: 128
              ch_mult: [1, 2, 4, 4]
              num_res_blocks: 2
              attn_resolutions: []
              dropout: 0.0
            lossconfig:
              target: torch.nn.Identity
        cond_stage_config:
          target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
          params:
            freeze: true
            layer: "penultimate"
        img_cond_stage_config:
          target: lvdm.modules.encoders.condition.FrozenOpenCLIPImageEmbedderV2
          params:
            freeze: true
        image_proj_stage_config:
          target: lvdm.modules.encoders.resampler.Resampler
          params:
            dim: 1024
            depth: 4
            dim_head: 64
            heads: 12
            num_queries: 16
            embedding_dim: 1280
            output_dim: 1024
            ff_mult: 4
            video_length: 16
    data:
      target: utils_data.DataModuleFromConfig
      params:
        batch_size: 1
        num_workers: 2
        wrap: false
        train:
          target: lvdm.data.webvid.WebVid
          params:
            data_dir: "/home/cherry2025/DynamiCrafter/train_data/img1"
            meta_path: "/home/cherry2025/DynamiCrafter/train_data/webvid10m_mini_80k.csv"
            video_length: 16
            frame_stride: 6
            load_raw_resolution: true
            resolution: [1024]
            spatial_transform: resize_center_crop
            random_fs: true
    lightning:
      precision: 16
      trainer:
        benchmark: True
        accumulate_grad_batches: 2
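        # note: gradient accumulation does not lower per-step activation
        # memory; effective batch = batch_size x accumulate_grad_batches x GPUs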
        max_steps: 100000
        log_every_n_steps: 50
        val_check_interval: 0.5
        gradient_clip_algorithm: 'norm'
        gradient_clip_val: 0.5
      callbacks:
        model_checkpoint:
          target: pytorch_lightning.callbacks.ModelCheckpoint
          params:
            every_n_train_steps: 9000
            filename: "{epoch}-{step}"
            save_weights_only: True
        metrics_over_trainsteps_checkpoint:
          target: pytorch_lightning.callbacks.ModelCheckpoint
          params:
            filename: '{epoch}-{step}'
            save_weights_only: True
            every_n_train_steps: 10000
        batch_logger:
          target: callbacks.ImageLogger
          params:
            batch_frequency: 500
            to_local: False
            max_images: 8
            log_images_kwargs:
              ddim_steps: 50
              unconditional_guidance_scale: 7.5
              timestep_spacing: uniform_trailing
              guidance_rescale: 0.7
    
  • Dataset: My training dataset consists of 80,000 images.

  • GPU Setup: Currently, I have access to a server with three 24GB GPUs.

Would this configuration be sufficient for the training process with this dataset size?

Thank you for your assistance!

xlnn (Oct 29 '24)

Hi. I don't remember the exact hardware requirements for fine-tuning the DynamiCrafter-1024 model. It may be hard on a single 32GB GPU... multiple 32GB GPUs should be feasible.

Doubiiu (Oct 29 '24)

> Hi. I don't remember the exact hardware requirements for fine-tuning the DynamiCrafter-1024 model. It may be hard on a single 32GB GPU... multiple 32GB GPUs should be feasible.

Thank you!

xlnn (Oct 30 '24)

> Hi. I don't remember the exact hardware requirements for fine-tuning the DynamiCrafter-1024 model. It may be hard on a single 32GB GPU... multiple 32GB GPUs should be feasible.

Hi,

I'm encountering a CUDA out of memory error while fine-tuning my model, even though I'm using 4 GPUs, each with 32GB of memory. Here’s the error message:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 360.00 MiB (GPU 1; 31.73 GiB total capacity; 29.01 GiB already allocated; 102.19 MiB free; 31.14 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "./main/trainer.py", line 396, in <module>
    trainer.fit(model, data)

Configuration:

  • Using 4 GPUs (each with 32GB memory).
  • Running a fine-tuning task with PyTorch Lightning.

What I’ve Tried:

  1. Reducing batch_size – lowered the batch size to 1 to reduce memory usage.

Despite these efforts, the error persists. Any insights into why this might be happening, or additional suggestions to troubleshoot?
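
For diagnosis, a small callback like this (a sketch, assuming pytorch_lightning >= 1.8, as the trainer requires) can print the per-rank peak memory after every batch; it could be appended to trainer_kwargs["callbacks"] in trainer.py:

import torch
import pytorch_lightning as pl

class CudaMemoryLogger(pl.Callback):
    ## prints the peak allocated CUDA memory on each rank after every batch
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        peak_gib = torch.cuda.max_memory_allocated() / 2**30
        print(f"rank {trainer.global_rank} | batch {batch_idx} | peak {peak_gib:.2f} GiB")
        torch.cuda.reset_peak_memory_stats()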

xlnn (Oct 30 '24)

Hi, did you try fine-tuning the DynamiCrafter-512 model and checking its memory usage?

Doubiiu (Oct 30 '24)

> Hi, did you try fine-tuning the DynamiCrafter-512 model and checking its memory usage?

Hi, I tried that.

xlnn (Oct 31 '24)

> Hi, did you try fine-tuning the DynamiCrafter-512 model and checking its memory usage?

Hello. The issue still persists; any idea what is going on? Thank you!

xlnn (Oct 31 '24)