
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432

Open chenxinli001 opened this issue 1 year ago • 11 comments

System Info / 系統信息

When I fine-tune CogVideoX-2B, I found that almost all steps are skipped and the loss scale is very large.

Information / 问题信息

  • [X] The official example scripts / 官方的示例脚本
  • [ ] My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

Just run:

#! /bin/bash

echo "RUN on hostname, CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

environs="WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1"

run_cmd="$environs python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"

echo ${run_cmd}
eval ${run_cmd}

echo "DONE on hostname"

Expected behavior / 期待表现

Is this normal?

chenxinli001 avatar Sep 03 '24 02:09 chenxinli001

Nope. Can you share the log?

zRzRzRzRzRzRzR avatar Sep 03 '24 02:09 zRzRzRzRzRzRzR

(cogvideo) ubuntu@instance-butter:/data3/cx_workspace/CogV/CogVideo/sat$ bash finetune_single_gpu.sh
RUN on instance-butter, CUDA_VISIBLE_DEVICES=6
WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1 python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed 22338
[2024-09-03 02:36:29,937] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  def forward(ctx, input, weight, bias=None):
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  def backward(ctx, grad_output):
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/kornia/feature/lightglue.py:44: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
[2024-09-03 02:36:34,890] [INFO] using world size: 1
[2024-09-03 02:36:34,891] [INFO] Will override arguments with manually specified deepspeed_config!
[2024-09-03 02:36:34,893] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-09-03 02:36:34,894] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-03 02:36:34,922] [INFO] [RANK 0] building SATVideoDiffusionEngine model ...
[2024-09-03 02:36:44,771] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-09-03 02:36:44,904] [INFO] [RANK 0] replacing layer 1 attention with lora
[2024-09-03 02:36:45,021] [INFO] [RANK 0] replacing layer 2 attention with lora
[2024-09-03 02:36:45,131] [INFO] [RANK 0] replacing layer 3 attention with lora
[2024-09-03 02:36:45,241] [INFO] [RANK 0] replacing layer 4 attention with lora
[2024-09-03 02:36:45,352] [INFO] [RANK 0] replacing layer 5 attention with lora
[2024-09-03 02:36:45,466] [INFO] [RANK 0] replacing layer 6 attention with lora
[2024-09-03 02:36:45,575] [INFO] [RANK 0] replacing layer 7 attention with lora
[2024-09-03 02:36:45,687] [INFO] [RANK 0] replacing layer 8 attention with lora
[2024-09-03 02:36:45,851] [INFO] [RANK 0] replacing layer 9 attention with lora
[2024-09-03 02:36:45,957] [INFO] [RANK 0] replacing layer 10 attention with lora
[2024-09-03 02:36:46,063] [INFO] [RANK 0] replacing layer 11 attention with lora
[2024-09-03 02:36:46,173] [INFO] [RANK 0] replacing layer 12 attention with lora
[2024-09-03 02:36:46,280] [INFO] [RANK 0] replacing layer 13 attention with lora
[2024-09-03 02:36:46,387] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-09-03 02:36:46,495] [INFO] [RANK 0] replacing layer 15 attention with lora
[2024-09-03 02:36:46,606] [INFO] [RANK 0] replacing layer 16 attention with lora
[2024-09-03 02:36:46,761] [INFO] [RANK 0] replacing layer 17 attention with lora
[2024-09-03 02:36:46,901] [INFO] [RANK 0] replacing layer 18 attention with lora
[2024-09-03 02:36:47,044] [INFO] [RANK 0] replacing layer 19 attention with lora
[2024-09-03 02:36:47,171] [INFO] [RANK 0] replacing layer 20 attention with lora
[2024-09-03 02:36:47,291] [INFO] [RANK 0] replacing layer 21 attention with lora
[2024-09-03 02:36:47,397] [INFO] [RANK 0] replacing layer 22 attention with lora
[2024-09-03 02:36:47,506] [INFO] [RANK 0] replacing layer 23 attention with lora
[2024-09-03 02:36:47,610] [INFO] [RANK 0] replacing layer 24 attention with lora
[2024-09-03 02:36:47,774] [INFO] [RANK 0] replacing layer 25 attention with lora
[2024-09-03 02:36:47,881] [INFO] [RANK 0] replacing layer 26 attention with lora
[2024-09-03 02:36:47,986] [INFO] [RANK 0] replacing layer 27 attention with lora
[2024-09-03 02:36:48,095] [INFO] [RANK 0] replacing layer 28 attention with lora
[2024-09-03 02:36:48,208] [INFO] [RANK 0] replacing layer 29 attention with lora
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00, 2.28s/it]
Initialized embedder #0: FrozenT5Embedder with 4762310656 params. Trainable: False
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
/data3/cx_workspace/CogV/CogVideo/sat/vae_modules/autoencoder.py:565: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals.
We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. sd = torch.load(path, map_location="cpu")["state_dict"] Deleting key loss.logvar from state_dict. Deleting key loss.perceptual_loss.scaling_layer.shift from state_dict. Deleting key loss.perceptual_loss.scaling_layer.scale from state_dict. Deleting key loss.perceptual_loss.net.slice1.0.weight from state_dict. Deleting key loss.perceptual_loss.net.slice1.0.bias from state_dict. Deleting key loss.perceptual_loss.net.slice1.2.weight from state_dict. Deleting key loss.perceptual_loss.net.slice1.2.bias from state_dict. Deleting key loss.perceptual_loss.net.slice2.5.weight from state_dict. Deleting key loss.perceptual_loss.net.slice2.5.bias from state_dict. Deleting key loss.perceptual_loss.net.slice2.7.weight from state_dict. Deleting key loss.perceptual_loss.net.slice2.7.bias from state_dict. Deleting key loss.perceptual_loss.net.slice3.10.weight from state_dict. Deleting key loss.perceptual_loss.net.slice3.10.bias from state_dict. Deleting key loss.perceptual_loss.net.slice3.12.weight from state_dict. Deleting key loss.perceptual_loss.net.slice3.12.bias from state_dict. Deleting key loss.perceptual_loss.net.slice3.14.weight from state_dict. Deleting key loss.perceptual_loss.net.slice3.14.bias from state_dict. Deleting key loss.perceptual_loss.net.slice4.17.weight from state_dict. Deleting key loss.perceptual_loss.net.slice4.17.bias from state_dict. Deleting key loss.perceptual_loss.net.slice4.19.weight from state_dict. Deleting key loss.perceptual_loss.net.slice4.19.bias from state_dict. Deleting key loss.perceptual_loss.net.slice4.21.weight from state_dict. Deleting key loss.perceptual_loss.net.slice4.21.bias from state_dict. Deleting key loss.perceptual_loss.net.slice5.24.weight from state_dict. Deleting key loss.perceptual_loss.net.slice5.24.bias from state_dict. Deleting key loss.perceptual_loss.net.slice5.26.weight from state_dict. Deleting key loss.perceptual_loss.net.slice5.26.bias from state_dict. Deleting key loss.perceptual_loss.net.slice5.28.weight from state_dict. Deleting key loss.perceptual_loss.net.slice5.28.bias from state_dict. Deleting key loss.perceptual_loss.lin0.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin1.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin2.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin3.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin4.model.1.weight from state_dict. Deleting key loss.discriminator.blocks.0.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.0.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.0.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.0.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.1.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.net.0.conv.weight from state_dict. 
Deleting key loss.discriminator.blocks.1.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.1.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.1.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.conv_res.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.conv_res.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.downsample.1.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.downsample.1.bias from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_q.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_kv.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_out.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.5.0.conv_res.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.conv_res.bias from state_dict. Deleting key loss.discriminator.blocks.5.0.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.5.0.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.5.0.downsample.1.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.downsample.1.bias from state_dict. 
Deleting key loss.discriminator.blocks.5.1.0.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_q.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_kv.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_out.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.6.0.conv_res.weight from state_dict. Deleting key loss.discriminator.blocks.6.0.conv_res.bias from state_dict. Deleting key loss.discriminator.blocks.6.0.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.0.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.6.0.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.6.0.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_q.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_kv.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_out.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.bias from state_dict. Deleting key loss.discriminator.to_logits.0.weight from state_dict. Deleting key loss.discriminator.to_logits.0.bias from state_dict. Deleting key loss.discriminator.to_logits.3.weight from state_dict. Deleting key loss.discriminator.to_logits.3.bias from state_dict. Missing keys: [] Unexpected keys: [] Restored from CogVideoX-2b-sat/vae/3d-vae.pt [2024-09-03 02:36:56,856] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 6764790755 [2024-09-03 02:37:15,810] [INFO] [RANK 0] global rank 0 is loading checkpoint CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt /data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/sat/training/model_io.py:286: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. 
sd = torch.load(checkpoint_name, map_location='cpu') [2024-09-03 02:37:17,758] [INFO] [RANK 0] > successfully loaded CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt [2024-09-03 02:37:18,437] [INFO] [RANK 0] ***** Total trainable parameters: 58982400 ***** [2024-09-03 02:37:18,437] [INFO] [RANK 0] [<class 'sat.ops.layernorm.LayerNorm'>, <class 'torch.nn.modules.normalization.LayerNorm'>, <class 'sat.ops.layernorm.RMSNorm'>] is set to no_weight_decay [2024-09-03 02:37:18,440] [INFO] [RANK 0] Syncing initialized parameters... [2024-09-03 02:37:18,503] [INFO] [RANK 0] Finished syncing initialized parameters. [2024-09-03 02:37:18,503] [INFO] [RANK 0] Using optimizer sat.ops.FusedEmaAdam from sat. [2024-09-03 02:37:18,503] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown [2024-09-03 02:37:18,503] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead [2024-09-03 02:37:18,646] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /data1/.cache/torch_extensions/py310_cu121 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /data1/.cache/torch_extensions/py310_cu121/fused_ema_adam/build.ninja... /data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST']. warnings.warn( Building extension module fused_ema_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_ema_adam... 
Time to load fused_ema_adam op: 0.07278060913085938 seconds [2024-09-03 02:37:18,724] [INFO] [logging.py:96:log_dist] [Rank 0] Using client callable to create basic optimizer [2024-09-03 02:37:18,725] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2024-09-03 02:37:18,762] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedEmaAdam [2024-09-03 02:37:18,763] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedEmaAdam type=<class 'sat.ops.fused_ema_adam.FusedEmaAdam'> [2024-09-03 02:37:18,763] [WARNING] [engine.py:1179:do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution ***** [2024-09-03 02:37:18,763] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer [2024-09-03 02:37:18,763] [INFO] [stage_1_and_2.py:148:init] Reduce bucket size 1000000000 [2024-09-03 02:37:18,763] [INFO] [stage_1_and_2.py:149:init] Allgather bucket size 1000000000 [2024-09-03 02:37:18,763] [INFO] [stage_1_and_2.py:150:init] CPU Offload: False [2024-09-03 02:37:18,763] [INFO] [stage_1_and_2.py:151:init] Round robin gradient partitioning: False [2024-09-03 02:37:23,295] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states [2024-09-03 02:37:23,295] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.97 GB CA 13.23 GB Max_CA 13 GB [2024-09-03 02:37:23,295] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 478.67 GB, percent = 23.7% [2024-09-03 02:37:23,814] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2024-09-03 02:37:23,814] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 13.08 GB CA 13.45 GB Max_CA 13 GB [2024-09-03 02:37:23,814] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 483.06 GB, percent = 24.0% [2024-09-03 02:37:23,815] [INFO] [stage_1_and_2.py:543:init] optimizer state initialized [2024-09-03 02:37:24,129] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2024-09-03 02:37:24,130] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 13.45 GB Max_CA 13 GB [2024-09-03 02:37:24,130] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 485.83 GB, percent = 24.1% [2024-09-03 02:37:24,134] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer [2024-09-03 02:37:24,134] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler [2024-09-03 02:37:24,134] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None [2024-09-03 02:37:24,134] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1.0], mom=[[0.9, 0.95]] [2024-09-03 02:37:24,137] [INFO] [config.py:997:print] DeepSpeedEngine configuration: [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] amp_enabled .................. False [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] amp_params ................... 
False [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] bfloat16_enabled ............. False [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x15507917fbb0> [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] communication_data_type ...... None [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] dataloader_drop_last ......... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] disable_allgather ............ False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] dump_state ................... 
False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... None [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] elasticity_enabled ........... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] fp16_auto_cast ............... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] fp16_enabled ................. True [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] global_rank .................. 0 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] grad_accum_dtype ............. None [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 1 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] gradient_clipping ............ 0.1 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] graph_harvesting ............. False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 65536 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] load_universal_checkpoint .... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] loss_scale ................... 0 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] memory_breakdown ............. False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] mics_shard_size .............. -1 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] nebula_config ................ 
{ "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] optimizer_name ............... None [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] optimizer_params ............. None [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] pld_enabled .................. False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] pld_params ................... False [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] prescale_gradients ........... False [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] scheduler_name ............... None [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] scheduler_params ............. None [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32 [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] sparse_attention ............. None [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] steps_per_print .............. 50 [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] train_batch_size ............. 2 [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 2 [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] use_data_before_expert_parallel False [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] use_node_local_storage ....... False [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] wall_clock_breakdown ......... False [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] weight_quantization_config ... None [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] world_size ................... 1 [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_allow_untested_optimizer True [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_config .................. 
stage=2 contiguous_gradients=False reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=False elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_enabled ................. True [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_optimization_stage ...... 2 [2024-09-03 02:37:24,139] [INFO] [config.py:987:print_user_config] json = { "train_micro_batch_size_per_gpu": 2, "gradient_accumulation_steps": 1, "steps_per_print": 50, "gradient_clipping": 0.1, "zero_optimization": { "stage": 2, "cpu_offload": false, "contiguous_gradients": false, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "allgather_bucket_size": 1.000000e+09, "load_from_fp32_weights": false }, "zero_allow_untested_optimizer": true, "bf16": { "enabled": false }, "fp16": { "enabled": true }, "loss_scale": 0, "loss_scale_window": 400, "hysteresis": 2, "min_loss_scale": 1, "activation_checkpointing": { "partition_activations": false, "contiguous_memory_optimization": false }, "wall_clock_breakdown": false } [2024-09-03 02:37:24,139] [INFO] [RANK 0] learning rate decaying style linear, ratio 10.0 [2024-09-03 02:37:24,139] [INFO] [RANK 0] Finetuning Model... [2024-09-03 02:37:24,139] [INFO] [RANK 0] arguments: [2024-09-03 02:37:24,139] [INFO] [RANK 0] base ......................... ['configs/cogvideox_2b_lora.yaml', 'configs/sft.yaml'] [2024-09-03 02:37:24,139] [INFO] [RANK 0] model_parallel_size .......... 1 [2024-09-03 02:37:24,139] [INFO] [RANK 0] force_pretrain ............... False [2024-09-03 02:37:24,139] [INFO] [RANK 0] device ....................... 0 [2024-09-03 02:37:24,139] [INFO] [RANK 0] debug ........................ False [2024-09-03 02:37:24,139] [INFO] [RANK 0] log_image .................... True [2024-09-03 02:37:24,139] [INFO] [RANK 0] output_dir ................... samples [2024-09-03 02:37:24,139] [INFO] [RANK 0] input_dir .................... None [2024-09-03 02:37:24,139] [INFO] [RANK 0] input_type ................... cli [2024-09-03 02:37:24,139] [INFO] [RANK 0] input_file ................... input.txt [2024-09-03 02:37:24,139] [INFO] [RANK 0] final_size ................... 2048 [2024-09-03 02:37:24,140] [INFO] [RANK 0] sdedit ....................... False [2024-09-03 02:37:24,140] [INFO] [RANK 0] grid_num_rows ................ 1 [2024-09-03 02:37:24,140] [INFO] [RANK 0] force_inference .............. False [2024-09-03 02:37:24,140] [INFO] [RANK 0] lcm_steps .................... 
None [2024-09-03 02:37:24,140] [INFO] [RANK 0] sampling_num_frames .......... 32 [2024-09-03 02:37:24,140] [INFO] [RANK 0] sampling_fps ................. 8 [2024-09-03 02:37:24,140] [INFO] [RANK 0] only_save_latents ............ False [2024-09-03 02:37:24,140] [INFO] [RANK 0] only_log_video_latents ....... True [2024-09-03 02:37:24,140] [INFO] [RANK 0] latent_channels .............. 32 [2024-09-03 02:37:24,140] [INFO] [RANK 0] image2video .................. False [2024-09-03 02:37:24,140] [INFO] [RANK 0] experiment_name .............. example_data-09-03-02-36 [2024-09-03 02:37:24,140] [INFO] [RANK 0] train_iters .................. 1000 [2024-09-03 02:37:24,140] [INFO] [RANK 0] batch_size ................... 2 [2024-09-03 02:37:24,140] [INFO] [RANK 0] lr ........................... 0.001 [2024-09-03 02:37:24,140] [INFO] [RANK 0] mode ......................... finetune [2024-09-03 02:37:24,140] [INFO] [RANK 0] seed ......................... 22338 [2024-09-03 02:37:24,140] [INFO] [RANK 0] zero_stage ................... 0 [2024-09-03 02:37:24,140] [INFO] [RANK 0] checkpoint_activations ....... True [2024-09-03 02:37:24,140] [INFO] [RANK 0] checkpoint_num_layers ........ 1 [2024-09-03 02:37:24,140] [INFO] [RANK 0] checkpoint_skip_layers ....... 0 [2024-09-03 02:37:24,140] [INFO] [RANK 0] fp16 ......................... True [2024-09-03 02:37:24,140] [INFO] [RANK 0] bf16 ......................... False [2024-09-03 02:37:24,140] [INFO] [RANK 0] gradient_accumulation_steps .. 1 [2024-09-03 02:37:24,140] [INFO] [RANK 0] profiling .................... -1 [2024-09-03 02:37:24,140] [INFO] [RANK 0] epochs ....................... None [2024-09-03 02:37:24,140] [INFO] [RANK 0] log_interval ................. 20 [2024-09-03 02:37:24,140] [INFO] [RANK 0] summary_dir .................. [2024-09-03 02:37:24,140] [INFO] [RANK 0] save_args .................... False [2024-09-03 02:37:24,140] [INFO] [RANK 0] lr_decay_iters ............... None [2024-09-03 02:37:24,140] [INFO] [RANK 0] lr_decay_style ............... linear [2024-09-03 02:37:24,140] [INFO] [RANK 0] lr_decay_ratio ............... 0.1 [2024-09-03 02:37:24,140] [INFO] [RANK 0] warmup ....................... 0.01 [2024-09-03 02:37:24,140] [INFO] [RANK 0] weight_decay ................. 0.0001 [2024-09-03 02:37:24,140] [INFO] [RANK 0] save ......................... ckpts_2b/example_data-09-03-02-36 [2024-09-03 02:37:24,140] [INFO] [RANK 0] load ......................... CogVideoX-2b-sat/transformer [2024-09-03 02:37:24,140] [INFO] [RANK 0] force_train .................. True [2024-09-03 02:37:24,140] [INFO] [RANK 0] save_interval ................ 500 [2024-09-03 02:37:24,140] [INFO] [RANK 0] no_save_rng .................. False [2024-09-03 02:37:24,140] [INFO] [RANK 0] no_load_rng .................. True [2024-09-03 02:37:24,140] [INFO] [RANK 0] resume_dataloader ............ False [2024-09-03 02:37:24,141] [INFO] [RANK 0] distributed_backend .......... nccl [2024-09-03 02:37:24,141] [INFO] [RANK 0] local_rank ................... 0 [2024-09-03 02:37:24,141] [INFO] [RANK 0] exit_interval ................ None [2024-09-03 02:37:24,141] [INFO] [RANK 0] wandb ........................ False [2024-09-03 02:37:24,141] [INFO] [RANK 0] wandb_project_name ........... default_project [2024-09-03 02:37:24,141] [INFO] [RANK 0] eval_batch_size .............. 1 [2024-09-03 02:37:24,141] [INFO] [RANK 0] eval_iters ................... 1 [2024-09-03 02:37:24,141] [INFO] [RANK 0] eval_interval ................ 
100 [2024-09-03 02:37:24,141] [INFO] [RANK 0] strict_eval .................. False [2024-09-03 02:37:24,141] [INFO] [RANK 0] train_data ................... ['toy_data'] [2024-09-03 02:37:24,141] [INFO] [RANK 0] train_data_weights ........... None [2024-09-03 02:37:24,141] [INFO] [RANK 0] iterable_dataset ............. False [2024-09-03 02:37:24,141] [INFO] [RANK 0] iterable_dataset_eval ........ [2024-09-03 02:37:24,141] [INFO] [RANK 0] batch_from_same_dataset ...... False [2024-09-03 02:37:24,141] [INFO] [RANK 0] valid_data ................... ['toy_data'] [2024-09-03 02:37:24,141] [INFO] [RANK 0] test_data .................... None [2024-09-03 02:37:24,141] [INFO] [RANK 0] split ........................ 1,0,0 [2024-09-03 02:37:24,141] [INFO] [RANK 0] num_workers .................. 8 [2024-09-03 02:37:24,141] [INFO] [RANK 0] block_size ................... 10000 [2024-09-03 02:37:24,141] [INFO] [RANK 0] prefetch_factor .............. 4 [2024-09-03 02:37:24,141] [INFO] [RANK 0] deepspeed .................... True [2024-09-03 02:37:24,141] [INFO] [RANK 0] deepspeed_config ............. {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False} [2024-09-03 02:37:24,141] [INFO] [RANK 0] deepscale .................... False [2024-09-03 02:37:24,141] [INFO] [RANK 0] deepscale_config ............. None [2024-09-03 02:37:24,141] [INFO] [RANK 0] model_config ................. 
{'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False, 'num_layers': 30, 'hidden_size': 1920, 'num_attention_heads': 30, 'parallel_output': True}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}, 'dtype': 'fp16'}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': 'CogVideoX-2b-sat/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': 'CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': 
{'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}} [2024-09-03 02:37:24,141] [INFO] [RANK 0] data_config .................. {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}} [2024-09-03 02:37:24,141] [INFO] [RANK 0] cuda ......................... True [2024-09-03 02:37:24,142] [INFO] [RANK 0] rank ......................... 0 [2024-09-03 02:37:24,142] [INFO] [RANK 0] world_size ................... 1 [2024-09-03 02:37:24,142] [INFO] [RANK 0] deepspeed_activation_checkpointing True [2024-09-03 02:37:24,142] [INFO] [RANK 0] master_ip .................... localhost [2024-09-03 02:37:24,142] [INFO] [RANK 0] master_port .................. 38137 [2024-09-03 02:37:24,142] [INFO] [RANK 0] log_config ................... [{'model': {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': 'CogVideoX-2b-sat/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': 'CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 
'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}}, {'args': {'checkpoint_activations': True, 'model_parallel_size': 1, 'experiment_name': 'example_data', 'mode': 'finetune', 'load': 'CogVideoX-2b-sat/transformer', 'no_load_rng': True, 'train_iters': 1000, 'eval_iters': 1, 'eval_interval': 100, 'eval_batch_size': 1, 'save': 'ckpts_2b', 'save_interval': 500, 'log_interval': 20, 'train_data': ['toy_data'], 'valid_data': ['toy_data'], 'split': '1,0,0', 'num_workers': 8, 'force_train': True, 'only_log_video_latents': True}, 'data': {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}, 'deepspeed': {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'optimizer': {'type': 'sat.ops.FusedEmaAdam', 'params': {'lr': 0.001, 'betas': [0.9, 0.95], 'eps': '1e-8', 'weight_decay': '1e-4'}}, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}}] [2024-09-03 02:37:24,142] [INFO] [RANK 0] do_train ..................... True [2024-09-03 02:37:24,142] [INFO] [RANK 0] val_last_shape ............... [] [2024-09-03 02:37:24,142] [INFO] [RANK 0] val_drop_number .............. 0 [2024-09-03 02:37:24,142] [INFO] [RANK 0] do_valid ..................... True [2024-09-03 02:37:24,142] [INFO] [RANK 0] do_test ...................... False [2024-09-03 02:37:24,142] [INFO] [RANK 0] iteration .................... 
0
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:541:forward] Activation Checkpointing Information
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:542:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:543:forward] ----contiguous Memory Checkpointing False with None total layers
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:545:forward] ----Synchronization False
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False
[2024-09-03 02:38:17,054] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-03 02:38:39,582] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-09-03 02:39:01,823] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-09-03 02:39:25,772] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-09-03 02:40:34,024] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
[2024-09-03 02:42:31,555] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
[2024-09-03 02:43:17,654] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
[2024-09-03 02:45:38,242] [INFO] [RANK 0] iteration 20/ 1000 | elapsed time per iteration (ms): 24611.4 | learning rate 5.000E-05 | total loss 2.157213E-01 | loss 2.157214E-01 | loss scale 33554432.0 |speed 4.88 samples/(min*GPU)
[2024-09-03 02:45:38,244] [INFO] [RANK 0] after 20 iterations memory (MB) | allocated: 13974.6455078125 | max allocated: 64453.90478515625 | cached: 22772.0 | max cached: 76136.0
[2024-09-03 02:45:38,244] [INFO] [RANK 0] time (ms) | forward: 15575.33 | backward: 8930.58 | allreduce: 0.00 | optimizer: 101.48 | data loader: 19.31
[2024-09-03 02:53:15,005] [INFO] [RANK 0] iteration 40/ 1000 | elapsed time per iteration (ms): 22838.1 | learning rate 5.000E-05 | total loss 2.180460E-01 | loss 2.180460E-01 | loss scale 33554432.0 |speed 5.25 samples/(min*GPU)
[2024-09-03 02:53:15,006] [INFO] [RANK 0] time (ms) | forward: 13797.73 | backward: 8961.48 | allreduce: 0.00 | optimizer: 74.96 | data loader: 0.37
[2024-09-03 02:54:01,688] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432, reducing to 16777216
[2024-09-03 02:57:02,117] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=8, lr=[5e-05], mom=[[0.9, 0.95]]
[2024-09-03 03:00:52,438] [INFO] [RANK 0] iteration 60/ 1000 | elapsed time per iteration (ms): 22871.6 | learning rate 5.000E-05 | total loss 2.016617E-01 | loss 2.016617E-01 | loss scale 16777216.0 |speed 5.25 samples/(min*GPU)
[2024-09-03 03:00:52,438] [INFO] [RANK 0] time (ms) | forward: 13888.08 | backward: 8902.19 | allreduce: 0.00 | optimizer: 77.79 | data loader: 0.66

chenxinli001 avatar Sep 03 '24 03:09 chenxinli001

I'm running finetune_single_gpu.sh:

export CUDA_VISIBLE_DEVICES=6

echo "RUN on `hostname`, CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

environs="WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1"

run_cmd="$environs python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"

echo ${run_cmd}
eval ${run_cmd}

echo "DONE on `hostname`"

I keep running into huge loss scales like the ones below, and the log shows the steps being skipped. Is this normal?

[2024-09-03 02:38:17,054] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-03 02:38:39,582] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-09-03 02:39:01,823] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-09-03 02:39:25,772] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-09-03 02:40:34,024] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
[2024-09-03 02:42:31,555] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
[2024-09-03 02:43:17,654] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432

chenxinli001 avatar Sep 03 '24 03:09 chenxinli001

That's not normal, and that loss scale isn't normal either. How large is your dataset?

From the log it looks like there are no errors at all, just every step being skipped?

zRzRzRzRzRzRzR avatar Sep 04 '24 11:09 zRzRzRzRzRzRzR

Same behavior here on a 4xA100 machine 👀

TianxingWu avatar Sep 04 '24 17:09 TianxingWu

same on 8*A800

kyrie111 avatar Sep 06 '24 06:09 kyrie111

Are all of your steps being skipped, @kyrie111 @TianxingWu? Skipping the first few steps and then continuing with normal training, with the loss decreasing, is a normal phenomenon; the first few steps are skipped because the initial loss scale is indeed too large.
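
For reference, this is roughly what DeepSpeed's dynamic loss scaler is doing in those OVERFLOW! lines. A minimal sketch of the logic, with illustrative names rather than DeepSpeed's actual classes: on an Inf/NaN overflow it halves the scale and skips the optimizer step; after enough consecutive clean steps it grows the scale again.

# Minimal sketch of DeepSpeed-style dynamic fp16 loss scaling.
# Names are illustrative; scale_window mirrors loss_scale_window in the deepspeed config.
class DynamicLossScaler:
    def __init__(self, init_scale=2**32, scale_window=400, min_scale=1):
        self.scale = float(init_scale)    # loss is multiplied by this before backward
        self.scale_window = scale_window  # clean steps required before growing again
        self.min_scale = min_scale
        self.clean_steps = 0

    def update(self, found_overflow):
        """Call once per step; returns True if the optimizer step must be skipped."""
        if found_overflow:
            # Inf/NaN in the scaled grads: halve the scale and skip the step.
            # This is exactly the "OVERFLOW! ... reducing to ..." log line.
            self.scale = max(self.scale / 2, self.min_scale)
            self.clean_steps = 0
            return True
        self.clean_steps += 1
        if self.clean_steps >= self.scale_window:
            self.scale *= 2               # try a larger scale again
            self.clean_steps = 0
        return False

In the logs above the scale settles around 33554432 after seven skipped steps (DeepSpeed itself reports step=50, skipped=8), after which the iterations run normally and the loss decreases; it only indicates a real problem if the overflows keep firing for the whole run.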

zRzRzRzRzRzRzR avatar Sep 06 '24 17:09 zRzRzRzRzRzRzR

My issue is solved #261

Same issue on 8 x A100 80G (tried both a single GPU and all 8 GPUs).

I only tried the 2B model.

sft.yaml
args:
  checkpoint_activations: True ## using gradient checkpointing
  model_parallel_size: 1
  experiment_name: lora-test
  mode: finetune
  load: "/root/CogVideo/CogVideoX-2b-sat/transformer"
  no_load_rng: True
  train_iters: 100 # Suggest more than 1000 for LoRA; for SFT, 500 is enough
  eval_iters: 1
  eval_interval: 10
  eval_batch_size: 1
  save: ckpts_2b_lora
  save_interval: 50
  log_interval: 20
  train_data: [ "/root/CogVideo/sat/datasets/test" ] # Train data path
  valid_data: [ "/root/CogVideo/sat/datasets/test" ] # Validation data path, can be the same as train_data (not recommended)
  split: 1,0,0
  num_workers: 8
  force_train: True
  only_log_video_latents: True

data:
  target: data_video.SFTDataset
  params:
    video_size: [ 480, 720 ]
    fps: 8
    max_num_frames: 49
    skip_frms_num: 3.

deepspeed:
  # Minimum of 16 videos per batch across ALL GPUs; this setting is for 8 x A100 GPUs
  train_micro_batch_size_per_gpu: 2
  gradient_accumulation_steps: 1
  steps_per_print: 50
  gradient_clipping: 0.1
  zero_optimization:
    stage: 2
    cpu_offload: false
    contiguous_gradients: false
    overlap_comm: true
    reduce_scatter: true
    reduce_bucket_size: 1000000000
    allgather_bucket_size: 1000000000
    load_from_fp32_weights: false
  zero_allow_untested_optimizer: true
  bf16:
      enabled: False  # For CogVideoX-2B Turn to False and For CogVideoX-5B Turn to True
  fp16:
      enabled: True  # For CogVideoX-2B Turn to True and For CogVideoX-5B Turn to False
  loss_scale: 0
  loss_scale_window: 400
  hysteresis: 2
  min_loss_scale: 1

  optimizer:
    type: sat.ops.FusedEmaAdam
    params:
      lr: 0.001 # Between 1E-3 and 5E-4 For Lora and 1E-5 For SFT
      betas: [ 0.9, 0.95 ]
      eps: 1e-8
      weight_decay: 1e-4
  activation_checkpointing:
    partition_activations: false
    contiguous_memory_optimization: false
  wall_clock_breakdown: false

cogvideox_2b.yaml
model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  log_keys:
    - txt

  denoiser_config:
    target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
    params:
      num_idx: 1000
      quantize_c_noise: False

      weighting_config:
        target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
      scaling_config:
        target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0

  network_config:
    target: dit_video_concat.DiffusionTransformer
    params:
      time_embed_dim: 512
      elementwise_affine: True
      num_frames: 49
      time_compressed_rate: 4
      latent_width: 90
      latent_height: 60
      num_layers: 30
      patch_size: 2
      in_channels: 16
      out_channels: 16
      hidden_size: 1920
      adm_in_channels: 256
      num_attention_heads: 30

      transformer_args:
        checkpoint_activations: True ## using gradient checkpointing
        vocab_size: 1
        max_sequence_length: 64
        layernorm_order: pre
        skip_init: false
        model_parallel_size: 1
        is_decoder: false

      modules:
        pos_embed_config:
          target: dit_video_concat.Basic3DPositionEmbeddingMixin
          params:
            text_length: 226
            height_interpolation: 1.875
            width_interpolation: 1.875

        patch_embed_config:
          target: dit_video_concat.ImagePatchEmbeddingMixin
          params:
            text_hidden_size: 4096

        adaln_layer_config:
          target: dit_video_concat.AdaLNMixin
          params:
            qk_ln: True

        final_layer_config:
          target: dit_video_concat.FinalLayerMixin

  conditioner_config:
    target: sgm.modules.GeneralConditioner
    params:
      emb_models:
        - is_trainable: false
          input_key: txt
          ucg_rate: 0.1
          target: sgm.modules.encoders.modules.FrozenT5Embedder
          params:
            model_dir: "/root/CogVideo/t5-v1_1-xxl"
            max_length: 226

  first_stage_config:
    target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
    params:
      cp_size: 1
      ckpt_path: "/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt"
      ignore_keys: [ 'loss' ]

      loss_config:
        target: torch.nn.Identity

      regularizer_config:
        target: vae_modules.regularizers.DiagonalGaussianRegularizer

      encoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
        params:
          double_z: true
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [ 1, 2, 2, 4 ]
          attn_resolutions: [ ]
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: True

      decoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
        params:
          double_z: True
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [ 1, 2, 2, 4 ]
          attn_resolutions: [ ]
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: False

  loss_fn_config:
    target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
    params:
      offset_noise_level: 0
      sigma_sampler_config:
        target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
        params:
          uniform_sampling: True
          num_idx: 1000
          discretization_config:
            target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
            params:
              shift_scale: 3.0

  sampler_config:
    target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
    params:
      num_steps: 50
      verbose: True

      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0

      guider_config:
        target: sgm.modules.diffusionmodules.guiders.DynamicCFG
        params:
          scale: 6
          exp: 5
          num_steps: 50
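
As a back-of-the-envelope check of the sequence length these shape parameters imply (my own arithmetic, not from the repo; it assumes the usual (num_frames - 1) / time_compressed_rate + 1 temporal compression and the VAE's 8x spatial downsampling):

# Token count implied by the network_config above (back-of-the-envelope).
num_frames, rate = 49, 4                   # num_frames, time_compressed_rate
latent_h, latent_w, patch = 60, 90, 2      # latent_height, latent_width, patch_size
text_length = 226                          # text_length in pos_embed_config

latent_frames = (num_frames - 1) // rate + 1                   # 13
tokens_per_frame = (latent_h // patch) * (latent_w // patch)   # 30 * 45 = 1350
print(latent_frames * tokens_per_frame + text_length)         # 17550 + 226 = 17776

So each sample is a sequence of roughly 17.8k tokens for the DiT, which helps explain the large per-iteration memory numbers in the logs.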
[1st Trial] finetune_single_gpu.sh
RUN on alphacode-ttv-a100-80g-gpu, CUDA_VISIBLE_DEVICES=
WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1 python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed 21247
[2024-09-09 16:39:11,302] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
 [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @autocast_custom_fwd
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:66: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @autocast_custom_bwd
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/kornia/feature/lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
[2024-09-09 16:39:16,571] [INFO] using world size: 1
[2024-09-09 16:39:16,571] [INFO] Will override arguments with manually specified deepspeed_config!
[W909 16:39:16.412494279 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [ip6-localhost]:39375 (errno: 97 - Address family not supported by protocol).
[W909 16:39:16.413593009 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [alphacode-ttv-a100-80g-gpu]:39375 (errno: 97 - Address family not supported by protocol).
[2024-09-09 16:39:16,591] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-09-09 16:39:16,592] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-09 16:39:16,869] [INFO] [RANK 0] building SATVideoDiffusionEngine model ...
[2024-09-09 16:39:26,340] [WARNING] [RANK 0] Failed to load bitsandbytes:No module named 'bitsandbytes'
[2024-09-09 16:39:26,340] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-09-09 16:39:26,364] [INFO] [RANK 0] replacing layer 1 attention with lora
[2024-09-09 16:39:26,387] [INFO] [RANK 0] replacing layer 2 attention with lora
[2024-09-09 16:39:26,411] [INFO] [RANK 0] replacing layer 3 attention with lora
[2024-09-09 16:39:26,487] [INFO] [RANK 0] replacing layer 4 attention with lora
[2024-09-09 16:39:26,518] [INFO] [RANK 0] replacing layer 5 attention with lora
[2024-09-09 16:39:26,542] [INFO] [RANK 0] replacing layer 6 attention with lora
[2024-09-09 16:39:26,567] [INFO] [RANK 0] replacing layer 7 attention with lora
[2024-09-09 16:39:26,591] [INFO] [RANK 0] replacing layer 8 attention with lora
[2024-09-09 16:39:26,621] [INFO] [RANK 0] replacing layer 9 attention with lora
[2024-09-09 16:39:26,726] [INFO] [RANK 0] replacing layer 10 attention with lora
[2024-09-09 16:39:26,870] [INFO] [RANK 0] replacing layer 11 attention with lora
[2024-09-09 16:39:26,999] [INFO] [RANK 0] replacing layer 12 attention with lora
[2024-09-09 16:39:27,074] [INFO] [RANK 0] replacing layer 13 attention with lora
[2024-09-09 16:39:27,127] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-09-09 16:39:27,206] [INFO] [RANK 0] replacing layer 15 attention with lora
[2024-09-09 16:39:27,294] [INFO] [RANK 0] replacing layer 16 attention with lora
[2024-09-09 16:39:27,379] [INFO] [RANK 0] replacing layer 17 attention with lora
[2024-09-09 16:39:27,446] [INFO] [RANK 0] replacing layer 18 attention with lora
[2024-09-09 16:39:27,528] [INFO] [RANK 0] replacing layer 19 attention with lora
[2024-09-09 16:39:27,642] [INFO] [RANK 0] replacing layer 20 attention with lora
[2024-09-09 16:39:27,715] [INFO] [RANK 0] replacing layer 21 attention with lora
[2024-09-09 16:39:27,794] [INFO] [RANK 0] replacing layer 22 attention with lora
[2024-09-09 16:39:27,854] [INFO] [RANK 0] replacing layer 23 attention with lora
[2024-09-09 16:39:27,930] [INFO] [RANK 0] replacing layer 24 attention with lora
[2024-09-09 16:39:27,960] [INFO] [RANK 0] replacing layer 25 attention with lora
[2024-09-09 16:39:27,982] [INFO] [RANK 0] replacing layer 26 attention with lora
[2024-09-09 16:39:28,004] [INFO] [RANK 0] replacing layer 27 attention with lora
[2024-09-09 16:39:28,026] [INFO] [RANK 0] replacing layer 28 attention with lora
[2024-09-09 16:39:28,048] [INFO] [RANK 0] replacing layer 29 attention with lora
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.13it/s]
Initialized embedder #0: FrozenT5Embedder with 4762310656 params. Trainable: False
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
/root/CogVideo/sat/vae_modules/autoencoder.py:565: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  sd = torch.load(path, map_location="cpu")["state_dict"]
Deleting key loss.logvar from state_dict.
Deleting key loss.perceptual_loss.scaling_layer.shift from state_dict.
Deleting key loss.perceptual_loss.scaling_layer.scale from state_dict.
Deleting key loss.perceptual_loss.net.slice1.0.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice1.0.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice1.2.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice1.2.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice2.5.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice2.5.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice2.7.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice2.7.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice3.10.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice3.10.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice3.12.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice3.12.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice3.14.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice3.14.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice4.17.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice4.17.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice4.19.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice4.19.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice4.21.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice4.21.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice5.24.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice5.24.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice5.26.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice5.26.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice5.28.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice5.28.bias from state_dict.
Deleting key loss.perceptual_loss.lin0.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin1.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin2.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin3.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin4.model.1.weight from state_dict.
Deleting key loss.discriminator.blocks.0.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.0.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.0.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.0.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.conv_res.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.conv_res.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.downsample.1.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.downsample.1.bias from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_q.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_kv.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_out.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.conv_res.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.conv_res.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.downsample.1.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.downsample.1.bias from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_q.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_kv.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_out.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.6.0.conv_res.weight from state_dict.
Deleting key loss.discriminator.blocks.6.0.conv_res.bias from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_q.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_kv.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_out.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.bias from state_dict.
Deleting key loss.discriminator.to_logits.0.weight from state_dict.
Deleting key loss.discriminator.to_logits.0.bias from state_dict.
Deleting key loss.discriminator.to_logits.3.weight from state_dict.
Deleting key loss.discriminator.to_logits.3.bias from state_dict.
Missing keys:  []
Unexpected keys:  []
Restored from /root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt
[2024-09-09 16:39:32,189] [INFO] [RANK 0]  > number of parameters on model parallel rank 0: 6764790755
[2024-09-09 16:39:42,369] [INFO] [RANK 0] global rank 0 is loading checkpoint /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/model_io.py:286: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  sd = torch.load(checkpoint_name, map_location='cpu')
[2024-09-09 16:39:43,764] [INFO] [RANK 0] > successfully loaded /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt
[2024-09-09 16:39:45,132] [INFO] [RANK 0] ***** Total trainable parameters: 58982400 *****
[2024-09-09 16:39:45,132] [INFO] [RANK 0] [<class 'sat.ops.layernorm.LayerNorm'>, <class 'torch.nn.modules.normalization.LayerNorm'>, <class 'sat.ops.layernorm.RMSNorm'>] is set to no_weight_decay
[2024-09-09 16:39:45,136] [INFO] [RANK 0] Syncing initialized parameters...
[2024-09-09 16:39:45,239] [INFO] [RANK 0] Finished syncing initialized parameters.
[2024-09-09 16:39:45,239] [INFO] [RANK 0] Using optimizer sat.ops.FusedEmaAdam from sat.
[2024-09-09 16:39:45,239] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
[2024-09-09 16:39:45,240] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2024-09-09 16:39:45,337] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py312_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py312_cu121/fused_ema_adam/build.ninja...
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module fused_ema_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_ema_adam...
Time to load fused_ema_adam op: 0.7258331775665283 seconds
[2024-09-09 16:39:46,219] [INFO] [logging.py:96:log_dist] [Rank 0] Using client callable to create basic optimizer
[2024-09-09 16:39:46,219] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-09-09 16:39:46,239] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedEmaAdam
[2024-09-09 16:39:46,239] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedEmaAdam type=<class 'sat.ops.fused_ema_adam.FusedEmaAdam'>
[2024-09-09 16:39:46,239] [WARNING] [engine.py:1179:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2024-09-09 16:39:46,239] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2024-09-09 16:39:46,239] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 1000000000
[2024-09-09 16:39:46,239] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 1000000000
[2024-09-09 16:39:46,239] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-09-09 16:39:46,239] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[2024-09-09 16:39:48,450] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-09-09 16:39:48,450] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB         Max_MA 12.97 GB         CA 13.23 GB         Max_CA 13 GB 
[2024-09-09 16:39:48,451] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 31.17 GB, percent = 1.6%
[2024-09-09 16:39:48,690] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-09-09 16:39:48,691] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB         Max_MA 13.08 GB         CA 13.45 GB         Max_CA 13 GB 
[2024-09-09 16:39:48,691] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 31.19 GB, percent = 1.6%
[2024-09-09 16:39:48,691] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
[2024-09-09 16:39:48,948] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-09-09 16:39:48,949] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB         Max_MA 12.86 GB         CA 13.45 GB         Max_CA 13 GB 
[2024-09-09 16:39:48,949] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 31.25 GB, percent = 1.6%
[2024-09-09 16:39:48,953] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2024-09-09 16:39:48,953] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-09-09 16:39:48,953] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-09-09 16:39:48,954] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1.0], mom=[[0.9, 0.95]]
[2024-09-09 16:39:48,956] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2024-09-09 16:39:48,957] [INFO] [config.py:1001:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-09-09 16:39:48,957] [INFO] [config.py:1001:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-09-09 16:39:48,957] [INFO] [config.py:1001:print]   amp_enabled .................. False
[2024-09-09 16:39:48,957] [INFO] [config.py:1001:print]   amp_params ................... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   bfloat16_enabled ............. False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   bfloat16_immediate_grad_update  False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   checkpoint_parallel_write_pipeline  False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   checkpoint_tag_validation_enabled  True
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   checkpoint_tag_validation_fail  False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f0915235d60>
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   communication_data_type ...... None
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   curriculum_enabled_legacy .... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   curriculum_params_legacy ..... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   data_efficiency_enabled ...... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   dataloader_drop_last ......... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   disable_allgather ............ False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   dump_state ................... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   dynamic_loss_scale_args ...... None
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   eigenvalue_enabled ........... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   eigenvalue_gas_boundary_resolution  1
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   eigenvalue_layer_num ......... 0
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   eigenvalue_max_iter .......... 100
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   eigenvalue_stability ......... 1e-06
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   eigenvalue_tol ............... 0.01
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   eigenvalue_verbose ........... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   elasticity_enabled ........... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   fp16_auto_cast ............... False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   fp16_enabled ................. True
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   fp16_master_weights_and_gradients  False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   global_rank .................. 0
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   grad_accum_dtype ............. None
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   gradient_accumulation_steps .. 1
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   gradient_clipping ............ 0.1
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   gradient_predivide_factor .... 1.0
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   graph_harvesting ............. False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   initial_dynamic_scale ........ 65536
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   load_universal_checkpoint .... False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   loss_scale ................... 0
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   memory_breakdown ............. False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   mics_hierarchial_params_gather  False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   mics_shard_size .............. -1
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   optimizer_legacy_fusion ...... False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   optimizer_name ............... None
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   optimizer_params ............. None
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   pld_enabled .................. False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   pld_params ................... False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   prescale_gradients ........... False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   scheduler_name ............... None
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   scheduler_params ............. None
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   seq_parallel_communication_data_type  torch.float32
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print]   sparse_attention ............. None
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print]   sparse_gradients_enabled ..... False
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print]   steps_per_print .............. 50
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print]   timers_config ................ enabled=True synchronized=True
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print]   train_batch_size ............. 2
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print]   train_micro_batch_size_per_gpu  2
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print]   use_data_before_expert_parallel_  False
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print]   use_node_local_storage ....... False
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print]   wall_clock_breakdown ......... False
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print]   weight_quantization_config ... None
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print]   world_size ................... 1
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print]   zero_allow_untested_optimizer  True
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print]   zero_config .................. stage=2 contiguous_gradients=False reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=False elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print]   zero_enabled ................. True
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print]   zero_force_ds_cpu_optimizer .. True
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print]   zero_optimization_stage ...... 2
[2024-09-09 16:39:48,960] [INFO] [config.py:987:print_user_config]   json = {
    "train_micro_batch_size_per_gpu": 2, 
    "gradient_accumulation_steps": 1, 
    "steps_per_print": 50, 
    "gradient_clipping": 0.1, 
    "zero_optimization": {
        "stage": 2, 
        "cpu_offload": false, 
        "contiguous_gradients": false, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 1.000000e+09, 
        "allgather_bucket_size": 1.000000e+09, 
        "load_from_fp32_weights": false
    }, 
    "zero_allow_untested_optimizer": true, 
    "bf16": {
        "enabled": false
    }, 
    "fp16": {
        "enabled": true
    }, 
    "loss_scale": 0, 
    "loss_scale_window": 400, 
    "hysteresis": 2, 
    "min_loss_scale": 1, 
    "activation_checkpointing": {
        "partition_activations": false, 
        "contiguous_memory_optimization": false
    }, 
    "wall_clock_breakdown": false
}
[2024-09-09 16:39:48,960] [INFO] [RANK 0] learning rate decaying style linear, ratio 10.0
[2024-09-09 16:39:48,960] [INFO] [RANK 0] Finetuning Model...
[2024-09-09 16:39:48,960] [INFO] [RANK 0] arguments:
[2024-09-09 16:39:48,960] [INFO] [RANK 0]   base ......................... ['configs/cogvideox_2b_lora.yaml', 'configs/sft.yaml']
[2024-09-09 16:39:48,960] [INFO] [RANK 0]   model_parallel_size .......... 1
[2024-09-09 16:39:48,960] [INFO] [RANK 0]   force_pretrain ............... False
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   device ....................... 0
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   debug ........................ False
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   log_image .................... True
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   output_dir ................... samples
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   input_dir .................... None
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   input_type ................... cli
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   input_file ................... input.txt
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   final_size ................... 2048
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   sdedit ....................... False
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   grid_num_rows ................ 1
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   force_inference .............. False
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   lcm_steps .................... None
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   sampling_num_frames .......... 32
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   sampling_fps ................. 8
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   only_save_latents ............ False
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   only_log_video_latents ....... True
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   latent_channels .............. 32
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   image2video .................. False
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   experiment_name .............. lora-test-09-09-16-39
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   train_iters .................. 100
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   batch_size ................... 2
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   lr ........................... 0.001
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   mode ......................... finetune
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   seed ......................... 21247
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   zero_stage ................... 0
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   checkpoint_activations ....... True
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   checkpoint_num_layers ........ 1
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   checkpoint_skip_layers ....... 0
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   fp16 ......................... True
[2024-09-09 16:39:48,961] [INFO] [RANK 0]   bf16 ......................... False
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   gradient_accumulation_steps .. 1
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   profiling .................... -1
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   epochs ....................... None
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   log_interval ................. 20
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   summary_dir .................. 
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   save_args .................... False
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   lr_decay_iters ............... None
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   lr_decay_style ............... linear
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   lr_decay_ratio ............... 0.1
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   warmup ....................... 0.01
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   weight_decay ................. 0.0001
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   save ......................... ckpts_2b_lora/lora-test-09-09-16-39
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   load ......................... /root/CogVideo/CogVideoX-2b-sat/transformer
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   force_train .................. True
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   save_interval ................ 50
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   no_save_rng .................. False
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   no_load_rng .................. True
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   resume_dataloader ............ False
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   distributed_backend .......... nccl
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   local_rank ................... 0
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   exit_interval ................ None
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   wandb ........................ False
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   wandb_project_name ........... default_project
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   eval_batch_size .............. 1
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   eval_iters ................... 1
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   eval_interval ................ 10
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   strict_eval .................. False
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   train_data ................... ['/root/CogVideo/sat/datasets/test']
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   train_data_weights ........... None
[2024-09-09 16:39:48,962] [INFO] [RANK 0]   iterable_dataset ............. False
[2024-09-09 16:39:48,963] [INFO] [RANK 0]   iterable_dataset_eval ........ 
[2024-09-09 16:39:48,963] [INFO] [RANK 0]   batch_from_same_dataset ...... False
[2024-09-09 16:39:48,963] [INFO] [RANK 0]   valid_data ................... ['/root/CogVideo/sat/datasets/test']
[2024-09-09 16:39:48,963] [INFO] [RANK 0]   test_data .................... None
[2024-09-09 16:39:48,963] [INFO] [RANK 0]   split ........................ 1,0,0
[2024-09-09 16:39:48,963] [INFO] [RANK 0]   num_workers .................. 8
[2024-09-09 16:39:48,963] [INFO] [RANK 0]   block_size ................... 10000
[2024-09-09 16:39:48,963] [INFO] [RANK 0]   prefetch_factor .............. 4
[2024-09-09 16:39:48,963] [INFO] [RANK 0]   deepspeed .................... True
[2024-09-09 16:39:48,963] [INFO] [RANK 0]   deepspeed_config ............. {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}
[2024-09-09 16:39:48,963] [INFO] [RANK 0]   deepscale .................... False
[2024-09-09 16:39:48,963] [INFO] [RANK 0]   deepscale_config ............. None
[2024-09-09 16:39:48,963] [INFO] [RANK 0]   model_config ................. {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False, 'num_layers': 30, 'hidden_size': 1920, 'num_attention_heads': 30, 'parallel_output': True}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}, 'dtype': 'fp16'}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': '/root/CogVideo/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': '/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': 
{'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}
[2024-09-09 16:39:48,963] [INFO] [RANK 0]   data_config .................. {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}
[2024-09-09 16:39:48,963] [INFO] [RANK 0]   cuda ......................... True
[2024-09-09 16:39:48,963] [INFO] [RANK 0]   rank ......................... 0
[2024-09-09 16:39:48,963] [INFO] [RANK 0]   world_size ................... 1
[2024-09-09 16:39:48,964] [INFO] [RANK 0]   deepspeed_activation_checkpointing  True
[2024-09-09 16:39:48,964] [INFO] [RANK 0]   master_ip .................... localhost
[2024-09-09 16:39:48,964] [INFO] [RANK 0]   master_port .................. 39375
[2024-09-09 16:39:48,964] [INFO] [RANK 0]   log_config ................... [{'model': {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': '/root/CogVideo/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': '/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 
50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}}, {'args': {'checkpoint_activations': True, 'model_parallel_size': 1, 'experiment_name': 'lora-test', 'mode': 'finetune', 'load': '/root/CogVideo/CogVideoX-2b-sat/transformer', 'no_load_rng': True, 'train_iters': 100, 'eval_iters': 1, 'eval_interval': 10, 'eval_batch_size': 1, 'save': 'ckpts_2b_lora', 'save_interval': 50, 'log_interval': 20, 'train_data': ['/root/CogVideo/sat/datasets/test'], 'valid_data': ['/root/CogVideo/sat/datasets/test'], 'split': '1,0,0', 'num_workers': 8, 'force_train': True, 'only_log_video_latents': True}, 'data': {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}, 'deepspeed': {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'optimizer': {'type': 'sat.ops.FusedEmaAdam', 'params': {'lr': 0.001, 'betas': [0.9, 0.95], 'eps': '1e-8', 'weight_decay': '1e-4'}}, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}}]
[2024-09-09 16:39:48,964] [INFO] [RANK 0]   do_train ..................... True
[2024-09-09 16:39:48,964] [INFO] [RANK 0]   val_last_shape ............... []
[2024-09-09 16:39:48,964] [INFO] [RANK 0]   val_drop_number .............. 0
[2024-09-09 16:39:48,964] [INFO] [RANK 0]   do_valid ..................... True
[2024-09-09 16:39:48,964] [INFO] [RANK 0]   do_test ...................... False
[2024-09-09 16:39:48,964] [INFO] [RANK 0]   iteration .................... 0
[2024-09-09 16:40:39,276] [INFO] [checkpointing.py:541:forward] Activation Checkpointing Information
[2024-09-09 16:40:39,276] [INFO] [checkpointing.py:542:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-09-09 16:40:39,276] [INFO] [checkpointing.py:543:forward] ----contiguous Memory Checkpointing False with None total layers
[2024-09-09 16:40:39,276] [INFO] [checkpointing.py:545:forward] ----Synchronization False
[2024-09-09 16:40:39,276] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False
[2024-09-09 16:40:49,239] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-09 16:41:14,908] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/CogVideo/sat/train_video.py", line 226, in <module>
[rank0]:     training_main(
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 157, in training_main
[rank0]:     iteration, skipped = train(model, optimizer,
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 359, in train
[rank0]:     lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
[rank0]:                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 443, in train_step
[rank0]:     forward_ret = forward_step(data_iterator, model, args, timers, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/CogVideo/sat/train_video.py", line 176, in forward_step
[rank0]:     batch = next(data_iterator)
[rank0]:             ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank0]:     data = self._next_data()
[rank0]:            ^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
[rank0]:     return self._process_data(data)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
[rank0]:     data.reraise()
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/_utils.py", line 706, in reraise
[rank0]:     raise exception
[rank0]: ZeroDivisionError: Caught ZeroDivisionError in DataLoader worker process 2.
[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank0]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]:             ~~~~~~~~~~~~^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 360, in __getitem__
[rank0]:     return self.wrapped_data[index]
[rank0]:            ~~~~~~~~~~~~~~~~~^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 342, in __getitem__
[rank0]:     return self.datasets[dataset_idx][sample_idx]
[rank0]:            ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
[rank0]:   File "/root/CogVideo/sat/data_video.py", line 411, in __getitem__
[rank0]:     indices = np.arange(start, end, (end - start) // num_frames).astype(int)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ZeroDivisionError: division by zero
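The ZeroDivisionError is a data problem, not an fp16 one: data_video.py line 411 computes the frame-sampling stride as (end - start) // num_frames, which integer-divides to 0 whenever the usable span of a clip is shorter than num_frames (49 here), and np.arange then raises ZeroDivisionError on the zero step inside the DataLoader worker. A minimal sketch of that computation with a guard, assuming only what the traceback shows (sample_frame_indices is a hypothetical helper, not a function in the repo):

import numpy as np

def sample_frame_indices(start: int, end: int, num_frames: int) -> np.ndarray:
    # Mirrors data_video.py line 411:
    #   indices = np.arange(start, end, (end - start) // num_frames).astype(int)
    stride = (end - start) // num_frames
    if stride == 0:
        # end - start < num_frames: np.arange would receive step=0 and raise
        # the ZeroDivisionError seen in the worker traceback above.
        raise ValueError(
            f"usable span is {end - start} frames but num_frames={num_frames}; "
            "skip or pad this clip instead of striding through it"
        )
    return np.arange(start, end, stride).astype(int)

With max_num_frames=49 and skip_frms_num=3.0 (the data_config in this log), a clip can end up with a too-short span either because trimming skip_frms_num from each end leaves fewer than 49 frames or, depending on which branch of SFTDataset.__getitem__ computed start and end, because the source fps is below the target fps of 8. Filtering such clips out of the dataset (or letting them take the padding path) avoids the worker crash.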

[2nd Trial] Re-ran after selecting only videos with no more than 50 frames:
(cogvideo) root@alphacode-ttv-a100-80g-gpu:~/CogVideo/sat# bash finetune_single_gpu.sh 
RUN on alphacode-ttv-a100-80g-gpu, CUDA_VISIBLE_DEVICES=
WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1 python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed 5243
[2024-09-09 16:57:30,500] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
 [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @autocast_custom_fwd
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:66: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @autocast_custom_bwd
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/kornia/feature/lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
[2024-09-09 16:57:35,259] [INFO] using world size: 1
[2024-09-09 16:57:35,259] [INFO] Will override arguments with manually specified deepspeed_config!
[W909 16:57:35.100558963 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [ip6-localhost]:57495 (errno: 97 - Address family not supported by protocol).
[W909 16:57:35.104642776 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [alphacode-ttv-a100-80g-gpu]:57495 (errno: 97 - Address family not supported by protocol).
[2024-09-09 16:57:35,282] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-09-09 16:57:35,283] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-09 16:57:35,516] [INFO] [RANK 0] building SATVideoDiffusionEngine model ...
[2024-09-09 16:57:44,744] [WARNING] [RANK 0] Failed to load bitsandbytes:No module named 'bitsandbytes'
[2024-09-09 16:57:44,744] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-09-09 16:57:44,781] [INFO] [RANK 0] replacing layer 1 attention with lora
[2024-09-09 16:57:44,816] [INFO] [RANK 0] replacing layer 2 attention with lora
[2024-09-09 16:57:44,841] [INFO] [RANK 0] replacing layer 3 attention with lora
[2024-09-09 16:57:44,863] [INFO] [RANK 0] replacing layer 4 attention with lora
[2024-09-09 16:57:44,885] [INFO] [RANK 0] replacing layer 5 attention with lora
[2024-09-09 16:57:44,907] [INFO] [RANK 0] replacing layer 6 attention with lora
[2024-09-09 16:57:44,982] [INFO] [RANK 0] replacing layer 7 attention with lora
[2024-09-09 16:57:45,090] [INFO] [RANK 0] replacing layer 8 attention with lora
[2024-09-09 16:57:45,159] [INFO] [RANK 0] replacing layer 9 attention with lora
[2024-09-09 16:57:45,273] [INFO] [RANK 0] replacing layer 10 attention with lora
[2024-09-09 16:57:45,422] [INFO] [RANK 0] replacing layer 11 attention with lora
[2024-09-09 16:57:45,550] [INFO] [RANK 0] replacing layer 12 attention with lora
[2024-09-09 16:57:45,658] [INFO] [RANK 0] replacing layer 13 attention with lora
[2024-09-09 16:57:45,774] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-09-09 16:57:45,905] [INFO] [RANK 0] replacing layer 15 attention with lora
[2024-09-09 16:57:46,027] [INFO] [RANK 0] replacing layer 16 attention with lora
[2024-09-09 16:57:46,102] [INFO] [RANK 0] replacing layer 17 attention with lora
[2024-09-09 16:57:46,195] [INFO] [RANK 0] replacing layer 18 attention with lora
[2024-09-09 16:57:46,302] [INFO] [RANK 0] replacing layer 19 attention with lora
[2024-09-09 16:57:46,347] [INFO] [RANK 0] replacing layer 20 attention with lora
[2024-09-09 16:57:46,375] [INFO] [RANK 0] replacing layer 21 attention with lora
[2024-09-09 16:57:46,397] [INFO] [RANK 0] replacing layer 22 attention with lora
[2024-09-09 16:57:46,419] [INFO] [RANK 0] replacing layer 23 attention with lora
[2024-09-09 16:57:46,440] [INFO] [RANK 0] replacing layer 24 attention with lora
[2024-09-09 16:57:46,461] [INFO] [RANK 0] replacing layer 25 attention with lora
[2024-09-09 16:57:46,483] [INFO] [RANK 0] replacing layer 26 attention with lora
[2024-09-09 16:57:46,504] [INFO] [RANK 0] replacing layer 27 attention with lora
[2024-09-09 16:57:46,526] [INFO] [RANK 0] replacing layer 28 attention with lora
[2024-09-09 16:57:46,547] [INFO] [RANK 0] replacing layer 29 attention with lora
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.01s/it]
Initialized embedder #0: FrozenT5Embedder with 4762310656 params. Trainable: False
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
/root/CogVideo/sat/vae_modules/autoencoder.py:565: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  sd = torch.load(path, map_location="cpu")["state_dict"]
Deleting key loss.logvar from state_dict.
Deleting key loss.perceptual_loss.scaling_layer.shift from state_dict.
Deleting key loss.perceptual_loss.scaling_layer.scale from state_dict.
Deleting key loss.perceptual_loss.net.slice1.0.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice1.0.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice1.2.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice1.2.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice2.5.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice2.5.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice2.7.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice2.7.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice3.10.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice3.10.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice3.12.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice3.12.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice3.14.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice3.14.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice4.17.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice4.17.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice4.19.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice4.19.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice4.21.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice4.21.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice5.24.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice5.24.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice5.26.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice5.26.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice5.28.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice5.28.bias from state_dict.
Deleting key loss.perceptual_loss.lin0.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin1.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin2.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin3.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin4.model.1.weight from state_dict.
Deleting key loss.discriminator.blocks.0.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.0.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.0.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.0.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.conv_res.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.conv_res.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.downsample.1.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.downsample.1.bias from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_q.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_kv.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_out.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.conv_res.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.conv_res.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.downsample.1.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.downsample.1.bias from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_q.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_kv.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_out.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.6.0.conv_res.weight from state_dict.
Deleting key loss.discriminator.blocks.6.0.conv_res.bias from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_q.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_kv.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_out.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.bias from state_dict.
Deleting key loss.discriminator.to_logits.0.weight from state_dict.
Deleting key loss.discriminator.to_logits.0.bias from state_dict.
Deleting key loss.discriminator.to_logits.3.weight from state_dict.
Deleting key loss.discriminator.to_logits.3.bias from state_dict.
Missing keys:  []
Unexpected keys:  []
Restored from /root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt
[2024-09-09 16:57:50,806] [INFO] [RANK 0]  > number of parameters on model parallel rank 0: 6764790755
[2024-09-09 16:58:00,971] [INFO] [RANK 0] global rank 0 is loading checkpoint /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/model_io.py:286: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  sd = torch.load(checkpoint_name, map_location='cpu')
[2024-09-09 16:58:02,528] [INFO] [RANK 0] > successfully loaded /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt
[2024-09-09 16:58:03,506] [INFO] [RANK 0] ***** Total trainable parameters: 58982400 *****
[2024-09-09 16:58:03,506] [INFO] [RANK 0] [<class 'sat.ops.layernorm.LayerNorm'>, <class 'torch.nn.modules.normalization.LayerNorm'>, <class 'sat.ops.layernorm.RMSNorm'>] is set to no_weight_decay
[2024-09-09 16:58:03,509] [INFO] [RANK 0] Syncing initialized parameters...
[2024-09-09 16:58:03,623] [INFO] [RANK 0] Finished syncing initialized parameters.
[2024-09-09 16:58:03,624] [INFO] [RANK 0] Using optimizer sat.ops.FusedEmaAdam from sat.
[2024-09-09 16:58:03,624] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
[2024-09-09 16:58:03,625] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2024-09-09 16:58:03,717] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py312_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py312_cu121/fused_ema_adam/build.ninja...
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module fused_ema_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_ema_adam...
Time to load fused_ema_adam op: 0.6912670135498047 seconds
[2024-09-09 16:58:04,567] [INFO] [logging.py:96:log_dist] [Rank 0] Using client callable to create basic optimizer
[2024-09-09 16:58:04,567] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-09-09 16:58:04,587] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedEmaAdam
[2024-09-09 16:58:04,587] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedEmaAdam type=<class 'sat.ops.fused_ema_adam.FusedEmaAdam'>
[2024-09-09 16:58:04,587] [WARNING] [engine.py:1179:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2024-09-09 16:58:04,587] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2024-09-09 16:58:04,587] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 1000000000
[2024-09-09 16:58:04,587] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 1000000000
[2024-09-09 16:58:04,587] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-09-09 16:58:04,587] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[2024-09-09 16:58:06,802] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-09-09 16:58:06,803] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB         Max_MA 12.97 GB         CA 13.23 GB         Max_CA 13 GB 
[2024-09-09 16:58:06,803] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 32.85 GB, percent = 1.7%
[2024-09-09 16:58:07,025] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-09-09 16:58:07,025] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB         Max_MA 13.08 GB         CA 13.45 GB         Max_CA 13 GB 
[2024-09-09 16:58:07,025] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 32.71 GB, percent = 1.7%
[2024-09-09 16:58:07,025] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
[2024-09-09 16:58:07,246] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-09-09 16:58:07,246] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB         Max_MA 12.86 GB         CA 13.45 GB         Max_CA 13 GB 
[2024-09-09 16:58:07,246] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 32.93 GB, percent = 1.7%
[2024-09-09 16:58:07,251] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2024-09-09 16:58:07,251] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-09-09 16:58:07,251] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-09-09 16:58:07,251] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1.0], mom=[[0.9, 0.95]]
[2024-09-09 16:58:07,254] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2024-09-09 16:58:07,254] [INFO] [config.py:1001:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-09-09 16:58:07,254] [INFO] [config.py:1001:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-09-09 16:58:07,254] [INFO] [config.py:1001:print]   amp_enabled .................. False
[2024-09-09 16:58:07,254] [INFO] [config.py:1001:print]   amp_params ................... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   bfloat16_enabled ............. False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   bfloat16_immediate_grad_update  False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   checkpoint_parallel_write_pipeline  False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   checkpoint_tag_validation_enabled  True
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   checkpoint_tag_validation_fail  False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fcc80151100>
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   communication_data_type ...... None
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   curriculum_enabled_legacy .... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   curriculum_params_legacy ..... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   data_efficiency_enabled ...... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   dataloader_drop_last ......... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   disable_allgather ............ False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   dump_state ................... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   dynamic_loss_scale_args ...... None
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   eigenvalue_enabled ........... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   eigenvalue_gas_boundary_resolution  1
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   eigenvalue_layer_num ......... 0
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   eigenvalue_max_iter .......... 100
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   eigenvalue_stability ......... 1e-06
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   eigenvalue_tol ............... 0.01
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   eigenvalue_verbose ........... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   elasticity_enabled ........... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   fp16_auto_cast ............... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print]   fp16_enabled ................. True
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   fp16_master_weights_and_gradients  False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   global_rank .................. 0
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   grad_accum_dtype ............. None
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   gradient_accumulation_steps .. 1
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   gradient_clipping ............ 0.1
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   gradient_predivide_factor .... 1.0
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   graph_harvesting ............. False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   initial_dynamic_scale ........ 65536
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   load_universal_checkpoint .... False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   loss_scale ................... 0
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   memory_breakdown ............. False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   mics_hierarchial_params_gather  False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   mics_shard_size .............. -1
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   optimizer_legacy_fusion ...... False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   optimizer_name ............... None
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   optimizer_params ............. None
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   pld_enabled .................. False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   pld_params ................... False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   prescale_gradients ........... False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   scheduler_name ............... None
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   scheduler_params ............. None
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   seq_parallel_communication_data_type  torch.float32
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   sparse_attention ............. None
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   sparse_gradients_enabled ..... False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   steps_per_print .............. 50
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   timers_config ................ enabled=True synchronized=True
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   train_batch_size ............. 2
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   train_micro_batch_size_per_gpu  2
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   use_data_before_expert_parallel_  False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   use_node_local_storage ....... False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   wall_clock_breakdown ......... False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   weight_quantization_config ... None
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   world_size ................... 1
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print]   zero_allow_untested_optimizer  True
[2024-09-09 16:58:07,257] [INFO] [config.py:1001:print]   zero_config .................. stage=2 contiguous_gradients=False reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=False elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-09-09 16:58:07,257] [INFO] [config.py:1001:print]   zero_enabled ................. True
[2024-09-09 16:58:07,257] [INFO] [config.py:1001:print]   zero_force_ds_cpu_optimizer .. True
[2024-09-09 16:58:07,257] [INFO] [config.py:1001:print]   zero_optimization_stage ...... 2
[2024-09-09 16:58:07,257] [INFO] [config.py:987:print_user_config]   json = {
    "train_micro_batch_size_per_gpu": 2, 
    "gradient_accumulation_steps": 1, 
    "steps_per_print": 50, 
    "gradient_clipping": 0.1, 
    "zero_optimization": {
        "stage": 2, 
        "cpu_offload": false, 
        "contiguous_gradients": false, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 1.000000e+09, 
        "allgather_bucket_size": 1.000000e+09, 
        "load_from_fp32_weights": false
    }, 
    "zero_allow_untested_optimizer": true, 
    "bf16": {
        "enabled": false
    }, 
    "fp16": {
        "enabled": true
    }, 
    "loss_scale": 0, 
    "loss_scale_window": 400, 
    "hysteresis": 2, 
    "min_loss_scale": 1, 
    "activation_checkpointing": {
        "partition_activations": false, 
        "contiguous_memory_optimization": false
    }, 
    "wall_clock_breakdown": false
}
[2024-09-09 16:58:07,257] [INFO] [RANK 0] learning rate decaying style linear, ratio 10.0
[2024-09-09 16:58:07,257] [INFO] [RANK 0] Finetuning Model...
[2024-09-09 16:58:07,257] [INFO] [RANK 0] arguments:
[2024-09-09 16:58:07,257] [INFO] [RANK 0]   base ......................... ['configs/cogvideox_2b_lora.yaml', 'configs/sft.yaml']
[2024-09-09 16:58:07,257] [INFO] [RANK 0]   model_parallel_size .......... 1
[2024-09-09 16:58:07,257] [INFO] [RANK 0]   force_pretrain ............... False
[2024-09-09 16:58:07,257] [INFO] [RANK 0]   device ....................... 0
[2024-09-09 16:58:07,257] [INFO] [RANK 0]   debug ........................ False
[2024-09-09 16:58:07,257] [INFO] [RANK 0]   log_image .................... True
[2024-09-09 16:58:07,257] [INFO] [RANK 0]   output_dir ................... samples
[2024-09-09 16:58:07,257] [INFO] [RANK 0]   input_dir .................... None
[2024-09-09 16:58:07,257] [INFO] [RANK 0]   input_type ................... cli
[2024-09-09 16:58:07,257] [INFO] [RANK 0]   input_file ................... input.txt
[2024-09-09 16:58:07,257] [INFO] [RANK 0]   final_size ................... 2048
[2024-09-09 16:58:07,257] [INFO] [RANK 0]   sdedit ....................... False
[2024-09-09 16:58:07,257] [INFO] [RANK 0]   grid_num_rows ................ 1
[2024-09-09 16:58:07,257] [INFO] [RANK 0]   force_inference .............. False
[2024-09-09 16:58:07,257] [INFO] [RANK 0]   lcm_steps .................... None
[2024-09-09 16:58:07,257] [INFO] [RANK 0]   sampling_num_frames .......... 32
[2024-09-09 16:58:07,257] [INFO] [RANK 0]   sampling_fps ................. 8
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   only_save_latents ............ False
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   only_log_video_latents ....... True
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   latent_channels .............. 32
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   image2video .................. False
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   experiment_name .............. lora-test-09-09-16-57
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   train_iters .................. 100
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   batch_size ................... 2
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   lr ........................... 0.001
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   mode ......................... finetune
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   seed ......................... 5243
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   zero_stage ................... 0
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   checkpoint_activations ....... True
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   checkpoint_num_layers ........ 1
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   checkpoint_skip_layers ....... 0
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   fp16 ......................... True
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   bf16 ......................... False
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   gradient_accumulation_steps .. 1
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   profiling .................... -1
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   epochs ....................... None
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   log_interval ................. 20
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   summary_dir .................. 
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   save_args .................... False
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   lr_decay_iters ............... None
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   lr_decay_style ............... linear
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   lr_decay_ratio ............... 0.1
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   warmup ....................... 0.01
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   weight_decay ................. 0.0001
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   save ......................... ckpts_2b_lora/lora-test-09-09-16-57
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   load ......................... /root/CogVideo/CogVideoX-2b-sat/transformer
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   force_train .................. True
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   save_interval ................ 50
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   no_save_rng .................. False
[2024-09-09 16:58:07,258] [INFO] [RANK 0]   no_load_rng .................. True
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   resume_dataloader ............ False
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   distributed_backend .......... nccl
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   local_rank ................... 0
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   exit_interval ................ None
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   wandb ........................ False
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   wandb_project_name ........... default_project
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   eval_batch_size .............. 1
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   eval_iters ................... 1
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   eval_interval ................ 10
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   strict_eval .................. False
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   train_data ................... ['/root/CogVideo/sat/datasets/test']
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   train_data_weights ........... None
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   iterable_dataset ............. False
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   iterable_dataset_eval ........ 
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   batch_from_same_dataset ...... False
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   valid_data ................... ['/root/CogVideo/sat/datasets/test']
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   test_data .................... None
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   split ........................ 1,0,0
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   num_workers .................. 8
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   block_size ................... 10000
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   prefetch_factor .............. 4
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   deepspeed .................... True
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   deepspeed_config ............. {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   deepscale .................... False
[2024-09-09 16:58:07,259] [INFO] [RANK 0]   deepscale_config ............. None
[2024-09-09 16:58:07,260] [INFO] [RANK 0]   model_config ................. {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False, 'num_layers': 30, 'hidden_size': 1920, 'num_attention_heads': 30, 'parallel_output': True}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}, 'dtype': 'fp16'}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': '/root/CogVideo/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': '/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': 
{'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}
[2024-09-09 16:58:07,260] [INFO] [RANK 0]   data_config .................. {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}
[2024-09-09 16:58:07,260] [INFO] [RANK 0]   cuda ......................... True
[2024-09-09 16:58:07,260] [INFO] [RANK 0]   rank ......................... 0
[2024-09-09 16:58:07,260] [INFO] [RANK 0]   world_size ................... 1
[2024-09-09 16:58:07,260] [INFO] [RANK 0]   deepspeed_activation_checkpointing  True
[2024-09-09 16:58:07,260] [INFO] [RANK 0]   master_ip .................... localhost
[2024-09-09 16:58:07,260] [INFO] [RANK 0]   master_port .................. 57495
[2024-09-09 16:58:07,260] [INFO] [RANK 0]   log_config ................... [{'model': {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': '/root/CogVideo/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': '/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 
50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}}, {'args': {'checkpoint_activations': True, 'model_parallel_size': 1, 'experiment_name': 'lora-test', 'mode': 'finetune', 'load': '/root/CogVideo/CogVideoX-2b-sat/transformer', 'no_load_rng': True, 'train_iters': 100, 'eval_iters': 1, 'eval_interval': 10, 'eval_batch_size': 1, 'save': 'ckpts_2b_lora', 'save_interval': 50, 'log_interval': 20, 'train_data': ['/root/CogVideo/sat/datasets/test'], 'valid_data': ['/root/CogVideo/sat/datasets/test'], 'split': '1,0,0', 'num_workers': 8, 'force_train': True, 'only_log_video_latents': True}, 'data': {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}, 'deepspeed': {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'optimizer': {'type': 'sat.ops.FusedEmaAdam', 'params': {'lr': 0.001, 'betas': [0.9, 0.95], 'eps': '1e-8', 'weight_decay': '1e-4'}}, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}}]
[2024-09-09 16:58:07,260] [INFO] [RANK 0]   do_train ..................... True
[2024-09-09 16:58:07,260] [INFO] [RANK 0]   val_last_shape ............... []
[2024-09-09 16:58:07,260] [INFO] [RANK 0]   val_drop_number .............. 0
[2024-09-09 16:58:07,260] [INFO] [RANK 0]   do_valid ..................... True
[2024-09-09 16:58:07,260] [INFO] [RANK 0]   do_test ...................... False
[2024-09-09 16:58:07,260] [INFO] [RANK 0]   iteration .................... 0
[2024-09-09 16:58:56,248] [INFO] [checkpointing.py:541:forward] Activation Checkpointing Information
[2024-09-09 16:58:56,248] [INFO] [checkpointing.py:542:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-09-09 16:58:56,248] [INFO] [checkpointing.py:543:forward] ----contiguous Memory Checkpointing False with None total layers
[2024-09-09 16:58:56,248] [INFO] [checkpointing.py:545:forward] ----Synchronization False
[2024-09-09 16:58:56,248] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False
[2024-09-09 16:59:06,008] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-09 16:59:29,703] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-09-09 16:59:53,115] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-09-09 17:01:04,649] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-09-09 17:01:51,938] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
/root/CogVideo/sat/train_video.py:67: DeprecationWarning: torch.get_autocast_gpu_dtype() is deprecated. Please use torch.get_autocast_dtype('cuda') instead. (Triggered internally at ../torch/csrc/autograd/init.cpp:733.)
  "dtype": torch.get_autocast_gpu_dtype(),
/root/CogVideo/sat/train_video.py:70: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.no_grad(), torch.cuda.amp.autocast(**gpu_autocast_kwargs):
##############################  Sampling setting  ##############################
Sampler: VPSDEDPMPP2MSampler
Discretization: ZeroSNRDDPMDiscretization
Guider: DynamicCFG
Sampling with VPSDEDPMPP2MSampler for 51 steps:  98%|████████████████...████▉      | 50/51 [01:24<00:01,  1.69s/it]
[2024-09-09 17:04:19,474] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------------
[2024-09-09 17:04:19,474] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[2024-09-09 17:04:19,474] [INFO] [RANK 0]  validation loss at iteration 10 | loss: 1.002032E-01 | PPL: 1.105395E+00 loss 1.002032E-01 |
[2024-09-09 17:04:19,474] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[2024-09-09 17:05:49,038] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/CogVideo/sat/train_video.py", line 226, in <module>
[rank0]:     training_main(
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 157, in training_main
[rank0]:     iteration, skipped = train(model, optimizer,
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 359, in train
[rank0]:     lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
[rank0]:                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 443, in train_step
[rank0]:     forward_ret = forward_step(data_iterator, model, args, timers, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/CogVideo/sat/train_video.py", line 176, in forward_step
[rank0]:     batch = next(data_iterator)
[rank0]:             ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank0]:     data = self._next_data()
[rank0]:            ^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
[rank0]:     return self._process_data(data)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
[rank0]:     data.reraise()
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/_utils.py", line 706, in reraise
[rank0]:     raise exception
[rank0]: ZeroDivisionError: Caught ZeroDivisionError in DataLoader worker process 7.
[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank0]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]:             ~~~~~~~~~~~~^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 360, in __getitem__
[rank0]:     return self.wrapped_data[index]
[rank0]:            ~~~~~~~~~~~~~~~~~^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 342, in __getitem__
[rank0]:     return self.datasets[dataset_idx][sample_idx]
[rank0]:            ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
[rank0]:   File "/root/CogVideo/sat/data_video.py", line 411, in __getitem__
[rank0]:     indices = np.arange(start, end, (end - start) // num_frames).astype(int)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ZeroDivisionError: division by zero
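The ZeroDivisionError above is a separate problem from the overflow messages: in data_video.py the frame stride is computed as (end - start) // num_frames, so any clip whose usable range is shorter than num_frames (49 here, after skip_frms_num=3.0 is trimmed from both ends) produces an integer stride of 0, and np.arange raises on a zero step. Below is a minimal guard sketch, assuming start/end/num_frames mean what the traceback implies; the helper name sample_frame_indices is hypothetical, not part of the repo:

import numpy as np

def sample_frame_indices(start: int, end: int, num_frames: int) -> np.ndarray:
    # Return num_frames indices in [start, end), padding by repeating the
    # last available frame when the clip is shorter than num_frames.
    if end - start < num_frames:
        indices = np.arange(start, end)
        pad = np.full(num_frames - len(indices), max(end - 1, start))
        return np.concatenate([indices, pad]).astype(int)
    stride = (end - start) // num_frames  # guaranteed >= 1 on this branch
    return np.arange(start, end, stride).astype(int)[:num_frames]

In practice it may be simpler to filter out videos shorter than max_num_frames + 2 * skip_frms_num before training; either way the division by zero disappears.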

[3rd trial] Reduced train_micro_batch_size_per_gpu from 2 to 1 and re-ran the script:
(cogvideo) root@alphacode-ttv-a100-80g-gpu:~/CogVideo/sat# bash finetune_single_gpu.sh 
RUN on alphacode-ttv-a100-80g-gpu, CUDA_VISIBLE_DEVICES=0
WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1 python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed 27481
[2024-09-10 13:30:54,235] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
 [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @autocast_custom_fwd
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:66: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @autocast_custom_bwd
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/kornia/feature/lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
[2024-09-10 13:30:59,512] [INFO] using world size: 1
[2024-09-10 13:30:59,512] [INFO] Will override arguments with manually specified deepspeed_config!
[W910 13:30:59.341356778 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [ip6-localhost]:44107 (errno: 97 - Address family not supported by protocol).
[W910 13:30:59.342068481 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [alphacode-ttv-a100-80g-gpu]:44107 (errno: 97 - Address family not supported by protocol).
[2024-09-10 13:30:59,519] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-09-10 13:30:59,520] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-10 13:30:59,755] [INFO] [RANK 0] building SATVideoDiffusionEngine model ...
[2024-09-10 13:31:08,092] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-09-10 13:31:08,207] [INFO] [RANK 0] replacing layer 1 attention with lora
(... 27 similar lines elided: replacing layer 2 through layer 28 attention with lora ...)
[2024-09-10 13:31:09,705] [INFO] [RANK 0] replacing layer 29 attention with lora
Loading checkpoint shards: 100%|████████...████| 2/2 [00:01<00:00,  1.13it/s]
Initialized embedder #0: FrozenT5Embedder with 4762310656 params. Trainable: False
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
/root/CogVideo/sat/vae_modules/autoencoder.py:565: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  sd = torch.load(path, map_location="cpu")["state_dict"]
Deleting key loss.logvar from state_dict.
(... 118 similar lines elided: deleting the remaining loss.perceptual_loss.* and loss.discriminator.* keys from state_dict ...)
Missing keys:  []
Unexpected keys:  []
Restored from /root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt
[2024-09-10 13:31:15,450] [INFO] [RANK 0]  > number of parameters on model parallel rank 0: 6764790755
[2024-09-10 13:31:26,160] [INFO] [RANK 0] global rank 0 is loading checkpoint /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/model_io.py:286: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  sd = torch.load(checkpoint_name, map_location='cpu')
[2024-09-10 13:31:27,666] [INFO] [RANK 0] > successfully loaded /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt
[2024-09-10 13:31:28,191] [INFO] [RANK 0] ***** Total trainable parameters: 58982400 *****
[2024-09-10 13:31:28,191] [INFO] [RANK 0] [<class 'sat.ops.layernorm.LayerNorm'>, <class 'torch.nn.modules.normalization.LayerNorm'>, <class 'sat.ops.layernorm.RMSNorm'>] is set to no_weight_decay
[2024-09-10 13:31:28,194] [INFO] [RANK 0] Syncing initialized parameters...
[2024-09-10 13:31:28,302] [INFO] [RANK 0] Finished syncing initialized parameters.
[2024-09-10 13:31:28,302] [INFO] [RANK 0] Using optimizer sat.ops.FusedEmaAdam from sat.
[2024-09-10 13:31:28,302] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
[2024-09-10 13:31:28,303] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2024-09-10 13:31:28,390] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py312_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py312_cu121/fused_ema_adam/build.ninja...
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module fused_ema_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_ema_adam...
Time to load fused_ema_adam op: 0.7197697162628174 seconds
[2024-09-10 13:31:29,264] [INFO] [logging.py:96:log_dist] [Rank 0] Using client callable to create basic optimizer
[2024-09-10 13:31:29,264] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-09-10 13:31:29,284] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedEmaAdam
[2024-09-10 13:31:29,284] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedEmaAdam type=<class 'sat.ops.fused_ema_adam.FusedEmaAdam'>
[2024-09-10 13:31:29,284] [WARNING] [engine.py:1179:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2024-09-10 13:31:29,284] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2024-09-10 13:31:29,284] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 1000000000
[2024-09-10 13:31:29,284] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 1000000000
[2024-09-10 13:31:29,284] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-09-10 13:31:29,284] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[2024-09-10 13:31:31,672] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-09-10 13:31:31,673] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB         Max_MA 12.97 GB         CA 13.23 GB         Max_CA 13 GB 
[2024-09-10 13:31:31,673] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 28.67 GB, percent = 1.5%
[2024-09-10 13:31:31,880] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-09-10 13:31:31,880] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB         Max_MA 13.08 GB         CA 13.45 GB         Max_CA 13 GB 
[2024-09-10 13:31:31,880] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 28.41 GB, percent = 1.5%
[2024-09-10 13:31:31,880] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
[2024-09-10 13:31:32,107] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-09-10 13:31:32,107] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB         Max_MA 12.86 GB         CA 13.45 GB         Max_CA 13 GB 
[2024-09-10 13:31:32,107] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 28.58 GB, percent = 1.5%
[2024-09-10 13:31:32,111] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2024-09-10 13:31:32,112] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-09-10 13:31:32,112] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-09-10 13:31:32,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1.0], mom=[[0.9, 0.95]]
[2024-09-10 13:31:32,114] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2024-09-10 13:31:32,114] [INFO] [config.py:1001:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   amp_enabled .................. False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   amp_params ................... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   bfloat16_enabled ............. False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   bfloat16_immediate_grad_update  False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   checkpoint_parallel_write_pipeline  False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   checkpoint_tag_validation_enabled  True
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   checkpoint_tag_validation_fail  False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fa0f4c776e0>
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   communication_data_type ...... None
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   curriculum_enabled_legacy .... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   curriculum_params_legacy ..... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   data_efficiency_enabled ...... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   dataloader_drop_last ......... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   disable_allgather ............ False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   dump_state ................... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   dynamic_loss_scale_args ...... None
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   eigenvalue_enabled ........... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   eigenvalue_gas_boundary_resolution  1
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   eigenvalue_layer_num ......... 0
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   eigenvalue_max_iter .......... 100
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   eigenvalue_stability ......... 1e-06
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   eigenvalue_tol ............... 0.01
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   eigenvalue_verbose ........... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   elasticity_enabled ........... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   fp16_auto_cast ............... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print]   fp16_enabled ................. True
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   fp16_master_weights_and_gradients  False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   global_rank .................. 0
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   grad_accum_dtype ............. None
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   gradient_accumulation_steps .. 1
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   gradient_clipping ............ 0.1
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   gradient_predivide_factor .... 1.0
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   graph_harvesting ............. False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   initial_dynamic_scale ........ 65536
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   load_universal_checkpoint .... False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   loss_scale ................... 0
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   memory_breakdown ............. False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   mics_hierarchial_params_gather  False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   mics_shard_size .............. -1
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   optimizer_legacy_fusion ...... False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   optimizer_name ............... None
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   optimizer_params ............. None
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   pld_enabled .................. False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   pld_params ................... False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   prescale_gradients ........... False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   scheduler_name ............... None
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   scheduler_params ............. None
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   seq_parallel_communication_data_type  torch.float32
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   sparse_attention ............. None
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   sparse_gradients_enabled ..... False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   steps_per_print .............. 50
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   timers_config ................ enabled=True synchronized=True
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   train_batch_size ............. 1
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   train_micro_batch_size_per_gpu  1
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   use_data_before_expert_parallel_  False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   use_node_local_storage ....... False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   wall_clock_breakdown ......... False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   weight_quantization_config ... None
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   world_size ................... 1
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   zero_allow_untested_optimizer  True
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   zero_config .................. stage=2 contiguous_gradients=False reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=False elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   zero_enabled ................. True
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   zero_force_ds_cpu_optimizer .. True
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print]   zero_optimization_stage ...... 2
[2024-09-10 13:31:32,117] [INFO] [config.py:987:print_user_config]   json = {
    "train_micro_batch_size_per_gpu": 1, 
    "gradient_accumulation_steps": 1, 
    "steps_per_print": 50, 
    "gradient_clipping": 0.1, 
    "zero_optimization": {
        "stage": 2, 
        "cpu_offload": false, 
        "contiguous_gradients": false, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 1.000000e+09, 
        "allgather_bucket_size": 1.000000e+09, 
        "load_from_fp32_weights": false
    }, 
    "zero_allow_untested_optimizer": true, 
    "bf16": {
        "enabled": false
    }, 
    "fp16": {
        "enabled": true
    }, 
    "loss_scale": 0, 
    "loss_scale_window": 400, 
    "hysteresis": 2, 
    "min_loss_scale": 1, 
    "activation_checkpointing": {
        "partition_activations": false, 
        "contiguous_memory_optimization": false
    }, 
    "wall_clock_breakdown": false
}
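A note on what the fp16 block above implies: with "loss_scale": 0, DeepSpeed runs a dynamic loss scaler, which skips the step and (subject to the configured hysteresis) halves the scale whenever it sees Inf/NaN gradients, then raises it again only after loss_scale_window (400) consecutive clean steps. The OVERFLOW lines are therefore the scaler walking down powers of two; an illustrative Python sketch, not the actual loss_scaler.py code:

scale = 4294967296  # 2**32, the first attempted scale in this log
for _ in range(6):
    print(f"OVERFLOW! Attempted loss scale: {scale}, reducing to {scale // 2}")
    scale //= 2  # halved on each detected overflow, floored at min_loss_scale

A handful of these skips at the very start of fp16 training is normal; it signals real trouble only if steps keep being skipped and the scale collapses toward min_loss_scale=1.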
[2024-09-10 13:31:32,117] [INFO] [RANK 0] learning rate decaying style linear, ratio 10.0
[2024-09-10 13:31:32,117] [INFO] [RANK 0] Finetuning Model...
[2024-09-10 13:31:32,117] [INFO] [RANK 0] arguments:
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   base ......................... ['configs/cogvideox_2b_lora.yaml', 'configs/sft.yaml']
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   model_parallel_size .......... 1
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   force_pretrain ............... False
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   device ....................... 0
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   debug ........................ False
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   log_image .................... True
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   output_dir ................... samples
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   input_dir .................... None
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   input_type ................... cli
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   input_file ................... input.txt
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   final_size ................... 2048
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   sdedit ....................... False
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   grid_num_rows ................ 1
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   force_inference .............. False
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   lcm_steps .................... None
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   sampling_num_frames .......... 32
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   sampling_fps ................. 8
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   only_save_latents ............ False
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   only_log_video_latents ....... True
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   latent_channels .............. 32
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   image2video .................. False
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   experiment_name .............. lora-test-09-10-13-30
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   train_iters .................. 100
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   batch_size ................... 1
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   lr ........................... 0.001
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   mode ......................... finetune
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   seed ......................... 27481
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   zero_stage ................... 0
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   checkpoint_activations ....... True
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   checkpoint_num_layers ........ 1
[2024-09-10 13:31:32,117] [INFO] [RANK 0]   checkpoint_skip_layers ....... 0
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   fp16 ......................... True
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   bf16 ......................... False
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   gradient_accumulation_steps .. 1
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   profiling .................... -1
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   epochs ....................... None
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   log_interval ................. 20
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   summary_dir .................. 
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   save_args .................... False
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   lr_decay_iters ............... None
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   lr_decay_style ............... linear
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   lr_decay_ratio ............... 0.1
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   warmup ....................... 0.01
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   weight_decay ................. 0.0001
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   save ......................... ckpts_2b_lora/lora-test-09-10-13-30
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   load ......................... /root/CogVideo/CogVideoX-2b-sat/transformer
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   force_train .................. True
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   save_interval ................ 50
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   no_save_rng .................. False
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   no_load_rng .................. True
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   resume_dataloader ............ False
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   distributed_backend .......... nccl
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   local_rank ................... 0
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   exit_interval ................ None
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   wandb ........................ False
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   wandb_project_name ........... default_project
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   eval_batch_size .............. 1
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   eval_iters ................... 1
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   eval_interval ................ 10
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   strict_eval .................. False
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   train_data ................... ['/root/CogVideo/sat/datasets/test']
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   train_data_weights ........... None
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   iterable_dataset ............. False
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   iterable_dataset_eval ........ 
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   batch_from_same_dataset ...... False
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   valid_data ................... ['/root/CogVideo/sat/datasets/test']
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   test_data .................... None
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   split ........................ 1,0,0
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   num_workers .................. 8
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   block_size ................... 10000
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   prefetch_factor .............. 4
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   deepspeed .................... True
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   deepspeed_config ............. {'train_micro_batch_size_per_gpu': 1, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   deepscale .................... False
[2024-09-10 13:31:32,118] [INFO] [RANK 0]   deepscale_config ............. None
[2024-09-10 13:31:32,119] [INFO] [RANK 0]   model_config ................. {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False, 'num_layers': 30, 'hidden_size': 1920, 'num_attention_heads': 30, 'parallel_output': True}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}, 'dtype': 'fp16'}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': '/root/CogVideo/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': '/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': 
{'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}
[2024-09-10 13:31:32,119] [INFO] [RANK 0]   data_config .................. {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}
[2024-09-10 13:31:32,119] [INFO] [RANK 0]   cuda ......................... True
[2024-09-10 13:31:32,119] [INFO] [RANK 0]   rank ......................... 0
[2024-09-10 13:31:32,119] [INFO] [RANK 0]   world_size ................... 1
[2024-09-10 13:31:32,119] [INFO] [RANK 0]   deepspeed_activation_checkpointing  True
[2024-09-10 13:31:32,119] [INFO] [RANK 0]   master_ip .................... localhost
[2024-09-10 13:31:32,119] [INFO] [RANK 0]   master_port .................. 44107
[2024-09-10 13:31:32,119] [INFO] [RANK 0]   log_config ................... [{'model': {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': '/root/CogVideo/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': '/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 
50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}}, {'args': {'checkpoint_activations': True, 'model_parallel_size': 1, 'experiment_name': 'lora-test', 'mode': 'finetune', 'load': '/root/CogVideo/CogVideoX-2b-sat/transformer', 'no_load_rng': True, 'train_iters': 100, 'eval_iters': 1, 'eval_interval': 10, 'eval_batch_size': 1, 'save': 'ckpts_2b_lora', 'save_interval': 50, 'log_interval': 20, 'train_data': ['/root/CogVideo/sat/datasets/test'], 'valid_data': ['/root/CogVideo/sat/datasets/test'], 'split': '1,0,0', 'num_workers': 8, 'force_train': True, 'only_log_video_latents': True}, 'data': {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}, 'deepspeed': {'train_micro_batch_size_per_gpu': 1, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'optimizer': {'type': 'sat.ops.FusedEmaAdam', 'params': {'lr': 0.001, 'betas': [0.9, 0.95], 'eps': '1e-8', 'weight_decay': '1e-4'}}, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}}]
[2024-09-10 13:31:32,119] [INFO] [RANK 0]   do_train ..................... True
[2024-09-10 13:31:32,119] [INFO] [RANK 0]   val_last_shape ............... []
[2024-09-10 13:31:32,119] [INFO] [RANK 0]   val_drop_number .............. 0
[2024-09-10 13:31:32,119] [INFO] [RANK 0]   do_valid ..................... True
[2024-09-10 13:31:32,119] [INFO] [RANK 0]   do_test ...................... False
[2024-09-10 13:31:32,119] [INFO] [RANK 0]   iteration .................... 0
[2024-09-10 13:32:00,623] [INFO] [checkpointing.py:541:forward] Activation Checkpointing Information
[2024-09-10 13:32:00,623] [INFO] [checkpointing.py:542:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-09-10 13:32:00,623] [INFO] [checkpointing.py:543:forward] ----contiguous Memory Checkpointing False with None total layers
[2024-09-10 13:32:00,623] [INFO] [checkpointing.py:545:forward] ----Synchronization False
[2024-09-10 13:32:00,623] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False
[2024-09-10 13:32:06,525] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-10 13:32:15,902] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-09-10 13:32:24,779] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-09-10 13:32:33,800] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-09-10 13:32:43,291] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
[2024-09-10 13:33:28,030] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
/root/CogVideo/sat/train_video.py:67: DeprecationWarning: torch.get_autocast_gpu_dtype() is deprecated. Please use torch.get_autocast_dtype('cuda') instead. (Triggered internally at ../torch/csrc/autograd/init.cpp:733.)
  "dtype": torch.get_autocast_gpu_dtype(),
/root/CogVideo/sat/train_video.py:70: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.no_grad(), torch.cuda.amp.autocast(**gpu_autocast_kwargs):
##############################  Sampling setting  ##############################
Sampler: VPSDEDPMPP2MSampler
Discretization: ZeroSNRDDPMDiscretization
Guider: DynamicCFG
Sampling with VPSDEDPMPP2MSampler for 51 steps:  98%|█████████████████████████▏  | 50/51 [01:24<00:01,  1.70s/it]
[2024-09-10 13:34:59,554] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------------
[2024-09-10 13:34:59,555] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[2024-09-10 13:34:59,555] [INFO] [RANK 0]  validation loss at iteration 10 | loss: 1.391026E-01 | PPL: 1.149242E+00 loss 1.391026E-01 |
[2024-09-10 13:34:59,555] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[2024-09-10 13:35:16,965] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
[2024-09-10 13:36:28,365] [INFO] [RANK 0]  iteration       20/     100 | elapsed time per iteration (ms): 14758.9 | learning rate 5.000E-05 | total loss 1.892787E-01 | loss 1.892786E-01 | loss scale 33554432.0 |speed 4.07 samples/(min*GPU)
[2024-09-10 13:36:28,366] [INFO] [RANK 0] after 20 iterations memory (MB) | allocated: 13974.6455078125 | max allocated: 38562.94677734375 | cached: 18572.0 | max cached: 53186.0
[2024-09-10 13:36:28,367] [INFO] [RANK 0] time (ms) | forward: 4717.04 | backward: 5432.07 | allreduce: 0.00 | optimizer: 32.39 | data loader: 90.08
##############################  Sampling setting  ##############################
Sampler: VPSDEDPMPP2MSampler
Discretization: ZeroSNRDDPMDiscretization
Guider: DynamicCFG
Sampling with VPSDEDPMPP2MSampler for 51 steps:  98%|█████████████████████████▏  | 50/51 [01:24<00:01,  1.70s/it]
[2024-09-10 13:37:59,450] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------------
[2024-09-10 13:37:59,450] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[2024-09-10 13:37:59,450] [INFO] [RANK 0]  validation loss at iteration 20 | loss: 1.256772E-01 | PPL: 1.133916E+00 loss 1.256772E-01 |
[2024-09-10 13:37:59,450] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
##############################  Sampling setting  ##############################
Sampler: VPSDEDPMPP2MSampler
Discretization: ZeroSNRDDPMDiscretization
Guider: DynamicCFG
Sampling with VPSDEDPMPP2MSampler for 51 steps:  98%|█████████████████████████▏  | 50/51 [01:25<00:01,  1.70s/it]
[2024-09-10 13:40:59,756] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------------
[2024-09-10 13:40:59,756] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[2024-09-10 13:40:59,756] [INFO] [RANK 0]  validation loss at iteration 30 | loss: 2.129551E-01 | PPL: 1.237329E+00 loss 2.129551E-01 |
[2024-09-10 13:40:59,756] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/CogVideo/sat/train_video.py", line 226, in <module>
[rank0]:     training_main(
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 157, in training_main
[rank0]:     iteration, skipped = train(model, optimizer,
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 359, in train
[rank0]:     lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
[rank0]:                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 443, in train_step
[rank0]:     forward_ret = forward_step(data_iterator, model, args, timers, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/CogVideo/sat/train_video.py", line 176, in forward_step
[rank0]:     batch = next(data_iterator)
[rank0]:             ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank0]:     data = self._next_data()
[rank0]:            ^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
[rank0]:     return self._process_data(data)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
[rank0]:     data.reraise()
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/_utils.py", line 706, in reraise
[rank0]:     raise exception
[rank0]: ZeroDivisionError: Caught ZeroDivisionError in DataLoader worker process 6.
[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank0]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]:             ~~~~~~~~~~~~^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 360, in __getitem__
[rank0]:     return self.wrapped_data[index]
[rank0]:            ~~~~~~~~~~~~~~~~~^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 342, in __getitem__
[rank0]:     return self.datasets[dataset_idx][sample_idx]
[rank0]:            ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
[rank0]:   File "/root/CogVideo/sat/data_video.py", line 411, in __getitem__
[rank0]:     indices = np.arange(start, end, (end - start) // num_frames).astype(int)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ZeroDivisionError: division by zero

DONE on alphacode-ttv-a100-80g-gpu
(cogvideo) root@alphacode-ttv-a100-80g-gpu:~/CogVideo/sat# 

KihongK avatar Sep 09 '24 07:09 KihongK
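
A side note on the ZeroDivisionError that ends the log above: sat/data_video.py computes the frame-sampling stride as (end - start) // num_frames, which is 0 whenever a clip has fewer usable frames than num_frames (49 here, after skip_frms_num frames are trimmed). A minimal guard, keeping the variable names from the traceback (a sketch of one possible workaround, not the repository's actual fix):

import numpy as np

def sample_frame_indices(start: int, end: int, num_frames: int) -> np.ndarray:
    """Pick num_frames frame indices from [start, end).

    Variable names follow the traceback from sat/data_video.py; the
    short-clip fallback below is an illustrative guess, not the
    repository's official fix.
    """
    if end - start < num_frames:
        # Too few frames: the integer stride below would be 0 and
        # np.arange would raise the ZeroDivisionError seen in the log.
        # Spread indices over the clip instead, repeating frames as needed.
        return np.linspace(start, max(end - 1, start), num_frames).astype(int)
    stride = (end - start) // num_frames
    return np.arange(start, end, stride).astype(int)[:num_frames]

Equivalently, filtering out training videos shorter than max_num_frames (49) plus the trimmed skip_frms_num (3.0) avoids the crash without touching the loader.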

Same issue on an A100 80G. I tried both the 2B and 5B versions (fp16 & bf16) and reduced the lr from 1e-3 to 1e-5 (see https://github.com/THUDM/ChatGLM-6B/issues/1008), but got the same error.

AoqunJin avatar Sep 10 '24 01:09 AoqunJin
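
For reference, the knobs involved live in the deepspeed section that the training log prints at startup; here is that excerpt rewritten as a Python dict (the lr below is the lowered 1e-5 tried above, not the logged default of 1e-3):

# Excerpt of the deepspeed section from the training log above.
# Flipping the two `enabled` flags switches fp16 <-> bf16.
deepspeed_excerpt = {
    "fp16": {"enabled": True},   # set to False when enabling bf16
    "bf16": {"enabled": False},  # set to True on Ampere+ GPUs such as the A100
    "loss_scale": 0,             # 0 = dynamic loss scaling (fp16 only)
    "loss_scale_window": 400,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "optimizer": {
        "type": "sat.ops.FusedEmaAdam",
        # 1e-5 is the lowered value tried above; the logged default is 1e-3.
        "params": {"lr": 1e-5, "betas": [0.9, 0.95]},
    },
}

Since bf16 shares fp32's exponent range, DeepSpeed runs without a loss scaler in bf16 mode, so an error that persists there cannot be a loss-scale overflow.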

It is normal for steps to be skipped at the beginning of training, while the loss scale is still large and the dynamic scaler is searching for a workable value. You will typically see a small number of skipped steps within the first 50 iterations; once training stabilizes, it will not happen again.

tengjiayan20 avatar Sep 10 '24 16:09 tengjiayan20
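
For anyone wondering what those OVERFLOW lines actually do, here is the idea behind dynamic loss scaling in simplified Python (a sketch of the behaviour, not DeepSpeed's actual loss_scaler.py, which additionally implements the hysteresis option): every overflow halves the scale and skips the optimizer step, and loss_scale_window (400 in this config) consecutive clean steps double it again. The scale deliberately starts very high, so the burst of skipped steps right after iteration 0 is just the scaler searching downward, exactly what the logs in this thread show.

from dataclasses import dataclass

@dataclass
class ScalerState:
    scale: float = 2.0 ** 32  # high initial scale, as in the logs above
    min_scale: float = 1.0    # min_loss_scale from the config
    window: int = 400         # loss_scale_window from the config
    good_steps: int = 0

def update_scale(state: ScalerState, overflow: bool) -> bool:
    """Return True if the optimizer step was applied, False if skipped."""
    if overflow:
        # Gradients held inf/nan at this scale: skip the step and retry
        # at half the scale (never below min_scale).
        state.scale = max(state.scale / 2, state.min_scale)
        state.good_steps = 0
        return False
    state.good_steps += 1
    if state.good_steps >= state.window:
        # A long run of clean steps: probe a larger scale so that small
        # fp16 gradients do not flush to zero.
        state.scale *= 2
        state.good_steps = 0
    return True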

Yes, that is right, @tengjiayan20. It recovered after a few training steps:

[2024-09-11 17:52:11,320] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False
[2024-09-11 17:52:18,030] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-11 17:52:32,563] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-09-11 17:52:47,082] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-09-11 17:53:15,865] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-09-11 17:53:58,933] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
[2024-09-11 17:58:33,295] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
[2024-09-11 18:00:42,520] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
[2024-09-11 18:04:05,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=7, lr=[5e-05], mom=[[0.9, 0.95]]
[2024-09-11 18:07:56,739] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432, reducing to 16777216
[2024-09-11 18:16:06,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=8, lr=[5e-05], mom=[[0.9, 0.95]]
[2024-09-11 18:16:06,712] [INFO] [RANK 0]  iteration      100/   10000 | elapsed time per iteration (ms): 14623.5 | learning rate 5.000E-05 | total loss 1.992110E-01 | loss 1.992110E-01 | loss scale 16777216.0 |speed 8.21 samples/(min*GPU)
[2024-09-11 18:16:06,713] [INFO] [RANK 0] after 100 iterations memory (MB) | allocated: 13974.6455078125 | max allocated: 64453.90478515625 | cached: 22772.0 | max cached: 79914.0
[2024-09-11 18:16:06,713] [INFO] [RANK 0] time (ms) | forward: 9524.11 | backward: 5073.59 | allreduce: 0.00 | optimizer: 24.71 | data loader: 67.04

Thanks a lot.

AoqunJin avatar Sep 11 '24 10:09 AoqunJin