CogVideo
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
System Info / 系統信息
When I fine-tune CogVideoX-2B, I found that almost all of the steps are skipped, and the loss scale is very large.
Information / 问题信息
- [X] The official example scripts / 官方的示例脚本
- [ ] My own modified scripts / 我自己修改的脚本和任务
Reproduction / 复现过程
Just run:
#! /bin/bash
echo "RUN on `hostname`, CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
environs="WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1"
run_cmd="$environs python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
echo ${run_cmd}
eval ${run_cmd}
echo "DONE on `hostname`"
Expected behavior / 期待表现
Is this normal?
Nope, can you share the log?
(cogvideo) ubuntu@instance-butter:/data3/cx_workspace/CogV/CogVideo/sat$ bash finetune_single_gpu.sh
RUN on instance-butter, CUDA_VISIBLE_DEVICES=6
WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1 python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed 22338
[2024-09-03 02:36:29,937] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
def forward(ctx, input, weight, bias=None):
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
def backward(ctx, grad_output):
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/kornia/feature/lightglue.py:44: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
[2024-09-03 02:36:34,890] [INFO] using world size: 1
[2024-09-03 02:36:34,891] [INFO] Will override arguments with manually specified deepspeed_config!
[2024-09-03 02:36:34,893] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-09-03 02:36:34,894] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-03 02:36:34,922] [INFO] [RANK 0] building SATVideoDiffusionEngine model ...
[2024-09-03 02:36:44,771] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-09-03 02:36:44,904] [INFO] [RANK 0] replacing layer 1 attention with lora
[2024-09-03 02:36:45,021] [INFO] [RANK 0] replacing layer 2 attention with lora
[2024-09-03 02:36:45,131] [INFO] [RANK 0] replacing layer 3 attention with lora
[2024-09-03 02:36:45,241] [INFO] [RANK 0] replacing layer 4 attention with lora
[2024-09-03 02:36:45,352] [INFO] [RANK 0] replacing layer 5 attention with lora
[2024-09-03 02:36:45,466] [INFO] [RANK 0] replacing layer 6 attention with lora
[2024-09-03 02:36:45,575] [INFO] [RANK 0] replacing layer 7 attention with lora
[2024-09-03 02:36:45,687] [INFO] [RANK 0] replacing layer 8 attention with lora
[2024-09-03 02:36:45,851] [INFO] [RANK 0] replacing layer 9 attention with lora
[2024-09-03 02:36:45,957] [INFO] [RANK 0] replacing layer 10 attention with lora
[2024-09-03 02:36:46,063] [INFO] [RANK 0] replacing layer 11 attention with lora
[2024-09-03 02:36:46,173] [INFO] [RANK 0] replacing layer 12 attention with lora
[2024-09-03 02:36:46,280] [INFO] [RANK 0] replacing layer 13 attention with lora
[2024-09-03 02:36:46,387] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-09-03 02:36:46,495] [INFO] [RANK 0] replacing layer 15 attention with lora
[2024-09-03 02:36:46,606] [INFO] [RANK 0] replacing layer 16 attention with lora
[2024-09-03 02:36:46,761] [INFO] [RANK 0] replacing layer 17 attention with lora
[2024-09-03 02:36:46,901] [INFO] [RANK 0] replacing layer 18 attention with lora
[2024-09-03 02:36:47,044] [INFO] [RANK 0] replacing layer 19 attention with lora
[2024-09-03 02:36:47,171] [INFO] [RANK 0] replacing layer 20 attention with lora
[2024-09-03 02:36:47,291] [INFO] [RANK 0] replacing layer 21 attention with lora
[2024-09-03 02:36:47,397] [INFO] [RANK 0] replacing layer 22 attention with lora
[2024-09-03 02:36:47,506] [INFO] [RANK 0] replacing layer 23 attention with lora
[2024-09-03 02:36:47,610] [INFO] [RANK 0] replacing layer 24 attention with lora
[2024-09-03 02:36:47,774] [INFO] [RANK 0] replacing layer 25 attention with lora
[2024-09-03 02:36:47,881] [INFO] [RANK 0] replacing layer 26 attention with lora
[2024-09-03 02:36:47,986] [INFO] [RANK 0] replacing layer 27 attention with lora
[2024-09-03 02:36:48,095] [INFO] [RANK 0] replacing layer 28 attention with lora
[2024-09-03 02:36:48,208] [INFO] [RANK 0] replacing layer 29 attention with lora
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.28s/it]
Initialized embedder #0: FrozenT5Embedder with 4762310656 params. Trainable: False
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
/data3/cx_workspace/CogV/CogVideo/sat/vae_modules/autoencoder.py:565: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
sd = torch.load(path, map_location="cpu")["state_dict"]
Deleting key loss.logvar from state_dict.
Deleting key loss.perceptual_loss.scaling_layer.shift from state_dict.
Deleting key loss.perceptual_loss.scaling_layer.scale from state_dict.
Deleting key loss.perceptual_loss.net.slice1.0.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice1.0.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice1.2.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice1.2.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice2.5.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice2.5.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice2.7.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice2.7.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice3.10.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice3.10.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice3.12.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice3.12.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice3.14.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice3.14.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice4.17.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice4.17.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice4.19.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice4.19.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice4.21.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice4.21.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice5.24.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice5.24.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice5.26.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice5.26.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice5.28.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice5.28.bias from state_dict.
Deleting key loss.perceptual_loss.lin0.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin1.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin2.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin3.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin4.model.1.weight from state_dict.
Deleting key loss.discriminator.blocks.0.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.0.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.0.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.0.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.conv_res.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.conv_res.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.downsample.1.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.downsample.1.bias from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_q.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_kv.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_out.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.conv_res.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.conv_res.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.downsample.1.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.downsample.1.bias from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_q.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_kv.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_out.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.6.0.conv_res.weight from state_dict.
Deleting key loss.discriminator.blocks.6.0.conv_res.bias from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_q.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_kv.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_out.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.bias from state_dict.
Deleting key loss.discriminator.to_logits.0.weight from state_dict.
Deleting key loss.discriminator.to_logits.0.bias from state_dict.
Deleting key loss.discriminator.to_logits.3.weight from state_dict.
Deleting key loss.discriminator.to_logits.3.bias from state_dict.
Missing keys: []
Unexpected keys: []
Restored from CogVideoX-2b-sat/vae/3d-vae.pt
[2024-09-03 02:36:56,856] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 6764790755
[2024-09-03 02:37:15,810] [INFO] [RANK 0] global rank 0 is loading checkpoint CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/sat/training/model_io.py:286: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
sd = torch.load(checkpoint_name, map_location='cpu')
[2024-09-03 02:37:17,758] [INFO] [RANK 0] > successfully loaded CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt
[2024-09-03 02:37:18,437] [INFO] [RANK 0] ***** Total trainable parameters: 58982400 *****
[2024-09-03 02:37:18,437] [INFO] [RANK 0] [<class 'sat.ops.layernorm.LayerNorm'>, <class 'torch.nn.modules.normalization.LayerNorm'>, <class 'sat.ops.layernorm.RMSNorm'>] is set to no_weight_decay
[2024-09-03 02:37:18,440] [INFO] [RANK 0] Syncing initialized parameters...
[2024-09-03 02:37:18,503] [INFO] [RANK 0] Finished syncing initialized parameters.
[2024-09-03 02:37:18,503] [INFO] [RANK 0] Using optimizer sat.ops.FusedEmaAdam from sat.
[2024-09-03 02:37:18,503] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
[2024-09-03 02:37:18,503] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2024-09-03 02:37:18,646] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /data1/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /data1/.cache/torch_extensions/py310_cu121/fused_ema_adam/build.ninja...
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_ema_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_ema_adam...
Time to load fused_ema_adam op: 0.07278060913085938 seconds
[2024-09-03 02:37:18,724] [INFO] [logging.py:96:log_dist] [Rank 0] Using client callable to create basic optimizer
[2024-09-03 02:37:18,725] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-09-03 02:37:18,762] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedEmaAdam
[2024-09-03 02:37:18,763] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedEmaAdam type=<class 'sat.ops.fused_ema_adam.FusedEmaAdam'>
[2024-09-03 02:37:18,763] [WARNING] [engine.py:1179:do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2024-09-03 02:37:18,763] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2024-09-03 02:37:18,763] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 1000000000
[2024-09-03 02:37:18,763] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 1000000000
[2024-09-03 02:37:18,763] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-09-03 02:37:18,763] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[2024-09-03 02:37:23,295] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-09-03 02:37:23,295] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.97 GB CA 13.23 GB Max_CA 13 GB
[2024-09-03 02:37:23,295] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 478.67 GB, percent = 23.7%
[2024-09-03 02:37:23,814] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-09-03 02:37:23,814] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 13.08 GB CA 13.45 GB Max_CA 13 GB
[2024-09-03 02:37:23,814] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 483.06 GB, percent = 24.0%
[2024-09-03 02:37:23,815] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
[2024-09-03 02:37:24,129] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-09-03 02:37:24,130] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 13.45 GB Max_CA 13 GB
[2024-09-03 02:37:24,130] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 485.83 GB, percent = 24.1%
[2024-09-03 02:37:24,134] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2024-09-03 02:37:24,134] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-09-03 02:37:24,134] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-09-03 02:37:24,134] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1.0], mom=[[0.9, 0.95]]
[2024-09-03 02:37:24,137] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] amp_enabled .................. False
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] amp_params ................... False
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] bfloat16_enabled ............. False
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x15507917fbb0>
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] communication_data_type ...... None
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] dataloader_drop_last ......... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] disable_allgather ............ False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] dump_state ................... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... None
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] elasticity_enabled ........... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] fp16_auto_cast ............... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] fp16_enabled ................. True
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] global_rank .................. 0
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] grad_accum_dtype ............. None
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 1
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] gradient_clipping ............ 0.1
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] graph_harvesting ............. False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 65536
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] load_universal_checkpoint .... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] loss_scale ................... 0
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] memory_breakdown ............. False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] mics_shard_size .............. -1
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] optimizer_name ............... None
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] optimizer_params ............. None
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] pld_enabled .................. False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] pld_params ................... False
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] prescale_gradients ........... False
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] scheduler_name ............... None
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] scheduler_params ............. None
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] sparse_attention ............. None
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] steps_per_print .............. 50
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] train_batch_size ............. 2
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 2
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] use_data_before_expert_parallel False
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] use_node_local_storage ....... False
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] wall_clock_breakdown ......... False
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] weight_quantization_config ... None
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] world_size ................... 1
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_allow_untested_optimizer True
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_config .................. stage=2 contiguous_gradients=False reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=False elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_enabled ................. True
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_optimization_stage ...... 2
[2024-09-03 02:37:24,139] [INFO] [config.py:987:print_user_config] json = {
"train_micro_batch_size_per_gpu": 2,
"gradient_accumulation_steps": 1,
"steps_per_print": 50,
"gradient_clipping": 0.1,
"zero_optimization": {
"stage": 2,
"cpu_offload": false,
"contiguous_gradients": false,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 1.000000e+09,
"allgather_bucket_size": 1.000000e+09,
"load_from_fp32_weights": false
},
"zero_allow_untested_optimizer": true,
"bf16": {
"enabled": false
},
"fp16": {
"enabled": true
},
"loss_scale": 0,
"loss_scale_window": 400,
"hysteresis": 2,
"min_loss_scale": 1,
"activation_checkpointing": {
"partition_activations": false,
"contiguous_memory_optimization": false
},
"wall_clock_breakdown": false
}
[2024-09-03 02:37:24,139] [INFO] [RANK 0] learning rate decaying style linear, ratio 10.0
[2024-09-03 02:37:24,139] [INFO] [RANK 0] Finetuning Model...
[2024-09-03 02:37:24,139] [INFO] [RANK 0] arguments:
[2024-09-03 02:37:24,139] [INFO] [RANK 0] base ......................... ['configs/cogvideox_2b_lora.yaml', 'configs/sft.yaml']
[2024-09-03 02:37:24,139] [INFO] [RANK 0] model_parallel_size .......... 1
[2024-09-03 02:37:24,139] [INFO] [RANK 0] force_pretrain ............... False
[2024-09-03 02:37:24,139] [INFO] [RANK 0] device ....................... 0
[2024-09-03 02:37:24,139] [INFO] [RANK 0] debug ........................ False
[2024-09-03 02:37:24,139] [INFO] [RANK 0] log_image .................... True
[2024-09-03 02:37:24,139] [INFO] [RANK 0] output_dir ................... samples
[2024-09-03 02:37:24,139] [INFO] [RANK 0] input_dir .................... None
[2024-09-03 02:37:24,139] [INFO] [RANK 0] input_type ................... cli
[2024-09-03 02:37:24,139] [INFO] [RANK 0] input_file ................... input.txt
[2024-09-03 02:37:24,139] [INFO] [RANK 0] final_size ................... 2048
[2024-09-03 02:37:24,140] [INFO] [RANK 0] sdedit ....................... False
[2024-09-03 02:37:24,140] [INFO] [RANK 0] grid_num_rows ................ 1
[2024-09-03 02:37:24,140] [INFO] [RANK 0] force_inference .............. False
[2024-09-03 02:37:24,140] [INFO] [RANK 0] lcm_steps .................... None
[2024-09-03 02:37:24,140] [INFO] [RANK 0] sampling_num_frames .......... 32
[2024-09-03 02:37:24,140] [INFO] [RANK 0] sampling_fps ................. 8
[2024-09-03 02:37:24,140] [INFO] [RANK 0] only_save_latents ............ False
[2024-09-03 02:37:24,140] [INFO] [RANK 0] only_log_video_latents ....... True
[2024-09-03 02:37:24,140] [INFO] [RANK 0] latent_channels .............. 32
[2024-09-03 02:37:24,140] [INFO] [RANK 0] image2video .................. False
[2024-09-03 02:37:24,140] [INFO] [RANK 0] experiment_name .............. example_data-09-03-02-36
[2024-09-03 02:37:24,140] [INFO] [RANK 0] train_iters .................. 1000
[2024-09-03 02:37:24,140] [INFO] [RANK 0] batch_size ................... 2
[2024-09-03 02:37:24,140] [INFO] [RANK 0] lr ........................... 0.001
[2024-09-03 02:37:24,140] [INFO] [RANK 0] mode ......................... finetune
[2024-09-03 02:37:24,140] [INFO] [RANK 0] seed ......................... 22338
[2024-09-03 02:37:24,140] [INFO] [RANK 0] zero_stage ................... 0
[2024-09-03 02:37:24,140] [INFO] [RANK 0] checkpoint_activations ....... True
[2024-09-03 02:37:24,140] [INFO] [RANK 0] checkpoint_num_layers ........ 1
[2024-09-03 02:37:24,140] [INFO] [RANK 0] checkpoint_skip_layers ....... 0
[2024-09-03 02:37:24,140] [INFO] [RANK 0] fp16 ......................... True
[2024-09-03 02:37:24,140] [INFO] [RANK 0] bf16 ......................... False
[2024-09-03 02:37:24,140] [INFO] [RANK 0] gradient_accumulation_steps .. 1
[2024-09-03 02:37:24,140] [INFO] [RANK 0] profiling .................... -1
[2024-09-03 02:37:24,140] [INFO] [RANK 0] epochs ....................... None
[2024-09-03 02:37:24,140] [INFO] [RANK 0] log_interval ................. 20
[2024-09-03 02:37:24,140] [INFO] [RANK 0] summary_dir ..................
[2024-09-03 02:37:24,140] [INFO] [RANK 0] save_args .................... False
[2024-09-03 02:37:24,140] [INFO] [RANK 0] lr_decay_iters ............... None
[2024-09-03 02:37:24,140] [INFO] [RANK 0] lr_decay_style ............... linear
[2024-09-03 02:37:24,140] [INFO] [RANK 0] lr_decay_ratio ............... 0.1
[2024-09-03 02:37:24,140] [INFO] [RANK 0] warmup ....................... 0.01
[2024-09-03 02:37:24,140] [INFO] [RANK 0] weight_decay ................. 0.0001
[2024-09-03 02:37:24,140] [INFO] [RANK 0] save ......................... ckpts_2b/example_data-09-03-02-36
[2024-09-03 02:37:24,140] [INFO] [RANK 0] load ......................... CogVideoX-2b-sat/transformer
[2024-09-03 02:37:24,140] [INFO] [RANK 0] force_train .................. True
[2024-09-03 02:37:24,140] [INFO] [RANK 0] save_interval ................ 500
[2024-09-03 02:37:24,140] [INFO] [RANK 0] no_save_rng .................. False
[2024-09-03 02:37:24,140] [INFO] [RANK 0] no_load_rng .................. True
[2024-09-03 02:37:24,140] [INFO] [RANK 0] resume_dataloader ............ False
[2024-09-03 02:37:24,141] [INFO] [RANK 0] distributed_backend .......... nccl
[2024-09-03 02:37:24,141] [INFO] [RANK 0] local_rank ................... 0
[2024-09-03 02:37:24,141] [INFO] [RANK 0] exit_interval ................ None
[2024-09-03 02:37:24,141] [INFO] [RANK 0] wandb ........................ False
[2024-09-03 02:37:24,141] [INFO] [RANK 0] wandb_project_name ........... default_project
[2024-09-03 02:37:24,141] [INFO] [RANK 0] eval_batch_size .............. 1
[2024-09-03 02:37:24,141] [INFO] [RANK 0] eval_iters ................... 1
[2024-09-03 02:37:24,141] [INFO] [RANK 0] eval_interval ................ 100
[2024-09-03 02:37:24,141] [INFO] [RANK 0] strict_eval .................. False
[2024-09-03 02:37:24,141] [INFO] [RANK 0] train_data ................... ['toy_data']
[2024-09-03 02:37:24,141] [INFO] [RANK 0] train_data_weights ........... None
[2024-09-03 02:37:24,141] [INFO] [RANK 0] iterable_dataset ............. False
[2024-09-03 02:37:24,141] [INFO] [RANK 0] iterable_dataset_eval ........
[2024-09-03 02:37:24,141] [INFO] [RANK 0] batch_from_same_dataset ...... False
[2024-09-03 02:37:24,141] [INFO] [RANK 0] valid_data ................... ['toy_data']
[2024-09-03 02:37:24,141] [INFO] [RANK 0] test_data .................... None
[2024-09-03 02:37:24,141] [INFO] [RANK 0] split ........................ 1,0,0
[2024-09-03 02:37:24,141] [INFO] [RANK 0] num_workers .................. 8
[2024-09-03 02:37:24,141] [INFO] [RANK 0] block_size ................... 10000
[2024-09-03 02:37:24,141] [INFO] [RANK 0] prefetch_factor .............. 4
[2024-09-03 02:37:24,141] [INFO] [RANK 0] deepspeed .................... True
[2024-09-03 02:37:24,141] [INFO] [RANK 0] deepspeed_config ............. {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}
[2024-09-03 02:37:24,141] [INFO] [RANK 0] deepscale .................... False
[2024-09-03 02:37:24,141] [INFO] [RANK 0] deepscale_config ............. None
[2024-09-03 02:37:24,141] [INFO] [RANK 0] model_config ................. {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False, 'num_layers': 30, 'hidden_size': 1920, 'num_attention_heads': 30, 'parallel_output': True}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}, 'dtype': 'fp16'}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': 'CogVideoX-2b-sat/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': 'CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 
'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}
[2024-09-03 02:37:24,141] [INFO] [RANK 0] data_config .................. {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}
[2024-09-03 02:37:24,141] [INFO] [RANK 0] cuda ......................... True
[2024-09-03 02:37:24,142] [INFO] [RANK 0] rank ......................... 0
[2024-09-03 02:37:24,142] [INFO] [RANK 0] world_size ................... 1
[2024-09-03 02:37:24,142] [INFO] [RANK 0] deepspeed_activation_checkpointing True
[2024-09-03 02:37:24,142] [INFO] [RANK 0] master_ip .................... localhost
[2024-09-03 02:37:24,142] [INFO] [RANK 0] master_port .................. 38137
[2024-09-03 02:37:24,142] [INFO] [RANK 0] log_config ................... [{'model': {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': 'CogVideoX-2b-sat/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': 'CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': 
True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}}, {'args': {'checkpoint_activations': True, 'model_parallel_size': 1, 'experiment_name': 'example_data', 'mode': 'finetune', 'load': 'CogVideoX-2b-sat/transformer', 'no_load_rng': True, 'train_iters': 1000, 'eval_iters': 1, 'eval_interval': 100, 'eval_batch_size': 1, 'save': 'ckpts_2b', 'save_interval': 500, 'log_interval': 20, 'train_data': ['toy_data'], 'valid_data': ['toy_data'], 'split': '1,0,0', 'num_workers': 8, 'force_train': True, 'only_log_video_latents': True}, 'data': {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}, 'deepspeed': {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'optimizer': {'type': 'sat.ops.FusedEmaAdam', 'params': {'lr': 0.001, 'betas': [0.9, 0.95], 'eps': '1e-8', 'weight_decay': '1e-4'}}, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}}]
[2024-09-03 02:37:24,142] [INFO] [RANK 0] do_train ..................... True
[2024-09-03 02:37:24,142] [INFO] [RANK 0] val_last_shape ............... []
[2024-09-03 02:37:24,142] [INFO] [RANK 0] val_drop_number .............. 0
[2024-09-03 02:37:24,142] [INFO] [RANK 0] do_valid ..................... True
[2024-09-03 02:37:24,142] [INFO] [RANK 0] do_test ...................... False
[2024-09-03 02:37:24,142] [INFO] [RANK 0] iteration .................... 0
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:541:forward] Activation Checkpointing Information
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:542:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:543:forward] ----contiguous Memory Checkpointing False with None total layers
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:545:forward] ----Synchronization False
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False
[2024-09-03 02:38:17,054] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-03 02:38:39,582] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-09-03 02:39:01,823] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-09-03 02:39:25,772] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-09-03 02:40:34,024] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
[2024-09-03 02:42:31,555] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
[2024-09-03 02:43:17,654] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
[2024-09-03 02:45:38,242] [INFO] [RANK 0] iteration 20/ 1000 | elapsed time per iteration (ms): 24611.4 | learning rate 5.000E-05 | total loss 2.157213E-01 | loss 2.157214E-01 | loss scale 33554432.0 |speed 4.88 samples/(min*GPU)
[2024-09-03 02:45:38,244] [INFO] [RANK 0] after 20 iterations memory (MB) | allocated: 13974.6455078125 | max allocated: 64453.90478515625 | cached: 22772.0 | max cached: 76136.0
[2024-09-03 02:45:38,244] [INFO] [RANK 0] time (ms) | forward: 15575.33 | backward: 8930.58 | allreduce: 0.00 | optimizer: 101.48 | data loader: 19.31
[2024-09-03 02:53:15,005] [INFO] [RANK 0] iteration 40/ 1000 | elapsed time per iteration (ms): 22838.1 | learning rate 5.000E-05 | total loss 2.180460E-01 | loss 2.180460E-01 | loss scale 33554432.0 |speed 5.25 samples/(min*GPU)
[2024-09-03 02:53:15,006] [INFO] [RANK 0] time (ms) | forward: 13797.73 | backward: 8961.48 | allreduce: 0.00 | optimizer: 74.96 | data loader: 0.37
[2024-09-03 02:54:01,688] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432, reducing to 16777216
[2024-09-03 02:57:02,117] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=8, lr=[5e-05], mom=[[0.9, 0.95]]
[2024-09-03 02:36:34,922] [INFO] [RANK 0] building SATVideoDiffusionEngine model ...
[2024-09-03 02:36:44,771] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-09-03 02:36:44,904] [INFO] [RANK 0] replacing layer 1 attention with lora
[2024-09-03 02:36:45,021] [INFO] [RANK 0] replacing layer 2 attention with lora
[2024-09-03 02:36:45,131] [INFO] [RANK 0] replacing layer 3 attention with lora
[2024-09-03 02:36:45,241] [INFO] [RANK 0] replacing layer 4 attention with lora
[2024-09-03 02:36:45,352] [INFO] [RANK 0] replacing layer 5 attention with lora
[2024-09-03 02:36:45,466] [INFO] [RANK 0] replacing layer 6 attention with lora
[2024-09-03 02:36:45,575] [INFO] [RANK 0] replacing layer 7 attention with lora
[2024-09-03 02:36:45,687] [INFO] [RANK 0] replacing layer 8 attention with lora
[2024-09-03 02:36:45,851] [INFO] [RANK 0] replacing layer 9 attention with lora
[2024-09-03 02:36:45,957] [INFO] [RANK 0] replacing layer 10 attention with lora
[2024-09-03 02:36:46,063] [INFO] [RANK 0] replacing layer 11 attention with lora
[2024-09-03 02:36:46,173] [INFO] [RANK 0] replacing layer 12 attention with lora
[2024-09-03 02:36:46,280] [INFO] [RANK 0] replacing layer 13 attention with lora
[2024-09-03 02:36:46,387] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-09-03 02:36:46,495] [INFO] [RANK 0] replacing l
[2024-09-03 03:00:52,438] [INFO] [RANK 0] iteration 60/ 1000 | elapsed time per iteration (ms): 22871.6 | learning rate 5.000E-05 | total loss 2.016617E-01 | loss 2.016617E-01 | loss scale 16777216.0 |speed 5.25 samples/(min*GPU)
[2024-09-03 03:00:52,438] [INFO] [RANK 0] time (ms) | forward: 13888.08 | backward: 8902.19 | allreduce: 0.00 | optimizer: 77.79 | data loader: 0.66
What I ran is finetune_single_gpu.sh:
export CUDA_VISIBLE_DEVICES=6
echo "RUN on `hostname`, CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
environs="WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1"
run_cmd="$environs python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
echo ${run_cmd}
eval ${run_cmd}
echo "DONE on `hostname`"
I keep running into these huge loss scales like the ones below, and the log says the step is being skipped. Is this normal?
[2024-09-03 02:38:17,054] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-03 02:38:39,582] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-09-03 02:39:01,823] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-09-03 02:39:25,772] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-09-03 02:40:34,024] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
[2024-09-03 02:42:31,555] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
[2024-09-03 02:43:17,654] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
No, this is not normal, and that scale is also abnormal. How large is your dataset?
So it looks like there are no errors at all, just every step being skipped?
Same behavior here on 4xA100 device 👀
same on 8*A800
Did everyone skip all the steps? @kyrie111 @TianxingWu Skipping the first few steps and then continuing with normal training and a decreasing loss is normal behavior; the first few steps are skipped because the loss scale is indeed too large at the start.
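For reference, the dynamic loss scaler producing these OVERFLOW! messages behaves roughly like the minimal sketch below (a paraphrase for intuition only, not the actual DeepSpeed loss_scaler.py; the hysteresis logic is omitted). Here scale_window and min_scale correspond to the loss_scale_window and min_loss_scale keys in sft.yaml:

# Minimal sketch of DeepSpeed-style dynamic loss scaling, for intuition only.
# On an inf/nan overflow the optimizer step is skipped and the scale is halved;
# after `scale_window` consecutive clean steps the scale is doubled again.
class DynamicLossScalerSketch:
    def __init__(self, init_scale: float = 2.0**32, scale_window: int = 400, min_scale: float = 1.0):
        self.cur_scale = init_scale       # starts very large, e.g. 2**32 in the logs above
        self.scale_window = scale_window  # maps to loss_scale_window in sft.yaml
        self.min_scale = min_scale        # maps to min_loss_scale
        self.good_steps = 0               # clean steps since the last overflow

    def update_scale(self, has_overflow: bool) -> bool:
        """Return True when the optimizer step should be skipped."""
        if has_overflow:
            # Inf/NaN in the scaled gradients: skip the step, halve the scale.
            self.cur_scale = max(self.cur_scale / 2.0, self.min_scale)
            self.good_steps = 0
            return True  # "OVERFLOW! ... Skipping step" in the log
        self.good_steps += 1
        if self.good_steps % self.scale_window == 0:
            self.cur_scale *= 2.0  # stable for a full window, so try a larger scale
        return False

Starting from 2**32, the scaler needs several skipped steps at the very beginning just to halve its way down to a scale whose fp16 gradients no longer overflow, so a short burst of skips early on is expected; skipping almost every step for the whole run is not.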
My issue is solved, see #261.
Same issue on 8 x A100 80G (tried both single-GPU and 8-GPU runs).
I only tried the 2B model.
sft.yaml
args:
  checkpoint_activations: True ## using gradient checkpointing
  model_parallel_size: 1
  experiment_name: lora-test
  mode: finetune
  load: "/root/CogVideo/CogVideoX-2b-sat/transformer"
  no_load_rng: True
  train_iters: 100 # Suggest more than 1000 for LoRA; for SFT 500 is enough
  eval_iters: 1
  eval_interval: 10
  eval_batch_size: 1
  save: ckpts_2b_lora
  save_interval: 50
  log_interval: 20
  train_data: [ "/root/CogVideo/sat/datasets/test" ] # Train data path
  valid_data: [ "/root/CogVideo/sat/datasets/test" ] # Validation data path, can be the same as train_data (not recommended)
  split: 1,0,0
  num_workers: 8
  force_train: True
  only_log_video_latents: True

data:
  target: data_video.SFTDataset
  params:
    video_size: [ 480, 720 ]
    fps: 8
    max_num_frames: 49
    skip_frms_num: 3.

deepspeed:
  # Minimum of 16 videos per batch across ALL GPUs; this setting is for 8 x A100 GPUs
  train_micro_batch_size_per_gpu: 2
  gradient_accumulation_steps: 1
  steps_per_print: 50
  gradient_clipping: 0.1
  zero_optimization:
    stage: 2
    cpu_offload: false
    contiguous_gradients: false
    overlap_comm: true
    reduce_scatter: true
    reduce_bucket_size: 1000000000
    allgather_bucket_size: 1000000000
    load_from_fp32_weights: false
  zero_allow_untested_optimizer: true
  bf16:
    enabled: False # For CogVideoX-2B set to False; for CogVideoX-5B set to True
  fp16:
    enabled: True # For CogVideoX-2B set to True; for CogVideoX-5B set to False
  loss_scale: 0
  loss_scale_window: 400
  hysteresis: 2
  min_loss_scale: 1
  optimizer:
    type: sat.ops.FusedEmaAdam
    params:
      lr: 0.001 # Between 1E-3 and 5E-4 for LoRA; 1E-5 for SFT
      betas: [ 0.9, 0.95 ]
      eps: 1e-8
      weight_decay: 1e-4
  activation_checkpointing:
    partition_activations: false
    contiguous_memory_optimization: false
  wall_clock_breakdown: false
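A side note on the fp16 settings above: with loss_scale: 0, DeepSpeed uses dynamic loss scaling, and its documented fp16 schema also accepts an initial_scale_power key (initial scale = 2**initial_scale_power, default 16). If the scaler keeps starting absurdly high, one untested workaround is to pin the starting point lower, roughly as below; note that the snippet nests the loss-scale keys inside fp16 as in DeepSpeed's own docs, and whether SAT forwards this nested key unchanged is an assumption, not something confirmed in this thread.

deepspeed:
  fp16:
    enabled: True
    loss_scale: 0              # 0 keeps dynamic loss scaling enabled
    initial_scale_power: 16    # start at 2**16 = 65536 instead of e.g. 2**32
    loss_scale_window: 400
    hysteresis: 2
    min_loss_scale: 1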
cogvideox_2b.yaml
model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  log_keys:
    - txt
  denoiser_config:
    target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
    params:
      num_idx: 1000
      quantize_c_noise: False
      weighting_config:
        target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
      scaling_config:
        target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0
  network_config:
    target: dit_video_concat.DiffusionTransformer
    params:
      time_embed_dim: 512
      elementwise_affine: True
      num_frames: 49
      time_compressed_rate: 4
      latent_width: 90
      latent_height: 60
      num_layers: 30
      patch_size: 2
      in_channels: 16
      out_channels: 16
      hidden_size: 1920
      adm_in_channels: 256
      num_attention_heads: 30
      transformer_args:
        checkpoint_activations: True ## using gradient checkpointing
        vocab_size: 1
        max_sequence_length: 64
        layernorm_order: pre
        skip_init: false
        model_parallel_size: 1
        is_decoder: false
      modules:
        pos_embed_config:
          target: dit_video_concat.Basic3DPositionEmbeddingMixin
          params:
            text_length: 226
            height_interpolation: 1.875
            width_interpolation: 1.875
        patch_embed_config:
          target: dit_video_concat.ImagePatchEmbeddingMixin
          params:
            text_hidden_size: 4096
        adaln_layer_config:
          target: dit_video_concat.AdaLNMixin
          params:
            qk_ln: True
        final_layer_config:
          target: dit_video_concat.FinalLayerMixin
  conditioner_config:
    target: sgm.modules.GeneralConditioner
    params:
      emb_models:
        - is_trainable: false
          input_key: txt
          ucg_rate: 0.1
          target: sgm.modules.encoders.modules.FrozenT5Embedder
          params:
            model_dir: "/root/CogVideo/t5-v1_1-xxl"
            max_length: 226
  first_stage_config:
    target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
    params:
      cp_size: 1
      ckpt_path: "/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt"
      ignore_keys: [ 'loss' ]
      loss_config:
        target: torch.nn.Identity
      regularizer_config:
        target: vae_modules.regularizers.DiagonalGaussianRegularizer
      encoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
        params:
          double_z: true
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [ 1, 2, 2, 4 ]
          attn_resolutions: [ ]
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: True
      decoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
        params:
          double_z: True
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [ 1, 2, 2, 4 ]
          attn_resolutions: [ ]
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: False
  loss_fn_config:
    target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
    params:
      offset_noise_level: 0
      sigma_sampler_config:
        target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
        params:
          uniform_sampling: True
          num_idx: 1000
          discretization_config:
            target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
            params:
              shift_scale: 3.0
  sampler_config:
    target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
    params:
      num_steps: 50
      verbose: True
      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0
      guider_config:
        target: sgm.modules.diffusionmodules.guiders.DynamicCFG
        params:
          scale: 6
          exp: 5
          num_steps: 50
[1st Trial] finetune_single_gpu.sh
RUN on alphacode-ttv-a100-80g-gpu, CUDA_VISIBLE_DEVICES=
WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1 python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed 21247
[2024-09-09 16:39:11,302] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@autocast_custom_fwd
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:66: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@autocast_custom_bwd
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/kornia/feature/lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
[2024-09-09 16:39:16,571] [INFO] using world size: 1
[2024-09-09 16:39:16,571] [INFO] Will override arguments with manually specified deepspeed_config!
[W909 16:39:16.412494279 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [ip6-localhost]:39375 (errno: 97 - Address family not supported by protocol).
[W909 16:39:16.413593009 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [alphacode-ttv-a100-80g-gpu]:39375 (errno: 97 - Address family not supported by protocol).
[2024-09-09 16:39:16,591] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-09-09 16:39:16,592] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-09 16:39:16,869] [INFO] [RANK 0] building SATVideoDiffusionEngine model ...
[2024-09-09 16:39:26,340] [WARNING] [RANK 0] Failed to load bitsandbytes:No module named 'bitsandbytes'
[2024-09-09 16:39:26,340] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-09-09 16:39:26,364] [INFO] [RANK 0] replacing layer 1 attention with lora
[2024-09-09 16:39:26,387] [INFO] [RANK 0] replacing layer 2 attention with lora
[2024-09-09 16:39:26,411] [INFO] [RANK 0] replacing layer 3 attention with lora
[2024-09-09 16:39:26,487] [INFO] [RANK 0] replacing layer 4 attention with lora
[2024-09-09 16:39:26,518] [INFO] [RANK 0] replacing layer 5 attention with lora
[2024-09-09 16:39:26,542] [INFO] [RANK 0] replacing layer 6 attention with lora
[2024-09-09 16:39:26,567] [INFO] [RANK 0] replacing layer 7 attention with lora
[2024-09-09 16:39:26,591] [INFO] [RANK 0] replacing layer 8 attention with lora
[2024-09-09 16:39:26,621] [INFO] [RANK 0] replacing layer 9 attention with lora
[2024-09-09 16:39:26,726] [INFO] [RANK 0] replacing layer 10 attention with lora
[2024-09-09 16:39:26,870] [INFO] [RANK 0] replacing layer 11 attention with lora
[2024-09-09 16:39:26,999] [INFO] [RANK 0] replacing layer 12 attention with lora
[2024-09-09 16:39:27,074] [INFO] [RANK 0] replacing layer 13 attention with lora
[2024-09-09 16:39:27,127] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-09-09 16:39:27,206] [INFO] [RANK 0] replacing layer 15 attention with lora
[2024-09-09 16:39:27,294] [INFO] [RANK 0] replacing layer 16 attention with lora
[2024-09-09 16:39:27,379] [INFO] [RANK 0] replacing layer 17 attention with lora
[2024-09-09 16:39:27,446] [INFO] [RANK 0] replacing layer 18 attention with lora
[2024-09-09 16:39:27,528] [INFO] [RANK 0] replacing layer 19 attention with lora
[2024-09-09 16:39:27,642] [INFO] [RANK 0] replacing layer 20 attention with lora
[2024-09-09 16:39:27,715] [INFO] [RANK 0] replacing layer 21 attention with lora
[2024-09-09 16:39:27,794] [INFO] [RANK 0] replacing layer 22 attention with lora
[2024-09-09 16:39:27,854] [INFO] [RANK 0] replacing layer 23 attention with lora
[2024-09-09 16:39:27,930] [INFO] [RANK 0] replacing layer 24 attention with lora
[2024-09-09 16:39:27,960] [INFO] [RANK 0] replacing layer 25 attention with lora
[2024-09-09 16:39:27,982] [INFO] [RANK 0] replacing layer 26 attention with lora
[2024-09-09 16:39:28,004] [INFO] [RANK 0] replacing layer 27 attention with lora
[2024-09-09 16:39:28,026] [INFO] [RANK 0] replacing layer 28 attention with lora
[2024-09-09 16:39:28,048] [INFO] [RANK 0] replacing layer 29 attention with lora
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.13it/s]
Initialized embedder #0: FrozenT5Embedder with 4762310656 params. Trainable: False
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
/root/CogVideo/sat/vae_modules/autoencoder.py:565: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
sd = torch.load(path, map_location="cpu")["state_dict"]
Deleting key loss.logvar from state_dict.
Deleting key loss.perceptual_loss.scaling_layer.shift from state_dict.
Deleting key loss.perceptual_loss.scaling_layer.scale from state_dict.
Deleting key loss.perceptual_loss.net.slice1.0.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice1.0.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice1.2.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice1.2.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice2.5.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice2.5.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice2.7.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice2.7.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice3.10.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice3.10.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice3.12.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice3.12.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice3.14.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice3.14.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice4.17.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice4.17.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice4.19.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice4.19.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice4.21.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice4.21.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice5.24.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice5.24.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice5.26.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice5.26.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice5.28.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice5.28.bias from state_dict.
Deleting key loss.perceptual_loss.lin0.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin1.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin2.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin3.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin4.model.1.weight from state_dict.
Deleting key loss.discriminator.blocks.0.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.0.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.0.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.0.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.conv_res.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.conv_res.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.downsample.1.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.downsample.1.bias from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_q.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_kv.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_out.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.conv_res.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.conv_res.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.downsample.1.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.downsample.1.bias from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_q.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_kv.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_out.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.6.0.conv_res.weight from state_dict.
Deleting key loss.discriminator.blocks.6.0.conv_res.bias from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_q.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_kv.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_out.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.bias from state_dict.
Deleting key loss.discriminator.to_logits.0.weight from state_dict.
Deleting key loss.discriminator.to_logits.0.bias from state_dict.
Deleting key loss.discriminator.to_logits.3.weight from state_dict.
Deleting key loss.discriminator.to_logits.3.bias from state_dict.
Missing keys: []
Unexpected keys: []
Restored from /root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt
[2024-09-09 16:39:32,189] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 6764790755
[2024-09-09 16:39:42,369] [INFO] [RANK 0] global rank 0 is loading checkpoint /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/model_io.py:286: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
sd = torch.load(checkpoint_name, map_location='cpu')
[2024-09-09 16:39:43,764] [INFO] [RANK 0] > successfully loaded /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt
[2024-09-09 16:39:45,132] [INFO] [RANK 0] ***** Total trainable parameters: 58982400 *****
[2024-09-09 16:39:45,132] [INFO] [RANK 0] [<class 'sat.ops.layernorm.LayerNorm'>, <class 'torch.nn.modules.normalization.LayerNorm'>, <class 'sat.ops.layernorm.RMSNorm'>] is set to no_weight_decay
[2024-09-09 16:39:45,136] [INFO] [RANK 0] Syncing initialized parameters...
[2024-09-09 16:39:45,239] [INFO] [RANK 0] Finished syncing initialized parameters.
[2024-09-09 16:39:45,239] [INFO] [RANK 0] Using optimizer sat.ops.FusedEmaAdam from sat.
[2024-09-09 16:39:45,239] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
[2024-09-09 16:39:45,240] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2024-09-09 16:39:45,337] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py312_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py312_cu121/fused_ema_adam/build.ninja...
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_ema_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_ema_adam...
Time to load fused_ema_adam op: 0.7258331775665283 seconds
[2024-09-09 16:39:46,219] [INFO] [logging.py:96:log_dist] [Rank 0] Using client callable to create basic optimizer
[2024-09-09 16:39:46,219] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-09-09 16:39:46,239] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedEmaAdam
[2024-09-09 16:39:46,239] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedEmaAdam type=<class 'sat.ops.fused_ema_adam.FusedEmaAdam'>
[2024-09-09 16:39:46,239] [WARNING] [engine.py:1179:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2024-09-09 16:39:46,239] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2024-09-09 16:39:46,239] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 1000000000
[2024-09-09 16:39:46,239] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 1000000000
[2024-09-09 16:39:46,239] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-09-09 16:39:46,239] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[2024-09-09 16:39:48,450] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-09-09 16:39:48,450] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.97 GB CA 13.23 GB Max_CA 13 GB
[2024-09-09 16:39:48,451] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 31.17 GB, percent = 1.6%
[2024-09-09 16:39:48,690] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-09-09 16:39:48,691] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 13.08 GB CA 13.45 GB Max_CA 13 GB
[2024-09-09 16:39:48,691] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 31.19 GB, percent = 1.6%
[2024-09-09 16:39:48,691] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
[2024-09-09 16:39:48,948] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-09-09 16:39:48,949] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 13.45 GB Max_CA 13 GB
[2024-09-09 16:39:48,949] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 31.25 GB, percent = 1.6%
[2024-09-09 16:39:48,953] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2024-09-09 16:39:48,953] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-09-09 16:39:48,953] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-09-09 16:39:48,954] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1.0], mom=[[0.9, 0.95]]
[2024-09-09 16:39:48,956] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2024-09-09 16:39:48,957] [INFO] [config.py:1001:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-09-09 16:39:48,957] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-09-09 16:39:48,957] [INFO] [config.py:1001:print] amp_enabled .................. False
[2024-09-09 16:39:48,957] [INFO] [config.py:1001:print] amp_params ................... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] bfloat16_enabled ............. False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f0915235d60>
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] communication_data_type ...... None
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] dataloader_drop_last ......... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] disable_allgather ............ False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] dump_state ................... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... None
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] elasticity_enabled ........... False
[2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] fp16_auto_cast ............... False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] fp16_enabled ................. True
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] global_rank .................. 0
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] grad_accum_dtype ............. None
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 1
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] gradient_clipping ............ 0.1
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] graph_harvesting ............. False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 65536
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] load_universal_checkpoint .... False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] loss_scale ................... 0
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] memory_breakdown ............. False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] mics_shard_size .............. -1
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] optimizer_name ............... None
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] optimizer_params ............. None
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] pld_enabled .................. False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] pld_params ................... False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] prescale_gradients ........... False
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] scheduler_name ............... None
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] scheduler_params ............. None
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32
[2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] sparse_attention ............. None
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] steps_per_print .............. 50
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] train_batch_size ............. 2
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 2
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] use_data_before_expert_parallel_ False
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] use_node_local_storage ....... False
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] wall_clock_breakdown ......... False
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] weight_quantization_config ... None
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] world_size ................... 1
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] zero_allow_untested_optimizer True
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] zero_config .................. stage=2 contiguous_gradients=False reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=False elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] zero_enabled ................. True
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True
[2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] zero_optimization_stage ...... 2
[2024-09-09 16:39:48,960] [INFO] [config.py:987:print_user_config] json = {
"train_micro_batch_size_per_gpu": 2,
"gradient_accumulation_steps": 1,
"steps_per_print": 50,
"gradient_clipping": 0.1,
"zero_optimization": {
"stage": 2,
"cpu_offload": false,
"contiguous_gradients": false,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 1.000000e+09,
"allgather_bucket_size": 1.000000e+09,
"load_from_fp32_weights": false
},
"zero_allow_untested_optimizer": true,
"bf16": {
"enabled": false
},
"fp16": {
"enabled": true
},
"loss_scale": 0,
"loss_scale_window": 400,
"hysteresis": 2,
"min_loss_scale": 1,
"activation_checkpointing": {
"partition_activations": false,
"contiguous_memory_optimization": false
},
"wall_clock_breakdown": false
}
[2024-09-09 16:39:48,960] [INFO] [RANK 0] learning rate decaying style linear, ratio 10.0
[2024-09-09 16:39:48,960] [INFO] [RANK 0] Finetuning Model...
[2024-09-09 16:39:48,960] [INFO] [RANK 0] arguments:
[2024-09-09 16:39:48,960] [INFO] [RANK 0] base ......................... ['configs/cogvideox_2b_lora.yaml', 'configs/sft.yaml']
[2024-09-09 16:39:48,960] [INFO] [RANK 0] model_parallel_size .......... 1
[2024-09-09 16:39:48,960] [INFO] [RANK 0] force_pretrain ............... False
[2024-09-09 16:39:48,961] [INFO] [RANK 0] device ....................... 0
[2024-09-09 16:39:48,961] [INFO] [RANK 0] debug ........................ False
[2024-09-09 16:39:48,961] [INFO] [RANK 0] log_image .................... True
[2024-09-09 16:39:48,961] [INFO] [RANK 0] output_dir ................... samples
[2024-09-09 16:39:48,961] [INFO] [RANK 0] input_dir .................... None
[2024-09-09 16:39:48,961] [INFO] [RANK 0] input_type ................... cli
[2024-09-09 16:39:48,961] [INFO] [RANK 0] input_file ................... input.txt
[2024-09-09 16:39:48,961] [INFO] [RANK 0] final_size ................... 2048
[2024-09-09 16:39:48,961] [INFO] [RANK 0] sdedit ....................... False
[2024-09-09 16:39:48,961] [INFO] [RANK 0] grid_num_rows ................ 1
[2024-09-09 16:39:48,961] [INFO] [RANK 0] force_inference .............. False
[2024-09-09 16:39:48,961] [INFO] [RANK 0] lcm_steps .................... None
[2024-09-09 16:39:48,961] [INFO] [RANK 0] sampling_num_frames .......... 32
[2024-09-09 16:39:48,961] [INFO] [RANK 0] sampling_fps ................. 8
[2024-09-09 16:39:48,961] [INFO] [RANK 0] only_save_latents ............ False
[2024-09-09 16:39:48,961] [INFO] [RANK 0] only_log_video_latents ....... True
[2024-09-09 16:39:48,961] [INFO] [RANK 0] latent_channels .............. 32
[2024-09-09 16:39:48,961] [INFO] [RANK 0] image2video .................. False
[2024-09-09 16:39:48,961] [INFO] [RANK 0] experiment_name .............. lora-test-09-09-16-39
[2024-09-09 16:39:48,961] [INFO] [RANK 0] train_iters .................. 100
[2024-09-09 16:39:48,961] [INFO] [RANK 0] batch_size ................... 2
[2024-09-09 16:39:48,961] [INFO] [RANK 0] lr ........................... 0.001
[2024-09-09 16:39:48,961] [INFO] [RANK 0] mode ......................... finetune
[2024-09-09 16:39:48,961] [INFO] [RANK 0] seed ......................... 21247
[2024-09-09 16:39:48,961] [INFO] [RANK 0] zero_stage ................... 0
[2024-09-09 16:39:48,961] [INFO] [RANK 0] checkpoint_activations ....... True
[2024-09-09 16:39:48,961] [INFO] [RANK 0] checkpoint_num_layers ........ 1
[2024-09-09 16:39:48,961] [INFO] [RANK 0] checkpoint_skip_layers ....... 0
[2024-09-09 16:39:48,961] [INFO] [RANK 0] fp16 ......................... True
[2024-09-09 16:39:48,961] [INFO] [RANK 0] bf16 ......................... False
[2024-09-09 16:39:48,962] [INFO] [RANK 0] gradient_accumulation_steps .. 1
[2024-09-09 16:39:48,962] [INFO] [RANK 0] profiling .................... -1
[2024-09-09 16:39:48,962] [INFO] [RANK 0] epochs ....................... None
[2024-09-09 16:39:48,962] [INFO] [RANK 0] log_interval ................. 20
[2024-09-09 16:39:48,962] [INFO] [RANK 0] summary_dir ..................
[2024-09-09 16:39:48,962] [INFO] [RANK 0] save_args .................... False
[2024-09-09 16:39:48,962] [INFO] [RANK 0] lr_decay_iters ............... None
[2024-09-09 16:39:48,962] [INFO] [RANK 0] lr_decay_style ............... linear
[2024-09-09 16:39:48,962] [INFO] [RANK 0] lr_decay_ratio ............... 0.1
[2024-09-09 16:39:48,962] [INFO] [RANK 0] warmup ....................... 0.01
[2024-09-09 16:39:48,962] [INFO] [RANK 0] weight_decay ................. 0.0001
[2024-09-09 16:39:48,962] [INFO] [RANK 0] save ......................... ckpts_2b_lora/lora-test-09-09-16-39
[2024-09-09 16:39:48,962] [INFO] [RANK 0] load ......................... /root/CogVideo/CogVideoX-2b-sat/transformer
[2024-09-09 16:39:48,962] [INFO] [RANK 0] force_train .................. True
[2024-09-09 16:39:48,962] [INFO] [RANK 0] save_interval ................ 50
[2024-09-09 16:39:48,962] [INFO] [RANK 0] no_save_rng .................. False
[2024-09-09 16:39:48,962] [INFO] [RANK 0] no_load_rng .................. True
[2024-09-09 16:39:48,962] [INFO] [RANK 0] resume_dataloader ............ False
[2024-09-09 16:39:48,962] [INFO] [RANK 0] distributed_backend .......... nccl
[2024-09-09 16:39:48,962] [INFO] [RANK 0] local_rank ................... 0
[2024-09-09 16:39:48,962] [INFO] [RANK 0] exit_interval ................ None
[2024-09-09 16:39:48,962] [INFO] [RANK 0] wandb ........................ False
[2024-09-09 16:39:48,962] [INFO] [RANK 0] wandb_project_name ........... default_project
[2024-09-09 16:39:48,962] [INFO] [RANK 0] eval_batch_size .............. 1
[2024-09-09 16:39:48,962] [INFO] [RANK 0] eval_iters ................... 1
[2024-09-09 16:39:48,962] [INFO] [RANK 0] eval_interval ................ 10
[2024-09-09 16:39:48,962] [INFO] [RANK 0] strict_eval .................. False
[2024-09-09 16:39:48,962] [INFO] [RANK 0] train_data ................... ['/root/CogVideo/sat/datasets/test']
[2024-09-09 16:39:48,962] [INFO] [RANK 0] train_data_weights ........... None
[2024-09-09 16:39:48,962] [INFO] [RANK 0] iterable_dataset ............. False
[2024-09-09 16:39:48,963] [INFO] [RANK 0] iterable_dataset_eval ........
[2024-09-09 16:39:48,963] [INFO] [RANK 0] batch_from_same_dataset ...... False
[2024-09-09 16:39:48,963] [INFO] [RANK 0] valid_data ................... ['/root/CogVideo/sat/datasets/test']
[2024-09-09 16:39:48,963] [INFO] [RANK 0] test_data .................... None
[2024-09-09 16:39:48,963] [INFO] [RANK 0] split ........................ 1,0,0
[2024-09-09 16:39:48,963] [INFO] [RANK 0] num_workers .................. 8
[2024-09-09 16:39:48,963] [INFO] [RANK 0] block_size ................... 10000
[2024-09-09 16:39:48,963] [INFO] [RANK 0] prefetch_factor .............. 4
[2024-09-09 16:39:48,963] [INFO] [RANK 0] deepspeed .................... True
[2024-09-09 16:39:48,963] [INFO] [RANK 0] deepspeed_config ............. {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}
[2024-09-09 16:39:48,963] [INFO] [RANK 0] deepscale .................... False
[2024-09-09 16:39:48,963] [INFO] [RANK 0] deepscale_config ............. None
[2024-09-09 16:39:48,963] [INFO] [RANK 0] model_config ................. {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False, 'num_layers': 30, 'hidden_size': 1920, 'num_attention_heads': 30, 'parallel_output': True}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}, 'dtype': 'fp16'}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': '/root/CogVideo/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': '/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': 
{'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}
[2024-09-09 16:39:48,963] [INFO] [RANK 0] data_config .................. {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}
[2024-09-09 16:39:48,963] [INFO] [RANK 0] cuda ......................... True
[2024-09-09 16:39:48,963] [INFO] [RANK 0] rank ......................... 0
[2024-09-09 16:39:48,963] [INFO] [RANK 0] world_size ................... 1
[2024-09-09 16:39:48,964] [INFO] [RANK 0] deepspeed_activation_checkpointing True
[2024-09-09 16:39:48,964] [INFO] [RANK 0] master_ip .................... localhost
[2024-09-09 16:39:48,964] [INFO] [RANK 0] master_port .................. 39375
[2024-09-09 16:39:48,964] [INFO] [RANK 0] log_config ................... [{'model': {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': '/root/CogVideo/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': '/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 
50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}}, {'args': {'checkpoint_activations': True, 'model_parallel_size': 1, 'experiment_name': 'lora-test', 'mode': 'finetune', 'load': '/root/CogVideo/CogVideoX-2b-sat/transformer', 'no_load_rng': True, 'train_iters': 100, 'eval_iters': 1, 'eval_interval': 10, 'eval_batch_size': 1, 'save': 'ckpts_2b_lora', 'save_interval': 50, 'log_interval': 20, 'train_data': ['/root/CogVideo/sat/datasets/test'], 'valid_data': ['/root/CogVideo/sat/datasets/test'], 'split': '1,0,0', 'num_workers': 8, 'force_train': True, 'only_log_video_latents': True}, 'data': {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}, 'deepspeed': {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'optimizer': {'type': 'sat.ops.FusedEmaAdam', 'params': {'lr': 0.001, 'betas': [0.9, 0.95], 'eps': '1e-8', 'weight_decay': '1e-4'}}, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}}]
[2024-09-09 16:39:48,964] [INFO] [RANK 0] do_train ..................... True
[2024-09-09 16:39:48,964] [INFO] [RANK 0] val_last_shape ............... []
[2024-09-09 16:39:48,964] [INFO] [RANK 0] val_drop_number .............. 0
[2024-09-09 16:39:48,964] [INFO] [RANK 0] do_valid ..................... True
[2024-09-09 16:39:48,964] [INFO] [RANK 0] do_test ...................... False
[2024-09-09 16:39:48,964] [INFO] [RANK 0] iteration .................... 0
[2024-09-09 16:40:39,276] [INFO] [checkpointing.py:541:forward] Activation Checkpointing Information
[2024-09-09 16:40:39,276] [INFO] [checkpointing.py:542:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-09-09 16:40:39,276] [INFO] [checkpointing.py:543:forward] ----contiguous Memory Checkpointing False with None total layers
[2024-09-09 16:40:39,276] [INFO] [checkpointing.py:545:forward] ----Synchronization False
[2024-09-09 16:40:39,276] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False
[2024-09-09 16:40:49,239] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-09 16:41:14,908] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/CogVideo/sat/train_video.py", line 226, in <module>
[rank0]: training_main(
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 157, in training_main
[rank0]: iteration, skipped = train(model, optimizer,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 359, in train
[rank0]: lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 443, in train_step
[rank0]: forward_ret = forward_step(data_iterator, model, args, timers, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/CogVideo/sat/train_video.py", line 176, in forward_step
[rank0]: batch = next(data_iterator)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank0]: data = self._next_data()
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
[rank0]: return self._process_data(data)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
[rank0]: data.reraise()
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/_utils.py", line 706, in reraise
[rank0]: raise exception
[rank0]: ZeroDivisionError: Caught ZeroDivisionError in DataLoader worker process 2.
[rank0]: Original Traceback (most recent call last):
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank0]: data = fetcher.fetch(index) # type: ignore[possibly-undefined]
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank0]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]: ~~~~~~~~~~~~^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 360, in __getitem__
[rank0]: return self.wrapped_data[index]
[rank0]: ~~~~~~~~~~~~~~~~~^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 342, in __getitem__
[rank0]: return self.datasets[dataset_idx][sample_idx]
[rank0]: ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
[rank0]: File "/root/CogVideo/sat/data_video.py", line 411, in __getitem__
[rank0]: indices = np.arange(start, end, (end - start) // num_frames).astype(int)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ZeroDivisionError: division by zero
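For anyone hitting the same crash: the ZeroDivisionError is raised by np.arange itself, which rejects a step of 0, and the step (end - start) // num_frames floors to 0 whenever the decoded clip has fewer frames than num_frames (49 here, per max_num_frames in the data config). A minimal guard, sketched with the variable names from the traceback rather than the actual data_video.py code:

import numpy as np

def sample_frame_indices(start: int, end: int, num_frames: int) -> np.ndarray:
    # np.arange raises ZeroDivisionError when its step is 0, which is what
    # happens at data_video.py line 411 for clips shorter than num_frames.
    # This hypothetical helper fails with a readable error instead.
    if end - start < num_frames:
        raise ValueError(
            f"clip has {end - start} frames but {num_frames} are required; "
            "filter such videos out of the dataset"
        )
    step = (end - start) // num_frames
    return np.arange(start, end, step).astype(int)[:num_frames]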
[2nd Trial] Retried after keeping only videos with no more than 50 frames:
(cogvideo) root@alphacode-ttv-a100-80g-gpu:~/CogVideo/sat# bash finetune_single_gpu.sh
RUN on alphacode-ttv-a100-80g-gpu, CUDA_VISIBLE_DEVICES=
WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1 python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed 5243
[2024-09-09 16:57:30,500] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@autocast_custom_fwd
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:66: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@autocast_custom_bwd
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/kornia/feature/lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
[2024-09-09 16:57:35,259] [INFO] using world size: 1
[2024-09-09 16:57:35,259] [INFO] Will override arguments with manually specified deepspeed_config!
[W909 16:57:35.100558963 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [ip6-localhost]:57495 (errno: 97 - Address family not supported by protocol).
[W909 16:57:35.104642776 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [alphacode-ttv-a100-80g-gpu]:57495 (errno: 97 - Address family not supported by protocol).
[2024-09-09 16:57:35,282] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-09-09 16:57:35,283] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-09 16:57:35,516] [INFO] [RANK 0] building SATVideoDiffusionEngine model ...
[2024-09-09 16:57:44,744] [WARNING] [RANK 0] Failed to load bitsandbytes:No module named 'bitsandbytes'
[2024-09-09 16:57:44,744] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-09-09 16:57:44,781] [INFO] [RANK 0] replacing layer 1 attention with lora
[2024-09-09 16:57:44,816] [INFO] [RANK 0] replacing layer 2 attention with lora
[2024-09-09 16:57:44,841] [INFO] [RANK 0] replacing layer 3 attention with lora
[2024-09-09 16:57:44,863] [INFO] [RANK 0] replacing layer 4 attention with lora
[2024-09-09 16:57:44,885] [INFO] [RANK 0] replacing layer 5 attention with lora
[2024-09-09 16:57:44,907] [INFO] [RANK 0] replacing layer 6 attention with lora
[2024-09-09 16:57:44,982] [INFO] [RANK 0] replacing layer 7 attention with lora
[2024-09-09 16:57:45,090] [INFO] [RANK 0] replacing layer 8 attention with lora
[2024-09-09 16:57:45,159] [INFO] [RANK 0] replacing layer 9 attention with lora
[2024-09-09 16:57:45,273] [INFO] [RANK 0] replacing layer 10 attention with lora
[2024-09-09 16:57:45,422] [INFO] [RANK 0] replacing layer 11 attention with lora
[2024-09-09 16:57:45,550] [INFO] [RANK 0] replacing layer 12 attention with lora
[2024-09-09 16:57:45,658] [INFO] [RANK 0] replacing layer 13 attention with lora
[2024-09-09 16:57:45,774] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-09-09 16:57:45,905] [INFO] [RANK 0] replacing layer 15 attention with lora
[2024-09-09 16:57:46,027] [INFO] [RANK 0] replacing layer 16 attention with lora
[2024-09-09 16:57:46,102] [INFO] [RANK 0] replacing layer 17 attention with lora
[2024-09-09 16:57:46,195] [INFO] [RANK 0] replacing layer 18 attention with lora
[2024-09-09 16:57:46,302] [INFO] [RANK 0] replacing layer 19 attention with lora
[2024-09-09 16:57:46,347] [INFO] [RANK 0] replacing layer 20 attention with lora
[2024-09-09 16:57:46,375] [INFO] [RANK 0] replacing layer 21 attention with lora
[2024-09-09 16:57:46,397] [INFO] [RANK 0] replacing layer 22 attention with lora
[2024-09-09 16:57:46,419] [INFO] [RANK 0] replacing layer 23 attention with lora
[2024-09-09 16:57:46,440] [INFO] [RANK 0] replacing layer 24 attention with lora
[2024-09-09 16:57:46,461] [INFO] [RANK 0] replacing layer 25 attention with lora
[2024-09-09 16:57:46,483] [INFO] [RANK 0] replacing layer 26 attention with lora
[2024-09-09 16:57:46,504] [INFO] [RANK 0] replacing layer 27 attention with lora
[2024-09-09 16:57:46,526] [INFO] [RANK 0] replacing layer 28 attention with lora
[2024-09-09 16:57:46,547] [INFO] [RANK 0] replacing layer 29 attention with lora
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00, 1.01s/it]
Initialized embedder #0: FrozenT5Embedder with 4762310656 params. Trainable: False
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
/root/CogVideo/sat/vae_modules/autoencoder.py:565: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
sd = torch.load(path, map_location="cpu")["state_dict"]
Deleting key loss.logvar from state_dict.
Deleting key loss.perceptual_loss.scaling_layer.shift from state_dict.
[... 116 similar "Deleting key loss.*" lines elided: every remaining loss.perceptual_loss.* and loss.discriminator.* key is dropped, as expected with ignore_keys: ['loss'] ...]
Deleting key loss.discriminator.to_logits.3.bias from state_dict.
Missing keys: []
Unexpected keys: []
Restored from /root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt
[2024-09-09 16:57:50,806] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 6764790755
[2024-09-09 16:58:00,971] [INFO] [RANK 0] global rank 0 is loading checkpoint /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/model_io.py:286: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
sd = torch.load(checkpoint_name, map_location='cpu')
[2024-09-09 16:58:02,528] [INFO] [RANK 0] > successfully loaded /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt
[2024-09-09 16:58:03,506] [INFO] [RANK 0] ***** Total trainable parameters: 58982400 *****
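The trainable-parameter count is consistent with rank-128 LoRA on the attention projections, assuming LoraMixin adapts the query, key, value and output projections of all 30 layers at hidden size 1920 (an inference from the numbers, not read from the lora2 source):

# Hypothetical sanity check on "Total trainable parameters: 58982400".
r, hidden, layers, projections = 128, 1920, 30, 4
lora_params = layers * projections * (r * hidden + hidden * r)  # A and B matrices
print(lora_params)  # 58982400, matching the log line above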
[2024-09-09 16:58:03,506] [INFO] [RANK 0] [<class 'sat.ops.layernorm.LayerNorm'>, <class 'torch.nn.modules.normalization.LayerNorm'>, <class 'sat.ops.layernorm.RMSNorm'>] is set to no_weight_decay
[2024-09-09 16:58:03,509] [INFO] [RANK 0] Syncing initialized parameters...
[2024-09-09 16:58:03,623] [INFO] [RANK 0] Finished syncing initialized parameters.
[2024-09-09 16:58:03,624] [INFO] [RANK 0] Using optimizer sat.ops.FusedEmaAdam from sat.
[2024-09-09 16:58:03,624] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
[2024-09-09 16:58:03,625] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2024-09-09 16:58:03,717] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py312_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py312_cu121/fused_ema_adam/build.ninja...
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_ema_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_ema_adam...
Time to load fused_ema_adam op: 0.6912670135498047 seconds
[2024-09-09 16:58:04,567] [INFO] [logging.py:96:log_dist] [Rank 0] Using client callable to create basic optimizer
[2024-09-09 16:58:04,567] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-09-09 16:58:04,587] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedEmaAdam
[2024-09-09 16:58:04,587] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedEmaAdam type=<class 'sat.ops.fused_ema_adam.FusedEmaAdam'>
[2024-09-09 16:58:04,587] [WARNING] [engine.py:1179:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2024-09-09 16:58:04,587] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2024-09-09 16:58:04,587] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 1000000000
[2024-09-09 16:58:04,587] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 1000000000
[2024-09-09 16:58:04,587] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-09-09 16:58:04,587] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[2024-09-09 16:58:06,802] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-09-09 16:58:06,803] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.97 GB CA 13.23 GB Max_CA 13 GB
[2024-09-09 16:58:06,803] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 32.85 GB, percent = 1.7%
[2024-09-09 16:58:07,025] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-09-09 16:58:07,025] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 13.08 GB CA 13.45 GB Max_CA 13 GB
[2024-09-09 16:58:07,025] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 32.71 GB, percent = 1.7%
[2024-09-09 16:58:07,025] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
[2024-09-09 16:58:07,246] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-09-09 16:58:07,246] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 13.45 GB Max_CA 13 GB
[2024-09-09 16:58:07,246] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 32.93 GB, percent = 1.7%
[2024-09-09 16:58:07,251] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2024-09-09 16:58:07,251] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-09-09 16:58:07,251] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-09-09 16:58:07,251] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1.0], mom=[[0.9, 0.95]]
[2024-09-09 16:58:07,254] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2024-09-09 16:58:07,254] [INFO] [config.py:1001:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-09-09 16:58:07,254] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-09-09 16:58:07,254] [INFO] [config.py:1001:print] amp_enabled .................. False
[2024-09-09 16:58:07,254] [INFO] [config.py:1001:print] amp_params ................... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] bfloat16_enabled ............. False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fcc80151100>
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] communication_data_type ...... None
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] dataloader_drop_last ......... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] disable_allgather ............ False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] dump_state ................... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... None
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] elasticity_enabled ........... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] fp16_auto_cast ............... False
[2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] fp16_enabled ................. True
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] global_rank .................. 0
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] grad_accum_dtype ............. None
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 1
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] gradient_clipping ............ 0.1
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] graph_harvesting ............. False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 65536
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] load_universal_checkpoint .... False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] loss_scale ................... 0
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] memory_breakdown ............. False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] mics_shard_size .............. -1
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] optimizer_name ............... None
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] optimizer_params ............. None
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] pld_enabled .................. False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] pld_params ................... False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] prescale_gradients ........... False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] scheduler_name ............... None
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] scheduler_params ............. None
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] sparse_attention ............. None
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] steps_per_print .............. 50
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] train_batch_size ............. 2
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 2
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] use_data_before_expert_parallel_ False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] use_node_local_storage ....... False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] wall_clock_breakdown ......... False
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] weight_quantization_config ... None
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] world_size ................... 1
[2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] zero_allow_untested_optimizer True
[2024-09-09 16:58:07,257] [INFO] [config.py:1001:print] zero_config .................. stage=2 contiguous_gradients=False reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=False elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-09-09 16:58:07,257] [INFO] [config.py:1001:print] zero_enabled ................. True
[2024-09-09 16:58:07,257] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True
[2024-09-09 16:58:07,257] [INFO] [config.py:1001:print] zero_optimization_stage ...... 2
[2024-09-09 16:58:07,257] [INFO] [config.py:987:print_user_config] json = {
"train_micro_batch_size_per_gpu": 2,
"gradient_accumulation_steps": 1,
"steps_per_print": 50,
"gradient_clipping": 0.1,
"zero_optimization": {
"stage": 2,
"cpu_offload": false,
"contiguous_gradients": false,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 1.000000e+09,
"allgather_bucket_size": 1.000000e+09,
"load_from_fp32_weights": false
},
"zero_allow_untested_optimizer": true,
"bf16": {
"enabled": false
},
"fp16": {
"enabled": true
},
"loss_scale": 0,
"loss_scale_window": 400,
"hysteresis": 2,
"min_loss_scale": 1,
"activation_checkpointing": {
"partition_activations": false,
"contiguous_memory_optimization": false
},
"wall_clock_breakdown": false
}
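The user config above is the relevant part for the overflow question: fp16.enabled = true activates DeepSpeed's dynamic loss scaler, and that scaler is what keeps skipping steps. On an Ampere GPU like the A100 in this run, one common workaround is to train in bf16 instead, which has fp32's exponent range and needs no loss scaling at all. A sketch of the precision swap, written as a Python dict for brevity; whether the fp16-trained CogVideoX-2B weights stay stable under bf16 is an assumption, not something verified here:

# Hypothetical precision override for the DeepSpeed config printed above.
precision_override = {
    "bf16": {"enabled": True},   # was false
    "fp16": {"enabled": False},  # was true; disabling it removes the dynamic
                                 # loss scaler and hence the OVERFLOW messages
}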
[2024-09-09 16:58:07,257] [INFO] [RANK 0] learning rate decaying style linear, ratio 10.0
[2024-09-09 16:58:07,257] [INFO] [RANK 0] Finetuning Model...
[2024-09-09 16:58:07,257] [INFO] [RANK 0] arguments:
[2024-09-09 16:58:07,257] [INFO] [RANK 0] base ......................... ['configs/cogvideox_2b_lora.yaml', 'configs/sft.yaml']
[2024-09-09 16:58:07,257] [INFO] [RANK 0] model_parallel_size .......... 1
[2024-09-09 16:58:07,257] [INFO] [RANK 0] force_pretrain ............... False
[2024-09-09 16:58:07,257] [INFO] [RANK 0] device ....................... 0
[2024-09-09 16:58:07,257] [INFO] [RANK 0] debug ........................ False
[2024-09-09 16:58:07,257] [INFO] [RANK 0] log_image .................... True
[2024-09-09 16:58:07,257] [INFO] [RANK 0] output_dir ................... samples
[2024-09-09 16:58:07,257] [INFO] [RANK 0] input_dir .................... None
[2024-09-09 16:58:07,257] [INFO] [RANK 0] input_type ................... cli
[2024-09-09 16:58:07,257] [INFO] [RANK 0] input_file ................... input.txt
[2024-09-09 16:58:07,257] [INFO] [RANK 0] final_size ................... 2048
[2024-09-09 16:58:07,257] [INFO] [RANK 0] sdedit ....................... False
[2024-09-09 16:58:07,257] [INFO] [RANK 0] grid_num_rows ................ 1
[2024-09-09 16:58:07,257] [INFO] [RANK 0] force_inference .............. False
[2024-09-09 16:58:07,257] [INFO] [RANK 0] lcm_steps .................... None
[2024-09-09 16:58:07,257] [INFO] [RANK 0] sampling_num_frames .......... 32
[2024-09-09 16:58:07,257] [INFO] [RANK 0] sampling_fps ................. 8
[2024-09-09 16:58:07,258] [INFO] [RANK 0] only_save_latents ............ False
[2024-09-09 16:58:07,258] [INFO] [RANK 0] only_log_video_latents ....... True
[2024-09-09 16:58:07,258] [INFO] [RANK 0] latent_channels .............. 32
[2024-09-09 16:58:07,258] [INFO] [RANK 0] image2video .................. False
[2024-09-09 16:58:07,258] [INFO] [RANK 0] experiment_name .............. lora-test-09-09-16-57
[2024-09-09 16:58:07,258] [INFO] [RANK 0] train_iters .................. 100
[2024-09-09 16:58:07,258] [INFO] [RANK 0] batch_size ................... 2
[2024-09-09 16:58:07,258] [INFO] [RANK 0] lr ........................... 0.001
[2024-09-09 16:58:07,258] [INFO] [RANK 0] mode ......................... finetune
[2024-09-09 16:58:07,258] [INFO] [RANK 0] seed ......................... 5243
[2024-09-09 16:58:07,258] [INFO] [RANK 0] zero_stage ................... 0
[2024-09-09 16:58:07,258] [INFO] [RANK 0] checkpoint_activations ....... True
[2024-09-09 16:58:07,258] [INFO] [RANK 0] checkpoint_num_layers ........ 1
[2024-09-09 16:58:07,258] [INFO] [RANK 0] checkpoint_skip_layers ....... 0
[2024-09-09 16:58:07,258] [INFO] [RANK 0] fp16 ......................... True
[2024-09-09 16:58:07,258] [INFO] [RANK 0] bf16 ......................... False
[2024-09-09 16:58:07,258] [INFO] [RANK 0] gradient_accumulation_steps .. 1
[2024-09-09 16:58:07,258] [INFO] [RANK 0] profiling .................... -1
[2024-09-09 16:58:07,258] [INFO] [RANK 0] epochs ....................... None
[2024-09-09 16:58:07,258] [INFO] [RANK 0] log_interval ................. 20
[2024-09-09 16:58:07,258] [INFO] [RANK 0] summary_dir ..................
[2024-09-09 16:58:07,258] [INFO] [RANK 0] save_args .................... False
[2024-09-09 16:58:07,258] [INFO] [RANK 0] lr_decay_iters ............... None
[2024-09-09 16:58:07,258] [INFO] [RANK 0] lr_decay_style ............... linear
[2024-09-09 16:58:07,258] [INFO] [RANK 0] lr_decay_ratio ............... 0.1
[2024-09-09 16:58:07,258] [INFO] [RANK 0] warmup ....................... 0.01
[2024-09-09 16:58:07,258] [INFO] [RANK 0] weight_decay ................. 0.0001
[2024-09-09 16:58:07,258] [INFO] [RANK 0] save ......................... ckpts_2b_lora/lora-test-09-09-16-57
[2024-09-09 16:58:07,258] [INFO] [RANK 0] load ......................... /root/CogVideo/CogVideoX-2b-sat/transformer
[2024-09-09 16:58:07,258] [INFO] [RANK 0] force_train .................. True
[2024-09-09 16:58:07,258] [INFO] [RANK 0] save_interval ................ 50
[2024-09-09 16:58:07,258] [INFO] [RANK 0] no_save_rng .................. False
[2024-09-09 16:58:07,258] [INFO] [RANK 0] no_load_rng .................. True
[2024-09-09 16:58:07,259] [INFO] [RANK 0] resume_dataloader ............ False
[2024-09-09 16:58:07,259] [INFO] [RANK 0] distributed_backend .......... nccl
[2024-09-09 16:58:07,259] [INFO] [RANK 0] local_rank ................... 0
[2024-09-09 16:58:07,259] [INFO] [RANK 0] exit_interval ................ None
[2024-09-09 16:58:07,259] [INFO] [RANK 0] wandb ........................ False
[2024-09-09 16:58:07,259] [INFO] [RANK 0] wandb_project_name ........... default_project
[2024-09-09 16:58:07,259] [INFO] [RANK 0] eval_batch_size .............. 1
[2024-09-09 16:58:07,259] [INFO] [RANK 0] eval_iters ................... 1
[2024-09-09 16:58:07,259] [INFO] [RANK 0] eval_interval ................ 10
[2024-09-09 16:58:07,259] [INFO] [RANK 0] strict_eval .................. False
[2024-09-09 16:58:07,259] [INFO] [RANK 0] train_data ................... ['/root/CogVideo/sat/datasets/test']
[2024-09-09 16:58:07,259] [INFO] [RANK 0] train_data_weights ........... None
[2024-09-09 16:58:07,259] [INFO] [RANK 0] iterable_dataset ............. False
[2024-09-09 16:58:07,259] [INFO] [RANK 0] iterable_dataset_eval ........
[2024-09-09 16:58:07,259] [INFO] [RANK 0] batch_from_same_dataset ...... False
[2024-09-09 16:58:07,259] [INFO] [RANK 0] valid_data ................... ['/root/CogVideo/sat/datasets/test']
[2024-09-09 16:58:07,259] [INFO] [RANK 0] test_data .................... None
[2024-09-09 16:58:07,259] [INFO] [RANK 0] split ........................ 1,0,0
[2024-09-09 16:58:07,259] [INFO] [RANK 0] num_workers .................. 8
[2024-09-09 16:58:07,259] [INFO] [RANK 0] block_size ................... 10000
[2024-09-09 16:58:07,259] [INFO] [RANK 0] prefetch_factor .............. 4
[2024-09-09 16:58:07,259] [INFO] [RANK 0] deepspeed .................... True
[2024-09-09 16:58:07,259] [INFO] [RANK 0] deepspeed_config ............. {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}
[2024-09-09 16:58:07,259] [INFO] [RANK 0] deepscale .................... False
[2024-09-09 16:58:07,259] [INFO] [RANK 0] deepscale_config ............. None
[2024-09-09 16:58:07,260] [INFO] [RANK 0] model_config ................. {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False, 'num_layers': 30, 'hidden_size': 1920, 'num_attention_heads': 30, 'parallel_output': True}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}, 'dtype': 'fp16'}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': '/root/CogVideo/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': '/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': 
{'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}
[2024-09-09 16:58:07,260] [INFO] [RANK 0] data_config .................. {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}
[2024-09-09 16:58:07,260] [INFO] [RANK 0] cuda ......................... True
[2024-09-09 16:58:07,260] [INFO] [RANK 0] rank ......................... 0
[2024-09-09 16:58:07,260] [INFO] [RANK 0] world_size ................... 1
[2024-09-09 16:58:07,260] [INFO] [RANK 0] deepspeed_activation_checkpointing True
[2024-09-09 16:58:07,260] [INFO] [RANK 0] master_ip .................... localhost
[2024-09-09 16:58:07,260] [INFO] [RANK 0] master_port .................. 57495
[2024-09-09 16:58:07,260] [INFO] [RANK 0] log_config ................... [{'model': {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': '/root/CogVideo/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': '/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 
50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}}, {'args': {'checkpoint_activations': True, 'model_parallel_size': 1, 'experiment_name': 'lora-test', 'mode': 'finetune', 'load': '/root/CogVideo/CogVideoX-2b-sat/transformer', 'no_load_rng': True, 'train_iters': 100, 'eval_iters': 1, 'eval_interval': 10, 'eval_batch_size': 1, 'save': 'ckpts_2b_lora', 'save_interval': 50, 'log_interval': 20, 'train_data': ['/root/CogVideo/sat/datasets/test'], 'valid_data': ['/root/CogVideo/sat/datasets/test'], 'split': '1,0,0', 'num_workers': 8, 'force_train': True, 'only_log_video_latents': True}, 'data': {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}, 'deepspeed': {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'optimizer': {'type': 'sat.ops.FusedEmaAdam', 'params': {'lr': 0.001, 'betas': [0.9, 0.95], 'eps': '1e-8', 'weight_decay': '1e-4'}}, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}}]
[2024-09-09 16:58:07,260] [INFO] [RANK 0] do_train ..................... True
[2024-09-09 16:58:07,260] [INFO] [RANK 0] val_last_shape ............... []
[2024-09-09 16:58:07,260] [INFO] [RANK 0] val_drop_number .............. 0
[2024-09-09 16:58:07,260] [INFO] [RANK 0] do_valid ..................... True
[2024-09-09 16:58:07,260] [INFO] [RANK 0] do_test ...................... False
[2024-09-09 16:58:07,260] [INFO] [RANK 0] iteration .................... 0
[2024-09-09 16:58:56,248] [INFO] [checkpointing.py:541:forward] Activation Checkpointing Information
[2024-09-09 16:58:56,248] [INFO] [checkpointing.py:542:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-09-09 16:58:56,248] [INFO] [checkpointing.py:543:forward] ----contiguous Memory Checkpointing False with None total layers
[2024-09-09 16:58:56,248] [INFO] [checkpointing.py:545:forward] ----Synchronization False
[2024-09-09 16:58:56,248] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False
[2024-09-09 16:59:06,008] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-09 16:59:29,703] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-09-09 16:59:53,115] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-09-09 17:01:04,649] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-09-09 17:01:51,938] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
/root/CogVideo/sat/train_video.py:67: DeprecationWarning: torch.get_autocast_gpu_dtype() is deprecated. Please use torch.get_autocast_dtype('cuda') instead. (Triggered internally at ../torch/csrc/autograd/init.cpp:733.)
"dtype": torch.get_autocast_gpu_dtype(),
/root/CogVideo/sat/train_video.py:70: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.no_grad(), torch.cuda.amp.autocast(**gpu_autocast_kwargs):
############################## Sampling setting ##############################
Sampler: VPSDEDPMPP2MSampler
Discretization: ZeroSNRDDPMDiscretization
Guider: DynamicCFG
Sampling with VPSDEDPMPP2MSampler for 51 steps: 98%|█████████▉| 50/51 [01:24<00:01, 1.69s/it]
[2024-09-09 17:04:19,474] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------------
[2024-09-09 17:04:19,474] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[2024-09-09 17:04:19,474] [INFO] [RANK 0] validation loss at iteration 10 | loss: 1.002032E-01 | PPL: 1.105395E+00 loss 1.002032E-01 |
[2024-09-09 17:04:19,474] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[2024-09-09 17:05:49,038] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/CogVideo/sat/train_video.py", line 226, in <module>
[rank0]: training_main(
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 157, in training_main
[rank0]: iteration, skipped = train(model, optimizer,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 359, in train
[rank0]: lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 443, in train_step
[rank0]: forward_ret = forward_step(data_iterator, model, args, timers, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/CogVideo/sat/train_video.py", line 176, in forward_step
[rank0]: batch = next(data_iterator)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank0]: data = self._next_data()
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
[rank0]: return self._process_data(data)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
[rank0]: data.reraise()
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/_utils.py", line 706, in reraise
[rank0]: raise exception
[rank0]: ZeroDivisionError: Caught ZeroDivisionError in DataLoader worker process 7.
[rank0]: Original Traceback (most recent call last):
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank0]: data = fetcher.fetch(index) # type: ignore[possibly-undefined]
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank0]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]: ~~~~~~~~~~~~^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 360, in __getitem__
[rank0]: return self.wrapped_data[index]
[rank0]: ~~~~~~~~~~~~~~~~~^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 342, in __getitem__
[rank0]: return self.datasets[dataset_idx][sample_idx]
[rank0]: ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
[rank0]: File "/root/CogVideo/sat/data_video.py", line 411, in __getitem__
[rank0]: indices = np.arange(start, end, (end - start) // num_frames).astype(int)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ZeroDivisionError: division by zero
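Note: this crash is unrelated to the loss-scale overflows. In data_video.py the stride (end - start) // num_frames evaluates to 0 whenever a clip, after skip_frms_num frames are trimmed from each end, holds fewer frames than num_frames, and np.arange(start, end, 0) then raises ZeroDivisionError inside the DataLoader worker. With the config above (fps: 8, max_num_frames: 49, skip_frms_num: 3.0), any video much shorter than roughly 7 seconds can trigger it. A minimal sketch of a guard is below; the function name and the truncate-or-pad policy are assumptions, not the repository's actual fix:

import numpy as np

def sample_frame_indices(start, end, num_frames):
    # Hypothetical replacement for the stride computation in
    # data_video.py __getitem__: clamp the stride to >= 1 so a
    # short clip no longer crashes the worker with a zero step.
    total = end - start
    if total <= 0:
        raise ValueError(f"empty clip: start={start}, end={end}")
    stride = max(total // num_frames, 1)
    indices = np.arange(start, end, stride).astype(int)
    # A clip shorter than num_frames still yields fewer indices;
    # the caller would have to pad (e.g. repeat the last frame)
    # or drop such samples rather than train on them as-is.
    return indices[:num_frames]

Alternatively, filtering the dataset so every video is long enough for max_num_frames avoids the crash without touching the code.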
[3rd Trial] Reduced train_micro_batch_size_per_gpu from 2 to 1 and reran:
(cogvideo) root@alphacode-ttv-a100-80g-gpu:~/CogVideo/sat# bash finetune_single_gpu.sh
RUN on alphacode-ttv-a100-80g-gpu, CUDA_VISIBLE_DEVICES=0
WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1 python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed 27481
[2024-09-10 13:30:54,235] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@autocast_custom_fwd
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:66: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@autocast_custom_bwd
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/kornia/feature/lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
[2024-09-10 13:30:59,512] [INFO] using world size: 1
[2024-09-10 13:30:59,512] [INFO] Will override arguments with manually specified deepspeed_config!
[W910 13:30:59.341356778 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [ip6-localhost]:44107 (errno: 97 - Address family not supported by protocol).
[W910 13:30:59.342068481 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [alphacode-ttv-a100-80g-gpu]:44107 (errno: 97 - Address family not supported by protocol).
[2024-09-10 13:30:59,519] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-09-10 13:30:59,520] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-10 13:30:59,755] [INFO] [RANK 0] building SATVideoDiffusionEngine model ...
[2024-09-10 13:31:08,092] [INFO] [RANK 0] replacing layer 0 attention with lora
[… 28 similar lines elided: replacing layer 1 through layer 28 attention with lora …]
[2024-09-10 13:31:09,705] [INFO] [RANK 0] replacing layer 29 attention with lora
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.13it/s]
Initialized embedder #0: FrozenT5Embedder with 4762310656 params. Trainable: False
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
/root/CogVideo/sat/vae_modules/autoencoder.py:565: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
sd = torch.load(path, map_location="cpu")["state_dict"]
Deleting key loss.logvar from state_dict.
[… 118 similar lines elided: every remaining loss.perceptual_loss.* and loss.discriminator.* key is deleted from the state_dict, consistent with 'ignore_keys': ['loss'] in the first_stage_config …]
Missing keys: []
Unexpected keys: []
Restored from /root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt
[2024-09-10 13:31:15,450] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 6764790755
[2024-09-10 13:31:26,160] [INFO] [RANK 0] global rank 0 is loading checkpoint /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/model_io.py:286: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
sd = torch.load(checkpoint_name, map_location='cpu')
[2024-09-10 13:31:27,666] [INFO] [RANK 0] > successfully loaded /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt
[2024-09-10 13:31:28,191] [INFO] [RANK 0] ***** Total trainable parameters: 58982400 *****
[2024-09-10 13:31:28,191] [INFO] [RANK 0] [<class 'sat.ops.layernorm.LayerNorm'>, <class 'torch.nn.modules.normalization.LayerNorm'>, <class 'sat.ops.layernorm.RMSNorm'>] is set to no_weight_decay
[2024-09-10 13:31:28,194] [INFO] [RANK 0] Syncing initialized parameters...
[2024-09-10 13:31:28,302] [INFO] [RANK 0] Finished syncing initialized parameters.
[2024-09-10 13:31:28,302] [INFO] [RANK 0] Using optimizer sat.ops.FusedEmaAdam from sat.
[2024-09-10 13:31:28,302] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
[2024-09-10 13:31:28,303] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2024-09-10 13:31:28,390] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py312_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py312_cu121/fused_ema_adam/build.ninja...
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_ema_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_ema_adam...
Time to load fused_ema_adam op: 0.7197697162628174 seconds
[2024-09-10 13:31:29,264] [INFO] [logging.py:96:log_dist] [Rank 0] Using client callable to create basic optimizer
[2024-09-10 13:31:29,264] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-09-10 13:31:29,284] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedEmaAdam
[2024-09-10 13:31:29,284] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedEmaAdam type=<class 'sat.ops.fused_ema_adam.FusedEmaAdam'>
[2024-09-10 13:31:29,284] [WARNING] [engine.py:1179:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2024-09-10 13:31:29,284] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2024-09-10 13:31:29,284] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 1000000000
[2024-09-10 13:31:29,284] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 1000000000
[2024-09-10 13:31:29,284] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-09-10 13:31:29,284] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[2024-09-10 13:31:31,672] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-09-10 13:31:31,673] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.97 GB CA 13.23 GB Max_CA 13 GB
[2024-09-10 13:31:31,673] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 28.67 GB, percent = 1.5%
[2024-09-10 13:31:31,880] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-09-10 13:31:31,880] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 13.08 GB CA 13.45 GB Max_CA 13 GB
[2024-09-10 13:31:31,880] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 28.41 GB, percent = 1.5%
[2024-09-10 13:31:31,880] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
[2024-09-10 13:31:32,107] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-09-10 13:31:32,107] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 13.45 GB Max_CA 13 GB
[2024-09-10 13:31:32,107] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 28.58 GB, percent = 1.5%
[2024-09-10 13:31:32,111] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2024-09-10 13:31:32,112] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-09-10 13:31:32,112] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-09-10 13:31:32,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1.0], mom=[[0.9, 0.95]]
[2024-09-10 13:31:32,114] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2024-09-10 13:31:32,114] [INFO] [config.py:1001:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] amp_enabled .................. False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] amp_params ................... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] bfloat16_enabled ............. False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fa0f4c776e0>
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] communication_data_type ...... None
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] dataloader_drop_last ......... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] disable_allgather ............ False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] dump_state ................... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... None
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] elasticity_enabled ........... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] fp16_auto_cast ............... False
[2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] fp16_enabled ................. True
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] global_rank .................. 0
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] grad_accum_dtype ............. None
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 1
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] gradient_clipping ............ 0.1
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] graph_harvesting ............. False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 65536
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] load_universal_checkpoint .... False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] loss_scale ................... 0
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] memory_breakdown ............. False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] mics_shard_size .............. -1
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] optimizer_name ............... None
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] optimizer_params ............. None
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] pld_enabled .................. False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] pld_params ................... False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] prescale_gradients ........... False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] scheduler_name ............... None
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] scheduler_params ............. None
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] sparse_attention ............. None
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] steps_per_print .............. 50
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] train_batch_size ............. 1
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 1
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] use_data_before_expert_parallel_ False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] use_node_local_storage ....... False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] wall_clock_breakdown ......... False
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] weight_quantization_config ... None
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] world_size ................... 1
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] zero_allow_untested_optimizer True
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] zero_config .................. stage=2 contiguous_gradients=False reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=False elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] zero_enabled ................. True
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True
[2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] zero_optimization_stage ...... 2
[2024-09-10 13:31:32,117] [INFO] [config.py:987:print_user_config] json = {
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"steps_per_print": 50,
"gradient_clipping": 0.1,
"zero_optimization": {
"stage": 2,
"cpu_offload": false,
"contiguous_gradients": false,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 1.000000e+09,
"allgather_bucket_size": 1.000000e+09,
"load_from_fp32_weights": false
},
"zero_allow_untested_optimizer": true,
"bf16": {
"enabled": false
},
"fp16": {
"enabled": true
},
"loss_scale": 0,
"loss_scale_window": 400,
"hysteresis": 2,
"min_loss_scale": 1,
"activation_checkpointing": {
"partition_activations": false,
"contiguous_memory_optimization": false
},
"wall_clock_breakdown": false
}
[2024-09-10 13:31:32,117] [INFO] [RANK 0] learning rate decaying style linear, ratio 10.0
[2024-09-10 13:31:32,117] [INFO] [RANK 0] Finetuning Model...
[2024-09-10 13:31:32,117] [INFO] [RANK 0] arguments:
[2024-09-10 13:31:32,117] [INFO] [RANK 0] base ......................... ['configs/cogvideox_2b_lora.yaml', 'configs/sft.yaml']
[2024-09-10 13:31:32,117] [INFO] [RANK 0] model_parallel_size .......... 1
[2024-09-10 13:31:32,117] [INFO] [RANK 0] force_pretrain ............... False
[2024-09-10 13:31:32,117] [INFO] [RANK 0] device ....................... 0
[2024-09-10 13:31:32,117] [INFO] [RANK 0] debug ........................ False
[2024-09-10 13:31:32,117] [INFO] [RANK 0] log_image .................... True
[2024-09-10 13:31:32,117] [INFO] [RANK 0] output_dir ................... samples
[2024-09-10 13:31:32,117] [INFO] [RANK 0] input_dir .................... None
[2024-09-10 13:31:32,117] [INFO] [RANK 0] input_type ................... cli
[2024-09-10 13:31:32,117] [INFO] [RANK 0] input_file ................... input.txt
[2024-09-10 13:31:32,117] [INFO] [RANK 0] final_size ................... 2048
[2024-09-10 13:31:32,117] [INFO] [RANK 0] sdedit ....................... False
[2024-09-10 13:31:32,117] [INFO] [RANK 0] grid_num_rows ................ 1
[2024-09-10 13:31:32,117] [INFO] [RANK 0] force_inference .............. False
[2024-09-10 13:31:32,117] [INFO] [RANK 0] lcm_steps .................... None
[2024-09-10 13:31:32,117] [INFO] [RANK 0] sampling_num_frames .......... 32
[2024-09-10 13:31:32,117] [INFO] [RANK 0] sampling_fps ................. 8
[2024-09-10 13:31:32,117] [INFO] [RANK 0] only_save_latents ............ False
[2024-09-10 13:31:32,117] [INFO] [RANK 0] only_log_video_latents ....... True
[2024-09-10 13:31:32,117] [INFO] [RANK 0] latent_channels .............. 32
[2024-09-10 13:31:32,117] [INFO] [RANK 0] image2video .................. False
[2024-09-10 13:31:32,117] [INFO] [RANK 0] experiment_name .............. lora-test-09-10-13-30
[2024-09-10 13:31:32,117] [INFO] [RANK 0] train_iters .................. 100
[2024-09-10 13:31:32,117] [INFO] [RANK 0] batch_size ................... 1
[2024-09-10 13:31:32,117] [INFO] [RANK 0] lr ........................... 0.001
[2024-09-10 13:31:32,117] [INFO] [RANK 0] mode ......................... finetune
[2024-09-10 13:31:32,117] [INFO] [RANK 0] seed ......................... 27481
[2024-09-10 13:31:32,117] [INFO] [RANK 0] zero_stage ................... 0
[2024-09-10 13:31:32,117] [INFO] [RANK 0] checkpoint_activations ....... True
[2024-09-10 13:31:32,117] [INFO] [RANK 0] checkpoint_num_layers ........ 1
[2024-09-10 13:31:32,117] [INFO] [RANK 0] checkpoint_skip_layers ....... 0
[2024-09-10 13:31:32,118] [INFO] [RANK 0] fp16 ......................... True
[2024-09-10 13:31:32,118] [INFO] [RANK 0] bf16 ......................... False
[2024-09-10 13:31:32,118] [INFO] [RANK 0] gradient_accumulation_steps .. 1
[2024-09-10 13:31:32,118] [INFO] [RANK 0] profiling .................... -1
[2024-09-10 13:31:32,118] [INFO] [RANK 0] epochs ....................... None
[2024-09-10 13:31:32,118] [INFO] [RANK 0] log_interval ................. 20
[2024-09-10 13:31:32,118] [INFO] [RANK 0] summary_dir ..................
[2024-09-10 13:31:32,118] [INFO] [RANK 0] save_args .................... False
[2024-09-10 13:31:32,118] [INFO] [RANK 0] lr_decay_iters ............... None
[2024-09-10 13:31:32,118] [INFO] [RANK 0] lr_decay_style ............... linear
[2024-09-10 13:31:32,118] [INFO] [RANK 0] lr_decay_ratio ............... 0.1
[2024-09-10 13:31:32,118] [INFO] [RANK 0] warmup ....................... 0.01
[2024-09-10 13:31:32,118] [INFO] [RANK 0] weight_decay ................. 0.0001
[2024-09-10 13:31:32,118] [INFO] [RANK 0] save ......................... ckpts_2b_lora/lora-test-09-10-13-30
[2024-09-10 13:31:32,118] [INFO] [RANK 0] load ......................... /root/CogVideo/CogVideoX-2b-sat/transformer
[2024-09-10 13:31:32,118] [INFO] [RANK 0] force_train .................. True
[2024-09-10 13:31:32,118] [INFO] [RANK 0] save_interval ................ 50
[2024-09-10 13:31:32,118] [INFO] [RANK 0] no_save_rng .................. False
[2024-09-10 13:31:32,118] [INFO] [RANK 0] no_load_rng .................. True
[2024-09-10 13:31:32,118] [INFO] [RANK 0] resume_dataloader ............ False
[2024-09-10 13:31:32,118] [INFO] [RANK 0] distributed_backend .......... nccl
[2024-09-10 13:31:32,118] [INFO] [RANK 0] local_rank ................... 0
[2024-09-10 13:31:32,118] [INFO] [RANK 0] exit_interval ................ None
[2024-09-10 13:31:32,118] [INFO] [RANK 0] wandb ........................ False
[2024-09-10 13:31:32,118] [INFO] [RANK 0] wandb_project_name ........... default_project
[2024-09-10 13:31:32,118] [INFO] [RANK 0] eval_batch_size .............. 1
[2024-09-10 13:31:32,118] [INFO] [RANK 0] eval_iters ................... 1
[2024-09-10 13:31:32,118] [INFO] [RANK 0] eval_interval ................ 10
[2024-09-10 13:31:32,118] [INFO] [RANK 0] strict_eval .................. False
[2024-09-10 13:31:32,118] [INFO] [RANK 0] train_data ................... ['/root/CogVideo/sat/datasets/test']
[2024-09-10 13:31:32,118] [INFO] [RANK 0] train_data_weights ........... None
[2024-09-10 13:31:32,118] [INFO] [RANK 0] iterable_dataset ............. False
[2024-09-10 13:31:32,118] [INFO] [RANK 0] iterable_dataset_eval ........
[2024-09-10 13:31:32,118] [INFO] [RANK 0] batch_from_same_dataset ...... False
[2024-09-10 13:31:32,118] [INFO] [RANK 0] valid_data ................... ['/root/CogVideo/sat/datasets/test']
[2024-09-10 13:31:32,118] [INFO] [RANK 0] test_data .................... None
[2024-09-10 13:31:32,118] [INFO] [RANK 0] split ........................ 1,0,0
[2024-09-10 13:31:32,118] [INFO] [RANK 0] num_workers .................. 8
[2024-09-10 13:31:32,118] [INFO] [RANK 0] block_size ................... 10000
[2024-09-10 13:31:32,118] [INFO] [RANK 0] prefetch_factor .............. 4
[2024-09-10 13:31:32,118] [INFO] [RANK 0] deepspeed .................... True
[2024-09-10 13:31:32,118] [INFO] [RANK 0] deepspeed_config ............. {'train_micro_batch_size_per_gpu': 1, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}
[2024-09-10 13:31:32,118] [INFO] [RANK 0] deepscale .................... False
[2024-09-10 13:31:32,118] [INFO] [RANK 0] deepscale_config ............. None
[2024-09-10 13:31:32,119] [INFO] [RANK 0] model_config ................. {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False, 'num_layers': 30, 'hidden_size': 1920, 'num_attention_heads': 30, 'parallel_output': True}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}, 'dtype': 'fp16'}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': '/root/CogVideo/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': '/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': 
{'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}
[2024-09-10 13:31:32,119] [INFO] [RANK 0] data_config .................. {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}
[2024-09-10 13:31:32,119] [INFO] [RANK 0] cuda ......................... True
[2024-09-10 13:31:32,119] [INFO] [RANK 0] rank ......................... 0
[2024-09-10 13:31:32,119] [INFO] [RANK 0] world_size ................... 1
[2024-09-10 13:31:32,119] [INFO] [RANK 0] deepspeed_activation_checkpointing True
[2024-09-10 13:31:32,119] [INFO] [RANK 0] master_ip .................... localhost
[2024-09-10 13:31:32,119] [INFO] [RANK 0] master_port .................. 44107
[2024-09-10 13:31:32,119] [INFO] [RANK 0] log_config ................... [identical to the log_config dump from the first run above, except 'train_micro_batch_size_per_gpu': 1]
[2024-09-10 13:31:32,119] [INFO] [RANK 0] do_train ..................... True
[2024-09-10 13:31:32,119] [INFO] [RANK 0] val_last_shape ............... []
[2024-09-10 13:31:32,119] [INFO] [RANK 0] val_drop_number .............. 0
[2024-09-10 13:31:32,119] [INFO] [RANK 0] do_valid ..................... True
[2024-09-10 13:31:32,119] [INFO] [RANK 0] do_test ...................... False
[2024-09-10 13:31:32,119] [INFO] [RANK 0] iteration .................... 0
[2024-09-10 13:32:00,623] [INFO] [checkpointing.py:541:forward] Activation Checkpointing Information
[2024-09-10 13:32:00,623] [INFO] [checkpointing.py:542:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-09-10 13:32:00,623] [INFO] [checkpointing.py:543:forward] ----contiguous Memory Checkpointing False with None total layers
[2024-09-10 13:32:00,623] [INFO] [checkpointing.py:545:forward] ----Synchronization False
[2024-09-10 13:32:00,623] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False
[2024-09-10 13:32:06,525] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-10 13:32:15,902] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-09-10 13:32:24,779] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-09-10 13:32:33,800] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-09-10 13:32:43,291] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
[2024-09-10 13:33:28,030] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
/root/CogVideo/sat/train_video.py:67: DeprecationWarning: torch.get_autocast_gpu_dtype() is deprecated. Please use torch.get_autocast_dtype('cuda') instead. (Triggered internally at ../torch/csrc/autograd/init.cpp:733.)
"dtype": torch.get_autocast_gpu_dtype(),
/root/CogVideo/sat/train_video.py:70: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.no_grad(), torch.cuda.amp.autocast(**gpu_autocast_kwargs):
############################## Sampling setting ##############################
Sampler: VPSDEDPMPP2MSampler
Discretization: ZeroSNRDDPMDiscretization
Guider: DynamicCFG
Sampling with VPSDEDPMPP2MSampler for 51 steps: 98%|███████████████████▏ | 50/51 [01:24<00:01, 1.70s/it]
[2024-09-10 13:34:59,554] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------------
[2024-09-10 13:34:59,555] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[2024-09-10 13:34:59,555] [INFO] [RANK 0] validation loss at iteration 10 | loss: 1.391026E-01 | PPL: 1.149242E+00 loss 1.391026E-01 |
[2024-09-10 13:34:59,555] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[2024-09-10 13:35:16,965] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
[2024-09-10 13:36:28,365] [INFO] [RANK 0] iteration 20/ 100 | elapsed time per iteration (ms): 14758.9 | learning rate 5.000E-05 | total loss 1.892787E-01 | loss 1.892786E-01 | loss scale 33554432.0 |speed 4.07 samples/(min*GPU)
[2024-09-10 13:36:28,366] [INFO] [RANK 0] after 20 iterations memory (MB) | allocated: 13974.6455078125 | max allocated: 38562.94677734375 | cached: 18572.0 | max cached: 53186.0
[2024-09-10 13:36:28,367] [INFO] [RANK 0] time (ms) | forward: 4717.04 | backward: 5432.07 | allreduce: 0.00 | optimizer: 32.39 | data loader: 90.08
############################## Sampling setting ##############################
Sampler: VPSDEDPMPP2MSampler
Discretization: ZeroSNRDDPMDiscretization
Guider: DynamicCFG
Sampling with VPSDEDPMPP2MSampler for 51 steps: 98%|███████████████████▏ | 50/51 [01:24<00:01, 1.70s/it]
[2024-09-10 13:37:59,450] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------------
[2024-09-10 13:37:59,450] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[2024-09-10 13:37:59,450] [INFO] [RANK 0] validation loss at iteration 20 | loss: 1.256772E-01 | PPL: 1.133916E+00 loss 1.256772E-01 |
[2024-09-10 13:37:59,450] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
############################## Sampling setting ##############################
Sampler: VPSDEDPMPP2MSampler
Discretization: ZeroSNRDDPMDiscretization
Guider: DynamicCFG
Sampling with VPSDEDPMPP2MSampler for 51 steps: 98%|███████████████████▏ | 50/51 [01:25<00:01, 1.70s/it]
[2024-09-10 13:40:59,756] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------------
[2024-09-10 13:40:59,756] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[2024-09-10 13:40:59,756] [INFO] [RANK 0] validation loss at iteration 30 | loss: 2.129551E-01 | PPL: 1.237329E+00 loss 2.129551E-01 |
[2024-09-10 13:40:59,756] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/CogVideo/sat/train_video.py", line 226, in <module>
[rank0]: training_main(
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 157, in training_main
[rank0]: iteration, skipped = train(model, optimizer,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 359, in train
[rank0]: lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 443, in train_step
[rank0]: forward_ret = forward_step(data_iterator, model, args, timers, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/CogVideo/sat/train_video.py", line 176, in forward_step
[rank0]: batch = next(data_iterator)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank0]: data = self._next_data()
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
[rank0]: return self._process_data(data)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
[rank0]: data.reraise()
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/_utils.py", line 706, in reraise
[rank0]: raise exception
[rank0]: ZeroDivisionError: Caught ZeroDivisionError in DataLoader worker process 6.
[rank0]: Original Traceback (most recent call last):
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank0]: data = fetcher.fetch(index) # type: ignore[possibly-undefined]
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank0]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]: ~~~~~~~~~~~~^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 360, in __getitem__
[rank0]: return self.wrapped_data[index]
[rank0]: ~~~~~~~~~~~~~~~~~^^^^^^^
[rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 342, in __getitem__
[rank0]: return self.datasets[dataset_idx][sample_idx]
[rank0]: ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
[rank0]: File "/root/CogVideo/sat/data_video.py", line 411, in __getitem__
[rank0]: indices = np.arange(start, end, (end - start) // num_frames).astype(int)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ZeroDivisionError: division by zero
DONE on alphacode-ttv-a100-80g-gpu
(cogvideo) root@alphacode-ttv-a100-80g-gpu:~/CogVideo/sat#
Same issue on an A100 80G. I tried both the 2b and 5b versions (fp16 & bf16) and reduced the lr from 1e-3 to 1e-5 (see https://github.com/THUDM/ChatGLM-6B/issues/1008), but hit the same error.
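Note that the actual crash at the end of that log is unrelated to the fp16 overflows: in data_video.py, the stride (end - start) // num_frames evaluates to 0 whenever a clip (after dropping skip_frms_num frames) has fewer frames than max_num_frames (49 in this config), so np.arange raises ZeroDivisionError. A minimal guard, as a sketch only (sample_frame_indices is a hypothetical helper, not the repo's actual function):

import numpy as np

def sample_frame_indices(start, end, num_frames):
    # Stride used by data_video.py; it is 0 for clips shorter than num_frames.
    stride = (end - start) // num_frames
    if stride < 1:
        # Clip too short: take the frames that exist and pad by repeating the
        # last one (alternatively, filter such short videos out upstream).
        indices = np.arange(start, end)
        pad = np.full(num_frames - len(indices), max(end - 1, start))
        return np.concatenate([indices, pad]).astype(int)
    return np.arange(start, end, stride).astype(int)[:num_frames]

In practice the simplest fix is to check the dataset first: with this config, every training video needs at least max_num_frames + 2 * skip_frms_num frames (roughly 55) or this worker crash will recur.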
It is normal for steps to be skipped while the loss scale is still large at the beginning of training; you can see that only a small number of steps are skipped within the first 50. Once training stabilizes, it will not happen again.
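For anyone wondering what those OVERFLOW lines mean mechanically: with 'fp16': {'enabled': True} and 'loss_scale': 0, DeepSpeed uses dynamic loss scaling. A rough illustrative sketch of the behavior, using the values from the config above ('loss_scale_window': 400, 'min_loss_scale': 1); the real implementation lives in loss_scaler.py and also honors 'hysteresis':

def update_scale(scale, overflow, good_steps, window=400, min_scale=1.0):
    if overflow:
        # Gradients contained inf/nan: skip the optimizer step and halve the
        # scale -- exactly the "OVERFLOW! ... reducing to ..." lines above.
        return max(scale / 2.0, min_scale), 0
    good_steps += 1
    if good_steps >= window:
        # After 400 consecutive clean steps, try a larger scale again.
        return scale * 2.0, 0
    return scale, good_steps

Both logs start from the default initial scale of 2**32 (4294967296) and halve it on each overflow until updates go through around 2**25 (33554432), which is why only the first few dozen steps are skipped.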
Yes, that is right. @tengjiayan20 It recovered after a few training steps:
[2024-09-11 17:52:11,320] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False
[2024-09-11 17:52:18,030] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-11 17:52:32,563] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-09-11 17:52:47,082] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-09-11 17:53:15,865] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-09-11 17:53:58,933] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
[2024-09-11 17:58:33,295] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
[2024-09-11 18:00:42,520] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
[2024-09-11 18:04:05,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=7, lr=[5e-05], mom=[[0.9, 0.95]]
[2024-09-11 18:07:56,739] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432, reducing to 16777216
[2024-09-11 18:16:06,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=8, lr=[5e-05], mom=[[0.9, 0.95]]
[2024-09-11 18:16:06,712] [INFO] [RANK 0] iteration 100/ 10000 | elapsed time per iteration (ms): 14623.5 | learning rate 5.000E-05 | total loss 1.992110E-01 | loss 1.992110E-01 | loss scale 16777216.0 |speed 8.21 samples/(min*GPU)
[2024-09-11 18:16:06,713] [INFO] [RANK 0] after 100 iterations memory (MB) | allocated: 13974.6455078125 | max allocated: 64453.90478515625 | cached: 22772.0 | max cached: 79914.0
[2024-09-11 18:16:06,713] [INFO] [RANK 0] time (ms) | forward: 9524.11 | backward: 5073.59 | allreduce: 0.00 | optimizer: 24.71 | data loader: 67.04
Thanks a lot.