The training is consistently getting stuck and is not proceeding.
[2024-07-15 13:32:09] Preparing for distributed training...
[2024-07-15 13:32:09] Boosting model for distributed training
[2024-07-15 13:32:09] Training for 1000 epochs with 32425 steps per epoch
[2024-07-15 13:32:11] Beginning epoch 0...
Epoch 0:   0%| | 0/32425 [00:00<?, ?it/s]
/mnt/vepfs/zhaomin/anaconda3/envs/ckh/lib/python3.9/site-packages/colossalai/nn/optimizer/nvme_optimizer.py:55: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  numel += p.storage().size()
How long has it been stuck? Can you try reducing the batch size, or adding more intermediate print() calls to confirm that training is actually proceeding?
You can also refer to our training report: https://github.com/hpcaitech/Open-Sora/blob/main/docs/report_03.md#more-data-and-better-multi-stage-training.
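To illustrate the print() suggestion, here is a minimal debugging sketch, assuming a typical PyTorch/ColossalAI training loop. The dataloader, model, optimizer, and booster arguments are placeholders for the objects your script already creates, and debug_first_steps is just an illustrative helper, not part of the repo:

```python
# Minimal sketch (not the actual Open-Sora loop): run a few training steps with
# timestamped prints around each phase so the hanging call can be localized.
import time

def log(msg: str) -> None:
    # flush=True so messages appear immediately even when stdout is buffered
    print(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] {msg}", flush=True)

def debug_first_steps(dataloader, model, optimizer, booster, max_steps=3):
    for step, batch in enumerate(dataloader):
        log(f"step {step}: batch loaded")
        loss = model(**batch)              # replace with your actual forward call
        log(f"step {step}: forward done")
        booster.backward(loss, optimizer)  # or loss.backward() without ColossalAI
        log(f"step {step}: backward done")
        optimizer.step()
        optimizer.zero_grad()
        log(f"step {step}: optimizer step done")
        if step + 1 >= max_steps:          # a few steps is enough to locate the hang
            break
```

If none of these messages appear even for the first step, the hang is most likely happening before the loop (e.g., in data loading or kernel compilation) rather than inside it.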
This issue is stale because it has been open for 7 days with no activity.
The training phase is stuck here:
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.