The training is consistently getting stuck and is not proceeding.

Open gracezhao1997 opened this issue 1 year ago • 3 comments

The training is consistently getting stuck and is not proceeding.

[2024-07-15 13:32:09] Preparing for distributed training...
[2024-07-15 13:32:09] Boosting model for distributed training
[2024-07-15 13:32:09] Training for 1000 epochs with 32425 steps per epoch
[2024-07-15 13:32:11] Beginning epoch 0...
Epoch 0: 0%| | 0/32425 [00:00<?, ?it/s]
/mnt/vepfs/zhaomin/anaconda3/envs/ckh/lib/python3.9/site-packages/colossalai/nn/optimizer/nvme_optimizer.py:55: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  numel += p.storage().size()

gracezhao1997 avatar Jul 15 '24 14:07 gracezhao1997

How long was it stuck? Can you try reducing the batch size, or adding more intermediate print() calls to check whether it is actually proceeding?
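As a rough illustration of the intermediate-print suggestion, here is a minimal sketch, not the Open-Sora training loop. `run_with_progress`, `batches`, and `train_step` are hypothetical names standing in for the real loop, dataloader, and step function. It prints how long each batch fetch and each step takes, and uses the standard-library `faulthandler` module to dump the Python stack if one iteration exceeds a timeout.

```python
import faulthandler
import sys
import time


def run_with_progress(batches, train_step, hang_timeout_s=600.0):
    """Iterate over `batches`, calling `train_step` on each one.

    Prints the time spent fetching each batch and running each step.
    If one iteration takes longer than `hang_timeout_s`, faulthandler
    writes the traceback of every thread to stderr, which shows where
    the process is blocked. Returns the number of completed steps.
    """
    iterator = iter(batches)
    step = 0
    while True:
        # Arm the watchdog for this iteration (fetch + step).
        faulthandler.dump_traceback_later(
            hang_timeout_s, exit=False, file=sys.__stderr__
        )
        try:
            t0 = time.perf_counter()
            try:
                batch = next(iterator)
            except StopIteration:
                break
            t1 = time.perf_counter()
            train_step(batch)
            t2 = time.perf_counter()
        finally:
            # Disarm the watchdog whether the iteration finished or failed.
            faulthandler.cancel_dump_traceback_later()
        print(
            f"[step {step}] data: {t1 - t0:.3f}s, step: {t2 - t1:.3f}s",
            flush=True,
        )
        step += 1
    return step
```

If nothing is printed for step 0 and a traceback appears after the timeout, the top frames of that traceback indicate whether the process is waiting on data loading, a kernel build, or a collective operation.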

For reference, see our training report: https://github.com/hpcaitech/Open-Sora/blob/main/docs/report_03.md#more-data-and-better-multi-stage-training.

JThh avatar Jul 17 '24 00:07 JThh

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar Jul 24 '24 01:07 github-actions[bot]

The training phase is stuck here: [extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
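One thing that may be worth checking when a run stalls at a JIT kernel build is whether an earlier, interrupted build left a lock file behind. PyTorch's `torch.utils.cpp_extension` guards each build directory with a file named `lock`. Whether that is the cause here is not confirmed by this thread, and the cache location is an assumption. The sketch below only lists candidate lock files and their age; `find_lock_files` is a hypothetical helper, and it deletes nothing.

```python
import os
import time


def find_lock_files(cache_root, lock_name="lock"):
    """Return (path, age_in_seconds) for every file named `lock_name`
    found anywhere under `cache_root`, sorted by path.

    A missing `cache_root` yields an empty list.
    """
    now = time.time()
    found = []
    for dirpath, _dirnames, filenames in os.walk(cache_root):
        if lock_name in filenames:
            path = os.path.join(dirpath, lock_name)
            found.append((path, now - os.path.getmtime(path)))
    return sorted(found)
```

For example, `find_lock_files(os.path.expanduser("~/.cache"))` would scan the user cache directory. That path is an assumption, and the actual extension cache used by the installed ColossalAI version may be elsewhere. A lock file much older than the current run, with no other build process active, would be a candidate for a stale lock.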

gracezhao1997 avatar Jul 27 '24 06:07 gracezhao1997

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar Sep 02 '24 01:09 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Sep 11 '24 01:09 github-actions[bot]