The training is consistently getting stuck and is not proceeding.
[2024-07-15 13:32:09] Preparing for distributed training...
[2024-07-15 13:32:09] Boosting model for distributed training
[2024-07-15 13:32:09] Training for 1000 epochs with 32425 steps per epoch
[2024-07-15 13:32:11] Beginning epoch 0...
Epoch 0:   0%| | 0/32425 [00:00<?, ?it/s]
/mnt/vepfs/zhaomin/anaconda3/envs/ckh/lib/python3.9/site-packages/colossalai/nn/optimizer/nvme_optimizer.py:55: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  numel += p.storage().size()
How long has it been stuck? Can you try reducing the batch size, or adding more intermediate print() calls to confirm that training is actually proceeding?
You can also refer to our training report: https://github.com/hpcaitech/Open-Sora/blob/main/docs/report_03.md#more-data-and-better-multi-stage-training.
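To illustrate the print() suggestion, here is a minimal debugging sketch, assuming a typical PyTorch/ColossalAI training loop. The dataloader, model, optimizer, and booster arguments are placeholders for the objects your script already creates, and debug_first_steps is just an illustrative helper, not part of the repo:

```python
# Minimal sketch (not the actual Open-Sora loop): run a few training steps with
# timestamped prints around each phase so the hanging call can be localized.
import time

def log(msg: str) -> None:
    # flush=True so messages appear immediately even when stdout is buffered
    print(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] {msg}", flush=True)

def debug_first_steps(dataloader, model, optimizer, booster, max_steps=3):
    for step, batch in enumerate(dataloader):
        log(f"step {step}: batch loaded")
        loss = model(**batch)              # replace with your actual forward call
        log(f"step {step}: forward done")
        booster.backward(loss, optimizer)  # or loss.backward() without ColossalAI
        log(f"step {step}: backward done")
        optimizer.step()
        optimizer.zero_grad()
        log(f"step {step}: optimizer step done")
        if step + 1 >= max_steps:          # a few steps is enough to locate the hang
            break
```

If none of these messages appear even for the first step, the hang is most likely happening before the loop (e.g., in data loading or kernel compilation) rather than inside it.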
This issue is stale because it has been open for 7 days with no activity.
The training phase is stuck here:
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.