
[bug] IndexError: list index out of range

Open leonardodora opened this issue 1 year ago • 4 comments

[Image: 4091721789744_ pic]

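Hello, we are running multi-node training on our own dataset. When we resume training, it sometimes starts normally and sometimes raises this error. Could you take a look at what the cause might be? Also, with the same pretrained model, putting the checkpoint in the config's load setting raises the error, but putting it in the model's pretrained model setting runs fine. I suspect it is related to the bucket data in use at the step where training stopped.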

leonardodora, Jul 24 '24 03:07

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot], Aug 01 '24 01:08

I suspect the num_frames or resolution of some videos in your training corpus exceeds what is specified in your config, which leads to a bucket 'overflow'.

It would be great if @zhengzangw could kindly provide some intuition.

JThh, Aug 07 '24 04:08

The problem is that when you load from a checkpoint, it also loads the number of bucket batches you have already used. For example, in pretraining, you reach step 100k. However, the fine-tuning dataset is small and does not have that many batches, so the restored position is out of range and this error occurs.

The easiest way to solve it is to pass --start-from-scratch with --load.

https://github.com/hpcaitech/Open-Sora/blob/476b6dc79720e5d9ddfb3cd589680b2308871926/opensora/utils/config_utils.py#L80
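The failure mode described above can be illustrated with a toy example. This is only a sketch under stated assumptions, not Open-Sora's actual sampler code: `ToyBatchSampler`, `last_batch_idx`, and `resume` are hypothetical names, and the `start_from_scratch` argument merely mimics the effect of skipping the saved sampler progress.

```python
from dataclasses import dataclass


@dataclass
class ToyBatchSampler:
    """Hypothetical stand-in for a resumable batch sampler (not Open-Sora's class)."""

    batches: list            # batches computed for the *current* dataset
    last_batch_idx: int = 0  # progress counter that a checkpoint would save

    def state_dict(self):
        return {"last_batch_idx": self.last_batch_idx}

    def load_state_dict(self, state):
        self.last_batch_idx = state["last_batch_idx"]

    def next_batch(self):
        # Raises IndexError when the restored index is beyond the batch list.
        batch = self.batches[self.last_batch_idx]
        self.last_batch_idx += 1
        return batch


def resume(sampler, checkpoint_state, start_from_scratch=False):
    """Restore sampler progress unless the caller asks to start from scratch."""
    if not start_from_scratch:
        sampler.load_state_dict(checkpoint_state)
    return sampler


# "Pretraining": a large dataset, checkpoint saved after 100 batches.
pretrain = ToyBatchSampler(batches=list(range(1000)), last_batch_idx=100)
ckpt = pretrain.state_dict()

# "Fine-tuning": only 10 batches exist, so the saved index 100 is stale.
stale = resume(ToyBatchSampler(batches=list(range(10))), ckpt)
try:
    stale.next_batch()
    failed = False
except IndexError:
    failed = True

# Ignoring the saved progress restarts from the first batch.
fresh = resume(ToyBatchSampler(batches=list(range(10))), ckpt, start_from_scratch=True)
first = fresh.next_batch()
```

In this toy setup the stale index fails, while the run that skips the saved progress begins again at the first batch.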

zhengzangw, Aug 07 '24 05:08

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot], Aug 17 '24 01:08

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot], Aug 24 '24 01:08