Open-Sora: Why is each epoch so fast when I train, as if the data were not being loaded correctly?

[2024-06-29 05:36:29] Beginning epoch 0...
Epoch 0: 0it [00:00, ?it/s]
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-06-29 05:36:30] Building buckets...
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-06-29 05:36:31] Bucket Info:
[2024-06-29 05:36:31] Bucket [#sample, #batch] by aspect ratio: {'0.56': [160, 3]}
[2024-06-29 05:36:31] Image Bucket [#sample, #batch] by HxWxT: {}
[2024-06-29 05:36:31] Video Bucket [#sample, #batch] by HxWxT: {('144p', 51): [160, 3]}
[2024-06-29 05:36:31] #training batch: 3, #training sample: 160, #non empty bucket: 1
[2024-06-29 05:36:31] Beginning epoch 1...
Epoch 1: 0it [00:00, ?it/s]
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
Epoch 1: 0it [00:00, ?it/s]
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.

This is the CSV file (I created a few hundred videos for fine-tuning):

path,text,id,relpath,num_frames,height,width,aspect_ratio,fps,resolution
/home/yy/Open-Sora/clips/sample_0_scene-0.mp4,a dog is running,sample_0_scene-0,sample_0_scene-0.mp4,96.0,144.0,256.0,0.5625,24.0,36864.0
/home/yy/Open-Sora/clips/sample_1_scene-0.mp4,a dog is running,sample_1_scene-0,sample_1_scene-0.mp4,96.0,144.0,256.0,0.5625,24.0,36864.0
/home/yy/Open-Sora/clips/sample_2_scene-0.mp4,a dog is running,sample_2_scene-0,sample_2_scene-0.mp4,96.0,144.0,256.0,0.5625,24.0,36864.0
/home/yy/Open-Sora/clips/sample_3_scene-0.mp4,a dog is running,sample_3_scene-0,sample_3_scene-0.mp4,96.0,144.0,256.0,0.5625,24.0,36864.0

Did I miss something? The training does not seem to have succeeded.

Jun 29 '24 06:06 xbyym

I have the same problem.

Jun 29 '24 12:06 CIntellifusion

> I have the same problem.

The batch was not full. The last incomplete batch is dropped by default (drop_last); changing it to false fixes this.

Jun 30 '24 03:06 xbyym
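
To make the drop_last suggestion above concrete, here is a minimal sketch in plain Python of how the number of batches per epoch changes when an incomplete final batch is dropped. It follows the usual drop_last semantics (as in PyTorch's DataLoader). The helper name num_batches and the batch size of 51 are assumptions for illustration only, and Open-Sora's bucket sampler may count batches differently.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Batches produced in one pass over num_samples items."""
    if drop_last:
        # An incomplete final batch is discarded.
        return num_samples // batch_size
    # An incomplete final batch is kept as a smaller batch.
    return math.ceil(num_samples / batch_size)


# 160 samples, as in the log above; batch size 51 is a hypothetical value.
print(num_batches(160, 51, drop_last=True))   # 3
print(num_batches(160, 51, drop_last=False))  # 4
# Fewer samples than one batch: nothing is yielded when drop_last is on.
print(num_batches(3, 4, drop_last=True))      # 0
print(num_batches(3, 4, drop_last=False))     # 1
```

If an epoch ends after zero iterations, comparing these two counts for each bucket is a quick way to check whether dropped batches could be the cause.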

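> I have the same problem.
>
> The batch was not full. The last incomplete batch is dropped by default (drop_last); changing it to false fixes this.

Thanks, but I have 200 samples and batch_size=4. For now I suspect a mismatch between the bucket and the video resolution.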

Jun 30 '24 08:06 CIntellifusion

This issue is stale because it has been open for 7 days with no activity.

Jul 08 '24 01:07 github-actions[bot]

@xbyym You could insert print(batch) below this line and see what it prints: https://github.com/hpcaitech/Open-Sora/blob/main/scripts/train.py#L264

Jul 10 '24 02:07 FrankLeeeee

@xbyym The data-loading stage filters samples, so you need to set bucket_config according to the distribution of your own data. Code: https://github.com/hpcaitech/Open-Sora/blob/main/opensora/datasets/sampler.py#L200-L207
Helper function: https://github.com/hpcaitech/Open-Sora/blob/main/opensora/datasets/bucket.py#L74-L120

Jul 10 '24 13:07 AlphaNext
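
As a rough illustration of why bucket-based filtering can silently drop samples, the sketch below assigns a clip to a bucket only if it has enough pixels and enough frames. This is a simplified, hypothetical model, not the code at the links above: the names BUCKETS, PIXELS and assign_bucket, the 240p size and the matching rule are all assumptions.

```python
# Hypothetical bucket list: (resolution name, required number of frames).
# Illustrative only; this is not Open-Sora's actual bucket_config format.
BUCKETS = [("144p", 51), ("240p", 51)]

# Assumed minimum pixel count (height * width) for each resolution name.
PIXELS = {"144p": 144 * 256, "240p": 240 * 426}


def assign_bucket(num_frames, height, width):
    """Return the largest bucket the clip can fill, or None if no bucket fits."""
    best = None
    for res, frames in BUCKETS:
        fits = height * width >= PIXELS[res] and num_frames >= frames
        if fits and (best is None or PIXELS[res] > PIXELS[best[0]]):
            best = (res, frames)
    return best


print(assign_bucket(96, 144, 256))  # ('144p', 51)
print(assign_bucket(30, 144, 256))  # None: too few frames for any bucket
```

Under this model, a dataset whose clips are all shorter or smaller than every configured bucket would produce no batches at all, which is why the bucket settings need to match the data.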

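> @xbyym The data-loading stage filters samples, so you need to set bucket_config according to the distribution of your own data. Code: https://github.com/hpcaitech/Open-Sora/blob/main/opensora/datasets/sampler.py#L200-L207 Helper function: https://github.com/hpcaitech/Open-Sora/blob/main/opensora/datasets/bucket.py#L74-L120

Thanks~ Is there any documentation on how to configure the buckets? One problem I ran into: videos with fewer than 51 frames raise an error, and I don't know how to skip them or what to configure.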

Jul 12 '24 04:07 CIntellifusion
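
One possible workaround for the short-clip error, which nobody in this thread has confirmed, is to filter the metadata CSV before training so that only clips with enough frames remain. The sketch below uses the num_frames column from the CSV shown in the question. The function name keep_long_enough, the sample rows and the threshold of 51 are assumptions, and the real requirement may be higher if the loader samples frames at an interval.

```python
import csv
import io

# Rows in the metadata format shown in the question (columns trimmed);
# the second row is a made-up short clip added for illustration.
SAMPLE_CSV = """path,text,num_frames,height,width
clips/sample_0_scene-0.mp4,a dog is running,96.0,144.0,256.0
clips/short_clip.mp4,a dog is running,30.0,144.0,256.0
"""


def keep_long_enough(rows, min_frames=51):
    """Keep rows whose num_frames column is at least min_frames."""
    return [row for row in rows if float(row["num_frames"]) >= min_frames]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
kept = keep_long_enough(rows)
print([row["path"] for row in kept])
```

The filtered rows can then be written back with csv.DictWriter and passed to training in place of the original file.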

This issue is stale because it has been open for 7 days with no activity.

Jul 20 '24 01:07 github-actions[bot]

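> @xbyym You could insert print(batch) below this line and see what it prints: https://github.com/hpcaitech/Open-Sora/blob/main/scripts/train.py#L264

I ran into the same problem. In most epochs print(batch) shows no data at all, and only a very few epochs train normally. Why is that?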

Jul 24 '24 02:07 281LinChenjian

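Why does my outputs directory contain only two logs and one tensorboard file? Where was the model saved? It does not seem to have overwritten the ckpts I specified on the command line either.

Is my --ckpt-path configured incorrectly?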

torchrun --standalone --nproc_per_node 4 scripts/train.py configs/opensora-v1-2/train/stage1.py --data-path /data02/Open-Sora/datasets0/webvid-10M/data_train_partitions_0000_100/meta/meta_clips_caption1.csv --ckpt-path /data02/Open-Sora/ckpts/PixArt-Sigma-XL-2-2K-MS.pth

Aug 02 '24 08:08 layupgoat

Hello, could you provide an image for the data-processing step? I have tried several times and still cannot get it to run.

Sep 24 '24 09:09 AAwilliam

This issue is stale because it has been open for 7 days with no activity.

Oct 02 '24 01:10 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

Oct 10 '24 01:10 github-actions[bot]
