Why does each epoch finish so quickly during training, as if the data were not being loaded correctly?

Open xbyym opened this issue 1 year ago • 10 comments

[2024-06-29 05:36:29] Beginning epoch 0...
Epoch 0: 0it [00:00, ?it/s]
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-06-29 05:36:30] Building buckets...
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-06-29 05:36:31] Bucket Info:
[2024-06-29 05:36:31] Bucket [#sample, #batch] by aspect ratio: {'0.56': [160, 3]}
[2024-06-29 05:36:31] Image Bucket [#sample, #batch] by HxWxT: {}
[2024-06-29 05:36:31] Video Bucket [#sample, #batch] by HxWxT: {('144p', 51): [160, 3]}
[2024-06-29 05:36:31] #training batch: 3, #training sample: 160, #non empty bucket: 1
[2024-06-29 05:36:31] Beginning epoch 1...
Epoch 1: 0it [00:00, ?it/s]INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
Epoch 1: 0it [00:00, ?it/s]
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.

This is the CSV file (I created a few hundred videos for fine-tuning):
path,text,id,relpath,num_frames,height,width,aspect_ratio,fps,resolution
/home/yy/Open-Sora/clips/sample_0_scene-0.mp4,a dog is running,sample_0_scene-0,sample_0_scene-0.mp4,96.0,144.0,256.0,0.5625,24.0,36864.0
/home/yy/Open-Sora/clips/sample_1_scene-0.mp4,a dog is running,sample_1_scene-0,sample_1_scene-0.mp4,96.0,144.0,256.0,0.5625,24.0,36864.0
/home/yy/Open-Sora/clips/sample_2_scene-0.mp4,a dog is running,sample_2_scene-0,sample_2_scene-0.mp4,96.0,144.0,256.0,0.5625,24.0,36864.0
/home/yy/Open-Sora/clips/sample_3_scene-0.mp4,a dog is running,sample_3_scene-0,sample_3_scene-0.mp4,96.0,144.0,256.0,0.5625,24.0,36864.0

Did I miss something? The training does not seem to have succeeded.

xbyym avatar Jun 29 '24 06:06 xbyym

I have the same problem.

CIntellifusion avatar Jun 29 '24 12:06 CIntellifusion

I have the same problem.

The batch was not full, and drop_last discards the last incomplete batch by default. Changing it to false fixed it.
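
The effect of drop_last can be checked with simple arithmetic. The sketch below assumes the usual PyTorch DataLoader semantics, where an incomplete final batch is discarded when drop_last is true. `count_batches` is a hypothetical helper for illustration only; it is not part of Open-Sora, and the bucket sampler there may count batches differently.

```python
import math


def count_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches produced by one pass over num_samples items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last:
        # The incomplete trailing batch is thrown away.
        return num_samples // batch_size
    # The incomplete trailing batch is kept as a smaller batch.
    return math.ceil(num_samples / batch_size)


# With fewer samples than one full batch, drop_last=True leaves nothing to train on.
print(count_batches(3, 4, drop_last=True))   # 0
print(count_batches(3, 4, drop_last=False))  # 1
```

If a bucket (or a rank's share of it) holds fewer samples than one batch, an epoch can end immediately with drop_last enabled, which would look like the `0it` progress bar in the log.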

xbyym avatar Jun 30 '24 03:06 xbyym

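I have the same problem.

The batch was not full, and drop_last discards the last incomplete batch by default. Changing it to false fixed it.

Thanks, but I have two hundred samples and batch_size=4. My current suspicion is a mismatch between the bucket settings and the video resolution.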

CIntellifusion avatar Jun 30 '24 08:06 CIntellifusion

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar Jul 08 '24 01:07 github-actions[bot]

@xbyym You can insert print(batch) below this line and take a look: https://github.com/hpcaitech/Open-Sora/blob/main/scripts/train.py#L264

FrankLeeeee avatar Jul 10 '24 02:07 FrankLeeeee

@xbyym The data loading stage applies filtering, so you need to set bucket_config according to the distribution of your own data. Code: https://github.com/hpcaitech/Open-Sora/blob/main/opensora/datasets/sampler.py#L200-L207
Helper function: https://github.com/hpcaitech/Open-Sora/blob/main/opensora/datasets/bucket.py#L74-L120
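
As a rough illustration of why samples can disappear at this stage, the sketch below drops any video that has fewer frames than every frame count configured for its resolution. The config layout (resolution name mapped to a list of frame counts) and the `pick_bucket` helper are assumptions made for this example; the linked sampler.py and bucket.py remain the authoritative logic.

```python
from typing import Optional

# Hypothetical, simplified bucket config: resolution name -> allowed frame counts.
BUCKET_CONFIG = {"144p": [51, 102]}


def pick_bucket(
    resolution: str, num_frames: int, config: dict[str, list[int]]
) -> Optional[tuple[str, int]]:
    """Return the largest frame-count bucket the video can fill, or None if it is filtered out."""
    frame_options = config.get(resolution, [])
    fitting = [t for t in frame_options if t <= num_frames]
    if not fitting:
        # No bucket matches this sample, so it never reaches a batch.
        return None
    return (resolution, max(fitting))


print(pick_bucket("144p", 96, BUCKET_CONFIG))  # ('144p', 51)
print(pick_bucket("144p", 30, BUCKET_CONFIG))  # None
print(pick_bucket("720p", 96, BUCKET_CONFIG))  # None
```

Under these assumptions, a 96-frame 144p clip lands in the ('144p', 51) bucket, matching the bucket name in the log above, while shorter clips or unconfigured resolutions are silently dropped.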

AlphaNext avatar Jul 10 '24 13:07 AlphaNext

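@xbyym The data loading stage applies filtering, so you need to set bucket_config according to the distribution of your own data. Code: https://github.com/hpcaitech/Open-Sora/blob/main/opensora/datasets/sampler.py#L200-L207 Helper function: https://github.com/hpcaitech/Open-Sora/blob/main/opensora/datasets/bucket.py#L74-L120

Thanks~ Is there any documentation on how to configure the buckets? One problem I ran into is that videos with fewer than 51 frames raise an error, and I don't know how to skip them or what to configure.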
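
One way to sidestep short clips, assuming the metadata CSV has a num_frames column as in the file posted at the top of this issue, is to filter them out before training. This is a minimal sketch: `filter_min_frames` is a hypothetical helper, and whether Open-Sora offers a built-in option for this is not confirmed here.

```python
import csv
import io


def filter_min_frames(csv_text: str, min_frames: int) -> str:
    """Return CSV text containing only rows whose num_frames >= min_frames."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # num_frames is stored as a float string such as "96.0" in the metadata.
        if float(row["num_frames"]) >= min_frames:
            writer.writerow(row)
    return out.getvalue()


# Tiny made-up sample with the same column style as the real metadata file.
sample = (
    "path,text,num_frames\n"
    "a.mp4,a dog is running,96.0\n"
    "b.mp4,a dog is running,40.0\n"
)
print(filter_min_frames(sample, 51))
```

The filtered text can then be written to a new CSV file and passed to training in place of the original metadata.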

CIntellifusion avatar Jul 12 '24 04:07 CIntellifusion

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar Jul 20 '24 01:07 github-actions[bot]

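@xbyym You can insert print(batch) below this line and take a look: https://github.com/hpcaitech/Open-Sora/blob/main/scripts/train.py#L264

I ran into the same problem. In most epochs, print(batch) shows no data at all, and only a few epochs train normally. Why is that?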

281LinChenjian avatar Jul 24 '24 02:07 281LinChenjian

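Why does my outputs directory contain only two logs and one tensorboard? Where is the model saved? It does not seem to have overwritten the ckpts I specified in the arguments either.

Is my --ckpt-path configured incorrectly?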

torchrun --standalone --nproc_per_node 4 scripts/train.py configs/opensora-v1-2/train/stage1.py --data-path /data02/Open-Sora/datasets0/webvid-10M/data_train_partitions_0000_100/meta/meta_clips_caption1.csv --ckpt-path /data02/Open-Sora/ckpts/PixArt-Sigma-XL-2-2K-MS.pth

layupgoat avatar Aug 02 '24 08:08 layupgoat

Hello, could you share a ready-made image for data processing? I have tried several times and still cannot get it to run.

AAwilliam avatar Sep 24 '24 09:09 AAwilliam

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar Oct 02 '24 01:10 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Oct 10 '24 01:10 github-actions[bot]