
Finetuning CogVideoX-t2v-5B takes a very long time, even on 8xH100 GPUs

Open xabirizar9 opened this issue 8 months ago • 7 comments

Hi,

I'm trying to finetune CogVideoX-t2v-5B using LoRA with a DDP strategy on 8 H100s, but it's taking a very long time.

I'm running the train_ddp_t2v script. I've followed the best practices for the data and tried increasing the batch size and num_workers, without success. Increasing the batch size beyond 2 results in OOM errors.

I'm using a filtered version of the OpenVid-1M dataset, with the goal of finetuning on approx. 250k samples. For initial testing, I tried with just 70 samples, which took 30 minutes to finetune for 10 epochs, and then increased to 5k samples, which would have taken around 38h (I didn't complete the training because this seemed too long). I was initially feeding at max resolution @ 81x768x1360, and lowering the resolution to 81x480x720 brought the training down to ~8h. Additionally, I observe that the loss with the given finetuning script (without modifying batch size or learning rate) is noisy and doesn't go down.

It seems like one of the main bottlenecks is the initial data preprocessing. The GPUs appear to be pretty lightly loaded: each GPU loads only one sample at a time and takes a few seconds to process it, so at best I can preprocess 16 samples per minute @ 81x768x1360. At this rate, it would take 11 days to preprocess all 250k samples, which is very long just for the preprocessing, let alone the finetuning afterwards.
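For reference, the kind of batched pre-caching I would expect to help looks roughly like the sketch below. This is a rough sketch only: `VideoFolderDataset` and the paths are hypothetical, and it assumes diffusers' `AutoencoderKLCogVideoX` plus clips already decoded/resized to (C, F, H, W) tensors in [-1, 1]; it is not the repo's actual caching code.

```python
# Rough sketch only: cache video latents in batches rather than one clip at a time.
# VideoFolderDataset and the paths are hypothetical; assumes diffusers'
# AutoencoderKLCogVideoX and clips already decoded/resized to (C, F, H, W) in [-1, 1].
import os
import torch
from torch.utils.data import DataLoader
from diffusers import AutoencoderKLCogVideoX

device = "cuda"
vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="vae", torch_dtype=torch.bfloat16
).to(device)
vae.requires_grad_(False)

loader = DataLoader(
    VideoFolderDataset("filtered_openvid/"),  # hypothetical dataset yielding (name, video)
    batch_size=4,       # several clips per VAE forward pass instead of one
    num_workers=8,      # overlap CPU video decoding with GPU encoding
    pin_memory=True,
)

os.makedirs("latent_cache", exist_ok=True)
with torch.no_grad():
    for names, videos in loader:                       # videos: (B, C, F, H, W)
        videos = videos.to(device, dtype=torch.bfloat16)
        latents = vae.encode(videos).latent_dist.sample()
        latents = latents * vae.config.scaling_factor
        for name, lat in zip(names, latents):
            torch.save(lat.cpu(), f"latent_cache/{name}.pt")
```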

Using LoRA we're not even updating that many weights, so is there something I'm getting fundamentally wrong? Is there some way to speed up preprocessing, and consequently, finetuning? Or is this just how long it takes to fine-tune this kind of model?

Would appreciate any pointers in the right direction, thank you!

xabirizar9 avatar Apr 22 '25 07:04 xabirizar9

In our tests, with a resolution of 81x768x1360, the batch size indeed cannot exceed 2, because for video generation models the sequence length far exceeds that of typical image or text models.
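For a rough sense of scale, back-of-the-envelope only and assuming the usual CogVideoX factors (4x temporal and 8x spatial VAE compression, 2x2 patches; please double-check against the model config):

```python
# Back-of-the-envelope token count for a single 81x768x1360 clip,
# assuming 4x temporal / 8x spatial VAE compression and 2x2 patchification.
frames, height, width = 81, 768, 1360
latent_frames = (frames - 1) // 4 + 1                 # 21
latent_h, latent_w = height // 8, width // 8          # 96 x 170
tokens = latent_frames * (latent_h // 2) * (latent_w // 2)
print(tokens)                                         # 21 * 48 * 85 = 85,680 video tokens
```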

Additionally, LoRA does not inherently reduce the computational load during training. For the scenario you mentioned—70 samples over 10 epochs, totaling 700 samples at 81x768x1360 resolution—requiring 30 minutes for fine-tuning is considered normal.

OleehyO avatar Apr 23 '25 02:04 OleehyO

When fine-tuning CogVideoX-5B on my own dataset, I've also encountered the same problem where the loss is noisy and doesn't go down. Have you discovered what the issue might be?

123lcy123 avatar Apr 23 '25 07:04 123lcy123

Thanks for your answer @OleehyO. If I wanted to conduct larger-scale finetuning, say with 150k samples, is this just a cost I have to accept, i.e. that at that resolution it would take roughly 1000 hours? I noticed that lowering the resolution reduces training time by about 7x, which seems much more manageable. Can you provide more insight into this please? Thank you!

Same question goes for the initial caching of prompts/video latents.

xabirizar9 avatar Apr 23 '25 10:04 xabirizar9

Same problem! The loss is noisy and doesn't go down. I tried adjusting the learning rate (1e-4, 2e-5), the scheduler (cosine with restarts, constant), and the batch size (32->64), but none of them helped. Is there anyone who can help?

I tried full finetuning on my own dataset (~100k samples) with cos_with_restart, learning rate 1e-4, 48,000 steps, batch size 32. The loss keeps oscillating between 0.1 and 0.5.

OwalnutO avatar Apr 24 '25 01:04 OwalnutO

@123lcy123 @OwalnutO, does the loss fluctuate continuously from the start to the end of training? Or does it decrease at the beginning and then fluctuate within a large range? If it's the latter, I think it's normal, but if the loss oscillates between 0.1~0.5, that's indeed a bit strange because it's quite large. It's recommended to expand the dataset or further increase the batch size.
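One quick way to tell the two cases apart is to look at an exponentially smoothed loss next to the raw per-step value; a generic sketch (not tied to our training script):

```python
# Generic sketch: EMA-smoothed loss to separate a real trend from per-step noise.
def log_smoothed(loss_values, beta=0.98, every=100):
    """loss_values: iterable of raw per-step training losses."""
    ema = None
    for step, loss in enumerate(loss_values):
        ema = loss if ema is None else beta * ema + (1 - beta) * loss
        if step % every == 0:
            print(f"step {step}: raw {loss:.4f}  ema {ema:.4f}")
    return ema
```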

OleehyO avatar Apr 24 '25 04:04 OleehyO

Thanks for your answer @OleehyO. If I wanted to conduct larger-scale finetuning, say with 150k samples, is this just a cost I have to accept, i.e. that at that resolution it would take roughly 1000 hours? I noticed that lowering the resolution reduces training time by about 7x, which seems much more manageable. Can you provide more insight into this please? Thank you!

The same question applies to the initial caching of prompts/video latents.

For larger-scale training, it is recommended to use more professional training frameworks like Megatron, as our provided training scripts are only intended for small-scale fine-tuning. Training on large-scale datasets may require a very long time.

After reducing from 81x768x1360 to 81x480x720, the sequence length becomes about 1/3 of the original, so the computational load for the attention module (O(n^2)) theoretically drops to about 1/9 of the original, while other modules drop to about 1/3; a 7-8x reduction in time is therefore considered normal. The same logic applies to reducing the number of video frames. However, for large-scale training, it is still recommended to use more professional frameworks.
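The rough arithmetic behind that estimate (same assumed compression factors as in the earlier back-of-the-envelope calculation):

```python
# Rough cost comparison for 81x768x1360 vs 81x480x720, assuming
# 4x temporal / 8x spatial compression and 2x2 patches as before.
def num_tokens(frames, h, w):
    return ((frames - 1) // 4 + 1) * (h // 8 // 2) * (w // 8 // 2)

hi_res = num_tokens(81, 768, 1360)   # ~85,680 tokens
lo_res = num_tokens(81, 480, 720)    # ~28,350 tokens
print(hi_res / lo_res)               # ~3.0x fewer tokens
print((hi_res / lo_res) ** 2)        # ~9x less attention compute (the O(n^2) part)
```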

OleehyO avatar Apr 24 '25 04:04 OleehyO

@OleehyO When reducing the resolution, did you use the CogVideoX-1.0-t2v SAT version? According to the README.md, CogVideoX-1.5 cannot support resolutions lower than 768. Can you provide any insights? Also, if CogVideoX-1.5 cannot support resolutions lower than 768, should I use the git tag 1.0 as your collaborator @zRzRzRzRzRzRzR suggested in an earlier issue:

antragoudaras avatar Jul 22 '25 14:07 antragoudaras