Comparison to ARLDM
I noticed that your paper compares StoryGen with ARLDM. I'm also using an NVIDIA GeForce RTX 3090, but I frequently encounter "CUDA out of memory" errors when training ARLDM. I was wondering if you could kindly share how you managed the memory issue. Did you apply any specific optimizations or adjustments to the model?
It has been quite a while since I trained and tested these baselines, so I may have forgotten some of the details. However, I remember that we adopted the official AR-LDM code and used common optimization libraries such as accelerate and xformers. In addition, for the frozen components (the VAE, and possibly the text encoder, though I don't recall exactly whether it is frozen), I would suggest performing feature extraction in advance so that you don't need to load those parameters at all, which can save a significant amount of GPU memory during training. See the sketch below for the general idea.
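A minimal sketch of the pre-extraction idea, assuming a Hugging Face diffusers-style VAE and CLIP text encoder; the model names, file layout, and the `cache_sample` helper are illustrative, not the exact ones used for the AR-LDM baseline:

```python
# Sketch: encode each (image, caption) pair once, offline, with the frozen
# encoders, so their weights never need to occupy GPU memory during training.
import torch
from diffusers import AutoencoderKL
from transformers import CLIPTokenizer, CLIPTextModel

device = "cuda"
base = "runwayml/stable-diffusion-v1-5"  # assumed base checkpoint

vae = AutoencoderKL.from_pretrained(base, subfolder="vae").to(device).eval()
tokenizer = CLIPTokenizer.from_pretrained(base, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(
    base, subfolder="text_encoder"
).to(device).eval()

@torch.no_grad()
def cache_sample(image, caption, out_path):
    """Encode one sample and save the features to disk (hypothetical helper)."""
    # image: float tensor in [-1, 1], shape (3, H, W)
    latents = vae.encode(image.unsqueeze(0).to(device)).latent_dist.sample()
    latents = latents * vae.config.scaling_factor  # 0.18215 for SD v1.x

    tokens = tokenizer(
        caption,
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).to(device)
    text_emb = text_encoder(tokens.input_ids)[0]

    torch.save({"latents": latents.cpu(), "text_emb": text_emb.cpu()}, out_path)

# At training time, the dataset just loads these cached tensors, so the VAE
# and text encoder are never instantiated on the GPU.
```

On the trainable side, if you are using a diffusers UNet, calling `unet.enable_xformers_memory_efficient_attention()` and training with accelerate (mixed precision, gradient accumulation) should further reduce memory; whether that alone is enough to fit AR-LDM on a single 3090 may depend on your batch size and resolution.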