Can an NVIDIA RTX 4090 (24GB VRAM) run Open-Sora for training or inference?
### Question
Is it possible to run the Open-Sora 11B model (text-to-video generation) on a single NVIDIA RTX 4090 (24GB VRAM) GPU for either training or inference? If so, what optimizations (e.g., quantization, offloading, parallelization) or parameter adjustments are required?
### Context
From previous benchmarks (e.g., H100/H800 tests with colossalai):

- 256x256 resolution inference requires ~52.5GB VRAM on 1x GPU.
- 768x768 resolution inference requires ~60.3GB VRAM on 1x GPU.

Given the RTX 4090's 24GB VRAM limit, I'm seeking guidance on whether Open-Sora can be adapted for this hardware.
### Key Details

**Use Case:**
- [ ] Training (fine-tuning)
- [x] Inference (text-to-video generation)

Target resolution: 128x128 or 256x256.

**Optimizations Considered:**
- 4-bit/8-bit quantization (via bitsandbytes or GPTQ)
- Gradient checkpointing
- Offloading parameters to CPU RAM
- Tensor/sequence parallelism (multi-GPU if feasible)

**Environment:**
- GPU: NVIDIA RTX 4090 (24GB)
- CUDA: 12.x
- Framework: PyTorch 2.x

### Request

- Are there official configurations or workarounds to run Open-Sora on 24GB VRAM?
- Any recommended parameters (e.g., reduced diffusion steps, batch size = 1)?
- Does the codebase support mixed-precision inference or model sharding for consumer GPUs?

### Additional Notes

- Willing to trade off video quality/speed for lower VRAM usage.
- Open to modifying code/hyperparameters if needed; a rough sketch of the mixed-precision + offloading route I have in mind is below.
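For illustration, this is the kind of mixed-precision + CPU-offloading setup I have in mind — a minimal sketch only, using a tiny stand-in module in place of the real 11B backbone (the actual Open-Sora loaders and forward signature will differ):

```python
# Unofficial sketch, not an Open-Sora recipe. TinyBackbone is a stand-in
# for the real 11B denoising transformer so the offload mechanics run end to end.
import torch
import torch.nn as nn
from accelerate import cpu_offload

class TinyBackbone(nn.Module):
    """Stand-in for the diffusion backbone (the real model is an 11B transformer)."""
    def __init__(self, dim=4096, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)) for _ in range(depth)]
        )

    def forward(self, x):
        for blk in self.blocks:
            x = x + blk(x)
        return x

model = TinyBackbone().to(torch.bfloat16)                    # bf16 halves weight memory vs fp32
cpu_offload(model, execution_device=torch.device("cuda:0"))  # weights stay in CPU RAM, streamed per-module

latents = torch.randn(1, 1024, 4096, dtype=torch.bfloat16, device="cuda:0")  # stand-in latent tokens
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    out = model(latents)  # in the real pipeline this call sits inside the scheduler's denoising loop
```

With sequential offload only the currently executing submodule's weights are on the GPU, so peak VRAM is dominated by activations rather than the 11B parameters, at the cost of heavy PCIe traffic on every diffusion step.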
Currently, the RTX 4090 cannot satisfy the memory requirements for either training or inference; the most promising workaround would be quantization. There are no official configurations or workarounds for running Open-Sora on consumer-grade GPUs at the moment.
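If you want to experiment anyway, an unofficial sketch of swapping the backbone's `nn.Linear` layers for bitsandbytes 8-bit layers could look like this (demoed on a small stand-in model; for Open-Sora you would pass the loaded 11B backbone, and this assumes most of its parameters sit in linear layers):

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb

def quantize_linears_8bit(module: nn.Module) -> nn.Module:
    """Recursively swap nn.Linear for bnb.nn.Linear8bitLt; int8 conversion happens on .cuda()."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            qlin = bnb.nn.Linear8bitLt(
                child.in_features, child.out_features,
                bias=child.bias is not None,
                has_fp16_weights=False,  # store int8 weights, mixed int8/fp16 matmul at runtime
            )
            qlin.weight = bnb.nn.Int8Params(child.weight.data, requires_grad=False)
            if child.bias is not None:
                qlin.bias = child.bias
            setattr(module, name, qlin)
        else:
            quantize_linears_8bit(child)
    return module

# Small stand-in model for the demo; replace with the real backbone.
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).half()
model = quantize_linears_8bit(model).cuda()  # weights are quantized to int8 as they move to the GPU

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
with torch.inference_mode():
    y = model(x)
```

Back-of-the-envelope: 11B parameters at 1 byte each is ~11GB of weights (vs. ~22GB in fp16), so int8 weights alone would fit in 24GB, but activations at 256x256 plus the text encoder and VAE would still need offloading, and there is no guarantee output quality survives without calibration.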
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.