DiffSynth-Studio [feature request] add pre-calculating latent / text encoder outputs

precalculating the text encoder embeddings can reduce vram usage, since the text encoders only need to be loaded while the dataset is being preprocessed. the same applies to the vae, so the only thing that needs to be loaded during training is the main diffusion model (unet / dit). one way to do it: write a modified metadata.csv that includes, for each video/image name, the path to its text encoder embedding and the path to its latent, so the trainer can find them
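a rough sketch of the metadata.csv idea, to make it concrete. everything here is an assumption for illustration: the `file_name` column, the `latent_path` / `embed_path` column names, the `.pt` extension and the cache folder layout are hypothetical, not the repo's actual format.

```python
import csv
import io
from pathlib import Path

# Hypothetical names for the two extra columns the trainer would read.
EXTRA_COLUMNS = ["latent_path", "embed_path"]


def add_cache_columns(rows, cache_dir, name_key="file_name"):
    """Return copies of the metadata rows with cache-path columns added.

    Paths are derived from the file stem, so two files that share a stem
    (e.g. a/clip.mp4 and b/clip.mp4) would collide in this simple layout.
    """
    cache_dir = Path(cache_dir)
    out = []
    for row in rows:
        stem = Path(row[name_key]).stem
        new_row = dict(row)
        new_row["latent_path"] = (cache_dir / "latents" / f"{stem}.pt").as_posix()
        new_row["embed_path"] = (cache_dir / "embeds" / f"{stem}.pt").as_posix()
        out.append(new_row)
    return out


def rewrite_metadata(csv_text, cache_dir, name_key="file_name"):
    """Take metadata.csv content as text and return it with the extra columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or []) + EXTRA_COLUMNS
    rows = add_cache_columns(list(reader), cache_dir, name_key)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(rewrite_metadata("file_name,text\nclip_001.mp4,a cat\n", "cache"))
```

the preprocessing pass would then save one latent and one embedding per row at those paths, and the trainer would only read the two new columns instead of loading the vae / text encoder.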

this should apply to all models, so it can benefit the entire repo (it notably helps the models that use t5-xxl / umt5-xxl, since those are very large text encoders). the only drawback could be the lack of dynamic tag-based dropout, but whole-caption dropout could still work by having a precalculated empty-string embedding
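a minimal sketch of the whole-caption dropout part, assuming the empty-string embedding was cached once during preprocessing. the function name and the `dropout_prob` parameter are hypothetical, and the embeddings are treated as opaque objects (in practice they would be tensors loaded from the cache).

```python
import random


def pick_text_embedding(cached_embed, empty_embed, dropout_prob, rng=random):
    """Whole-caption dropout using only cached embeddings.

    With probability dropout_prob, return the precalculated empty-string
    embedding instead of the sample's own embedding, so no text encoder
    has to be loaded during training.
    """
    if rng.random() < dropout_prob:
        return empty_embed
    return cached_embed


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    picks = [pick_text_embedding("caption", "empty", 0.1, rng) for _ in range(1000)]
    print(picks.count("empty") / len(picks))  # should land near 0.1
```

tag-level dropout can't be done this way, since dropping individual tags changes the prompt and would need the text encoder again (or one cached embedding per tag variant).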

Nov 22 '25 22:11 yoinked-h

you may want to find it here. I’m not sure about other models, but for Qwen, it significantly helps reduce VRAM usage.

Nov 29 '25 16:11 hungnguyen2611

yep, that looks exactly like what i mentioned; i'll try and see how adaptable it is to other models (notably wan), but it seems like a good base

Nov 29 '25 23:11 yoinked-h