An idea for partial sequential CPU offloading

Open rodjjo opened this issue 1 month ago • 0 comments

Just an idea.

It's not a problem or anything...

I've been using a custom offload scheme for my potato GPU. Maybe there is already another way to do this.

In short, I've been using sequential offloading for a long time. When I enable it, it uses a minimal amount of VRAM. However, I know it could use more VRAM to do less I/O, so I created a mixin for partial CPU offload, where the model keeps several layers on the GPU and offloads only the rest.
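
To put a rough number on the trade-off, here is a minimal sketch that counts how many layers have to be copied from CPU to GPU in one forward pass. It is not taken from the mixin: the function name `weight_transfers_per_forward` is made up for illustration, the 30-layer model is hypothetical, and it assumes the kept layers simply stay resident while every other layer is uploaded right before it runs.

```python
def weight_transfers_per_forward(num_layers: int, layers_keep_gpu: int) -> int:
    """Count the layers copied CPU -> GPU during one forward pass.

    Assumption: `layers_keep_gpu` layers stay resident on the GPU, and every
    remaining layer is uploaded just before it runs and evicted afterwards.
    """
    if num_layers < 0 or layers_keep_gpu < 0:
        raise ValueError("counts must be non-negative")
    resident = min(layers_keep_gpu, num_layers)
    return num_layers - resident


# Hypothetical 30-layer model, used only to illustrate the arithmetic.
full_sequential = weight_transfers_per_forward(30, 0)   # nothing stays resident
partial = weight_transfers_per_forward(30, 22)          # 22 layers stay resident
print(full_sequential, partial)
```

Under these assumptions, keeping 22 of 30 layers resident cuts the per-pass uploads from 30 layers to 8, at the cost of the VRAM those 22 layers occupy. In a denoising loop the saving repeats on every step.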

The mixin code is here: https://gist.github.com/rodjjo/20e2e842fea9ed58114adb560a4566b6

    import torch
    from transformers import Qwen3ForCausalLM
    from diffusers import ZImageTransformer2DModel
    # PartialOffloadMixin is defined in the gist linked above.

    # Example text encoder:
    class MyQwen3ForCausalLM(Qwen3ForCausalLM, PartialOffloadMixin):
        LAYERS_KEEP_GPU = 22  # number of layers kept on the GPU
        MODEL_ATTR_NAME = "model"
        MODEL_LAYERS_ATTR_NAME = "layers"
        OFFLOAD_ON_CALL = True

    model = MyQwen3ForCausalLM.from_pretrained(
        repo_id,
        subfolder="text_encoder",
        local_files_only=True,
        torch_dtype=torch.bfloat16,
    )
    model.eval()
    model.enable_partial_cpu_offload()

    # Pseudocode for inference: the call is overridden and wraps the
    # forward pass in go_gpu(True) ... go_gpu(False).
    result = model(...)

    # Example transformer:
    class MyZImageTransformer(ZImageTransformer2DModel, PartialOffloadMixin):
        MODEL_LAYERS_ATTR_NAME = "layers"
        LAYERS_KEEP_GPU = 22

    model = MyZImageTransformer.from_pretrained(
        repo_id,
        subfolder="transformer",
        torch_dtype=torch.bfloat16,
    )
    model.eval()
    model.enable_partial_cpu_offload()

    # Denoise step: move to the GPU once, run every step, then move back.
    model.go_gpu(True)
    while denoising:  # pseudocode
        predicted = model(...)
    model.go_gpu(False)
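
For anyone who wants the general shape without opening the gist, below is a minimal sketch of one way such a forward loop could work. It is not the gist's `PartialOffloadMixin`: the class `PartialOffloadRunner` is hypothetical, it only borrows the `go_gpu` name from the usage above, and it assumes the first `layers_keep_gpu` layers are the resident ones. Layers only need a `.to(device)` method and to be callable, which `torch.nn.Module` provides, so the sketch itself does not import torch.

```python
class PartialOffloadRunner:
    """Hypothetical helper illustrating partial sequential offloading."""

    def __init__(self, layers, layers_keep_gpu, device="cuda"):
        self.layers = list(layers)
        # Assumption: the first `keep` layers are the ones kept on the GPU.
        self.keep = min(layers_keep_gpu, len(self.layers))
        self.device = device

    def go_gpu(self, on_gpu):
        """Move the resident layers to the GPU (True) or back to the CPU (False)."""
        target = self.device if on_gpu else "cpu"
        for layer in self.layers[: self.keep]:
            layer.to(target)

    def forward(self, x):
        """Run all layers in order, uploading offloaded layers one at a time."""
        for index, layer in enumerate(self.layers):
            if index < self.keep:
                x = layer(x)  # already resident after go_gpu(True)
                continue
            layer.to(self.device)  # upload just before use
            try:
                x = layer(x)
            finally:
                layer.to("cpu")  # evict right after use to free VRAM
        return x
```

Usage would follow the denoising pattern above: call `go_gpu(True)` once, run `forward` for each step with the input already on the target device, then call `go_gpu(False)`. A real implementation would more likely attach this logic through module hooks so the model's own `forward` stays untouched.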

It saves me 12 to 13 seconds of inference with Z-Image Turbo (my custom pipeline using this partial layer offloading):

Before (normal sequential offloading):

After (partial sequential offloading):

Nov 29 '25 15:11 rodjjo