
Best Way To Load Multiple Fine-Tuned Models?

Open adhikjoshi opened this issue 2 years ago • 8 comments

Describe the bug

I am trying to load up to 4 fine-tuned models using pipelines.

Here is what my code looks like:

import torch
from diffusers import StableDiffusionPipeline

pipe1 = StableDiffusionPipeline.from_pretrained(model_path1, revision="fp16", torch_dtype=torch.float16)
pipe2 = StableDiffusionPipeline.from_pretrained(model_path2, revision="fp16", torch_dtype=torch.float16)
pipe3 = StableDiffusionPipeline.from_pretrained(model_path3, revision="fp16", torch_dtype=torch.float16)
pipe4 = StableDiffusionPipeline.from_pretrained(model_path4, revision="fp16", torch_dtype=torch.float16)

As you can imagine, this is a painfully slow and memory-inefficient way to do it.

How can I load multiple fine-tuned models, e.g. through something like custom_pipeline="stable_diffusion_mega", or pass an array of paths and have them load memory-efficiently?

Or like this:

img2text = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
img2img = StableDiffusionImg2ImgPipeline(**img2text.components)
inpaint = StableDiffusionInpaintPipeline(**img2text.components) 

Reproduction

No response

Logs

No response

System Info

diffusers 0.6.0, Python 3.10

adhikjoshi avatar Oct 30 '22 19:10 adhikjoshi

Try using joblib and cache each pipe to disk. It will cache the pipe, its models, and their in-memory state to disk, and they load in ~2 seconds (1 second on my system, 2 seconds on Colab), so switching between pipe types is very quick.

Example:

pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_auth_token=True).to(device)
joblib.dump(pipe, f'{PIPE_CACHE}/LOW_VRAM_PIPE.obj')

and

pipe = joblib.load(f'{PIPE_CACHE}/LOW_VRAM_PIPE.obj')
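
As a fuller illustration of that pattern, here is a minimal sketch of a cache-or-load helper; the PIPE_CACHE directory and the load_or_cache_pipe name are placeholders, not anything provided by diffusers or joblib:

import os
import joblib
import torch
from diffusers import StableDiffusionPipeline

PIPE_CACHE = "./pipe_cache"  # assumed cache directory
os.makedirs(PIPE_CACHE, exist_ok=True)

def load_or_cache_pipe(model_id, name, device="cuda"):
    # Hypothetical helper: reuse a cached pipe if one exists, otherwise build and cache it.
    path = f"{PIPE_CACHE}/{name}.obj"
    if os.path.exists(path):
        return joblib.load(path)
    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16, use_auth_token=True
    ).to(device)
    joblib.dump(pipe, path)
    return pipe

pipe1 = load_or_cache_pipe("path/to/finetune-1", "pipe1")
pipe2 = load_or_cache_pipe("path/to/finetune-2", "pipe2")

(As noted later in this thread, these cached objects only reload reliably within the same Python session/environment, which limits the approach.)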

WASasquatch avatar Oct 30 '22 21:10 WASasquatch

Good question - at the moment the most memory-efficient and fastest way to load the models is via device_map="auto":

pipe1 = StableDiffusionPipeline.from_pretrained(model_path1, revision="fp16", torch_dtype=torch.float16, device_map="auto")
pipe2 = StableDiffusionPipeline.from_pretrained(model_path2, revision="fp16", torch_dtype=torch.float16, device_map="auto")
pipe3 = StableDiffusionPipeline.from_pretrained(model_path3, revision="fp16", torch_dtype=torch.float16, device_map="auto")
pipe4 = StableDiffusionPipeline.from_pretrained(model_path4, revision="fp16", torch_dtype=torch.float16, device_map="auto")

This should cut loading time by 80-90%. Soon we'll have this enabled by default (cc @patil-suraj).

patrickvonplaten avatar Nov 02 '22 12:11 patrickvonplaten

@adhikjoshi ,

If you only fine-tuned the UNet, then the second option would be much more efficient, combined with device_map="auto". For example, if you have multiple fine-tuned UNets:

import torch
from diffusers import StableDiffusionPipeline, UNet2DConditionModel

pipe1 = StableDiffusionPipeline.from_pretrained(model_path1, revision="fp16", torch_dtype=torch.float16, device_map="auto")

unet_2 = UNet2DConditionModel.from_pretrained(model_path2, subfolder="unet", revision="fp16", torch_dtype=torch.float16, device_map="auto")

components = pipe1.components
components["unet"] = unet_2
pipe2 = StableDiffusionPipeline(**components)

This avoids loading the VAE and text encoder multiple times, since they are unchanged across the fine-tunes.
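
A minimal sketch extending that idea to several fine-tuned UNets; the model paths and dictionary keys below are placeholders:

import torch
from diffusers import StableDiffusionPipeline, UNet2DConditionModel

# Load one full pipeline; its VAE, text encoder, and tokenizer are shared below.
base = StableDiffusionPipeline.from_pretrained(
    "path/to/base-model", revision="fp16", torch_dtype=torch.float16, device_map="auto"
)

pipes = {"base": base}
for name, path in {"finetune_a": "path/to/finetune-a", "finetune_b": "path/to/finetune-b"}.items():
    # Only the fine-tuned UNet weights are loaded for each extra model.
    unet = UNet2DConditionModel.from_pretrained(
        path, subfolder="unet", revision="fp16", torch_dtype=torch.float16, device_map="auto"
    )
    components = base.components
    components["unet"] = unet
    pipes[name] = StableDiffusionPipeline(**components)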

patil-suraj avatar Nov 02 '22 12:11 patil-suraj

This should cut loading time by 80-90%. Soon we'll have this enabled by default (cc @patil-suraj).

Wouldn't that be a bad idea, since it's not compatible with CPU devices, and likely others? I don't think it's compatible with half floats either; you hit errors like the LayerNormKernelImpl one.

WASasquatch avatar Nov 02 '22 17:11 WASasquatch

It is compatible with CPU devices. Feel free to try it out.

patrickvonplaten avatar Nov 04 '22 16:11 patrickvonplaten

Well, I have, hence the erroring kernel I cited (LayerNormKernelImpl), which gets triggered because half floats aren't supported on CPU. There are a lot of quirks with using the CPU with diffusers right now, such as severely hindered performance (only 50-65% CPU usage, alternating back and forth).
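
For reference, one way around the half-float limitation is to pick the dtype based on the device; a minimal sketch, using the v1-4 checkpoint as an example:

import torch
from diffusers import StableDiffusionPipeline

# fp16 kernels such as LayerNorm are not implemented on CPU, so fall back to float32 there.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=dtype
).to(device)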

WASasquatch avatar Nov 04 '22 17:11 WASasquatch

Try using joblib and cache each pipe to disk. [...]

What hardware/system configuration lets you load in 1 second?

CPU/RAM/NVMe?

adhikjoshi avatar Nov 05 '22 15:11 adhikjoshi

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Nov 30 '22 15:11 github-actions[bot]

What hardware/system configuration lets you load in 1 second? CPU/RAM/NVMe?

No clue what Colab uses for CPUs, but with a Pro account it gives you 64 GB of RAM. No clue about storage either. It's about the same on my system though (aside from the online model-file checks), which is a 13th-gen i9-13900KF / 32 GB RAM / 6200 RPM HDD.

This method is no longer valid across Python versions/sessions/envs because of the fake diffusers_modules module that diffusers creates at https://github.com/huggingface/diffusers/blob/main/src/diffusers/utils/__init__.py#L77

Because of this, when you load a cached pipe in a new Python session, it throws an exception that it can't find the diffusers_modules module, which it will never be able to find, since it doesn't exist on disk and is only used internally by diffusers. The pipes will still load within the same session though, which still offers some speedup.

It's unfortunate that this is no longer viable, as diffusers weight folders are obtrusive, and loading many individual files one after another from disk is much slower than loading a single file like a ckpt or a cached object.

WASasquatch avatar Nov 30 '22 21:11 WASasquatch

This method is no longer valid across Python versions/sessions/envs because of the fake diffusers_modules module [...]

I've tried using pickle and it's faster than joblib.

Also, I load a different session outside of pickle, which overwrites the original session.

It's a bit more work, but there's no way around it.

adhikjoshi avatar Dec 03 '22 10:12 adhikjoshi

The issue with pickle is that it doesn't do memory mapping for the objects in RAM/VRAM. Joblib captures the entire state of an object and caches it, which is good for production development and for debugging across systems.
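
To illustrate the memory-mapping part: the mapping joblib offers on load applies to NumPy arrays stored in the dump; a minimal, standalone illustration (the file name is arbitrary):

import numpy as np
import joblib

arr = np.zeros((1024, 1024), dtype=np.float32)
joblib.dump(arr, "arr.joblib")

# mmap_mode='r' maps the array from disk instead of copying it all into RAM up front.
arr_view = joblib.load("arr.joblib", mmap_mode="r")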

WASasquatch avatar Dec 03 '22 19:12 WASasquatch

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Dec 28 '22 15:12 github-actions[bot]