[Core] introduce _no_split_modules to `ModelMixin`
What does this PR do?
Adds utilities to support `_no_split_modules` to `ModelMixin`. Closely follows what's done in https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py.
Part of https://github.com/huggingface/diffusers/issues/6240.
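For context, here is a minimal sketch of what opting in looks like on a model class; the model below is made up for illustration, and the block names simply mirror the ones discussed later in this thread:

```python
import torch

from diffusers import ModelMixin
from diffusers.configuration_utils import ConfigMixin, register_to_config


class MyToyModel(ModelMixin, ConfigMixin):
    # Module classes listed here are never split across devices when a user
    # loads the model with device_map="auto" (e.g. because their forward()
    # carries residual connections between internal tensors).
    _no_split_modules = ["BasicTransformerBlock", "ResnetBlock2D"]

    @register_to_config
    def __init__(self, hidden_size: int = 32):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        return self.proj(x)
```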
I think it's better to tackle the introduction of device_map="auto" to pipelines in multiple PRs. @SunMarc laid out a very nice plan here (internal Slack link).
TODO
- [x] Get initial reviews from an accelerate core maintainer
- [x] Propagate to other important models inheriting `ModelMixin`
- [x] Add tests
- [ ] Docs (if needed)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@SunMarc so, I incorporated the changes and tested with:

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet", device_map="auto"
)
print(unet.hf_device_map)
```

It prints:

```
{'': 0}
```

I tested this on a single GPU. Does this look correct?
@patrickvonplaten I have gone through the structures but would appreciate a confirmation that `BasicTransformerBlock` and `ResnetBlock2D` are indeed the only blocks that contain a residual path in their forward() method (considering the base model is an SDXL UNet).
> I tested this on a single GPU. Does this look correct?

Yes, it looks correct. Try playing with multiple GPUs and check whether you are able to run the model correctly, since users use device_map to split the model across multiple GPUs.
> Try playing with multiple GPUs and check whether you are able to run the model correctly, since users use device_map to split the model across multiple GPUs.

Do you mean using the same code example but on multiple GPUs? How should the inputs be constructed, then? How should we handle device placement for them?
@SunMarc I tried on two GPUs. Here are some findings.

Test code:

```python
import torch

from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet", device_map="sequential"
)
print(unet.hf_device_map)

# Inputs
sample = torch.randn(1, 4, 128, 128).to("cuda")
t = torch.randint(1, 1000, size=(1,)).to("cuda")
encoder_hidden_states = torch.randn(1, 77, 2048).to("cuda")
add_text_embeds = torch.randn(1, 1280).to("cuda")
add_time_ids = torch.randn(1, 6).to("cuda")
added_cond_kwargs = {"text_embeds": add_text_embeds, "time_ids": add_time_ids}

# Forward
with torch.no_grad():
    outputs = unet(
        sample=sample,
        timestep=t,
        encoder_hidden_states=encoder_hidden_states,
        added_cond_kwargs=added_cond_kwargs,
    ).sample
print(outputs.shape)
```
With ["BasicTransformerBlock", "ResnetBlock2D"]
specified in _no_split_modules
of UNet2DConditionModel
, it leads to the following device map:
{'conv_in': 0, 'time_proj': 0, 'time_embedding': 0, 'add_time_proj': 0, 'add_embedding': 0, 'down_blocks': 0, 'up_blocks.0.attentions.0': 0, 'up_blocks.0.attentions.1.norm': 0, 'up_blocks.0.attentions.1.proj_in': 0, 'up_blocks.0.attentions.1.transformer_blocks.0': 0, 'up_blocks.0.attentions.1.transformer_blocks.1': 1, 'up_blocks.0.attentions.1.transformer_blocks.2': 1, 'up_blocks.0.attentions.1.transformer_blocks.3': 1, 'up_blocks.0.attentions.1.transformer_blocks.4': 1, 'up_blocks.0.attentions.1.transformer_blocks.5': 1, 'up_blocks.0.attentions.1.transformer_blocks.6': 1, 'up_blocks.0.attentions.1.transformer_blocks.7': 1, 'up_blocks.0.attentions.1.transformer_blocks.8': 1, 'up_blocks.0.attentions.1.transformer_blocks.9': 1, 'up_blocks.0.attentions.1.proj_out': 1, 'up_blocks.0.attentions.2': 1, 'up_blocks.0.resnets': 1, 'up_blocks.0.upsamplers': 1, 'up_blocks.1': 1, 'up_blocks.2': 1, 'mid_block': 1, 'conv_norm_out': 1, 'conv_act': 1, 'conv_out': 1}
However, inference then fails with the following error:

```
Traceback (most recent call last):
  File "/home/sayak/diffusers/test_single_file.py", line 19, in <module>
    outputs = unet(
  File "/home/sayak/.pyenv/versions/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/diffusers/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/sayak/diffusers/src/diffusers/models/unet_2d_condition.py", line 1197, in forward
    sample = upsample_block(
  File "/home/sayak/.pyenv/versions/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sayak/diffusers/src/diffusers/models/unet_2d_blocks.py", line 2324, in forward
    hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument tensors in method wrapper_CUDA_cat)
```
"CrossAttnUpBlock2D"
is the block that causes this and when added to _no_split_modules
alongside ["BasicTransformerBlock", "ResnetBlock2D"]
, the error went away and I was able to obtain the output. The device map prints as follows:
{'': 0}
Seems like nothing is being split, which I think is the expected result here?
> Do you mean using the same code example but on multiple GPUs? How should the inputs be constructed, then? How should we handle device placement for them?

The inputs will be automatically dispatched to the right device because accelerate adds hooks to the modules that take care of that.
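To make the hook behavior concrete, here is a small sketch (not from this PR) using accelerate's public `dispatch_model` on a toy module; the placement assumes two GPUs and is purely illustrative:

```python
import torch
from accelerate import dispatch_model

# Toy two-layer model; accelerate attaches hooks that move each submodule's
# inputs onto that submodule's device right before its forward runs.
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Linear(16, 16))
device_map = {"0": 0, "1": 1}  # first layer on cuda:0, second on cuda:1
model = dispatch_model(model, device_map=device_map)

x = torch.randn(2, 16)  # the input can stay on the CPU
with torch.no_grad():
    y = model(x)  # hooks shuttle activations from cuda:0 to cuda:1
print(y.shape)
```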
"CrossAttnUpBlock2D" is the block that causes this and when added to _no_split_modules alongside ["BasicTransformerBlock", "ResnetBlock2D"], the error went away and I was able to obtain the output. The device map prints as follows:
{'': 0} Seems like nothing is being split, which I think is the expected result here?
No that should be the case since we want the model to be split. The results we should get is something like:
{'conv_in': 0, 'time_proj': 0, 'time_embedding': 0, 'add_time_proj': 0, 'add_embedding': 0, 'down_blocks': 0, 'up_blocks.0.attentions.0': 0, 'up_blocks.0.attentions.1': 1, 'up_blocks.0.attentions.2': 1, 'up_blocks.0.resnets': 1, 'up_blocks.0.upsamplers': 1, 'up_blocks.1': 1, 'up_blocks.2': 1, 'mid_block': 1, 'conv_norm_out': 1, 'conv_act': 1, 'conv_out': 1}
In the previous example, the inference failed since the `CrossAttnUpBlock2D` is concatenating `hidden_states` that come from different devices. I suspect the problem comes from the mapping below, which splits the attention block, so indeed we should add `CrossAttnUpBlock2D` to `_no_split_modules`. Another way would be to make sure that `hidden_states` and `res_hidden_states` are on the same device, but I prefer not to add anything to the modeling code:

```
{'up_blocks.0.attentions.1.norm': 0, 'up_blocks.0.attentions.1.proj_in': 0, 'up_blocks.0.attentions.1.transformer_blocks.0': 0, 'up_blocks.0.attentions.1.transformer_blocks.1': 1, 'up_blocks.0.attentions.1.transformer_blocks.2': 1, 'up_blocks.0.attentions.1.transformer_blocks.3': 1, 'up_blocks.0.attentions.1.transformer_blocks.4': 1, 'up_blocks.0.attentions.1.transformer_blocks.5': 1, 'up_blocks.0.attentions.1.transformer_blocks.6': 1, 'up_blocks.0.attentions.1.transformer_blocks.7': 1, 'up_blocks.0.attentions.1.transformer_blocks.8': 1, 'up_blocks.0.attentions.1.transformer_blocks.9': 1, 'up_blocks.0.attentions.1.proj_out': 1}
```
Thanks for providing your inputs.

> Another way would be to make sure that `hidden_states` and `res_hidden_states` are on the same device, but I prefer not to add anything to the modeling code.

Indeed, this should be preferred. We don't want to touch the forward call unless absolutely necessary.

> I suspect the problem comes from the mapping below, which splits the attention block, so indeed we should add `CrossAttnUpBlock2D` to `_no_split_modules`.

But when I did that, the model doesn't seem to split. What are we missing here? Would you be able to take a deeper look or give me some pointers to see this through?
I've traced the issue back. It is an issue in accelerate where the memory allocation and module placement are not very good for models whose largest non-splittable layer is very big compared to the whole model. In our case, by specifying `CrossAttnUpBlock2D`, the module `up_blocks.0` becomes non-splittable, and since it represents half of the memory (5GB out of 10GB), we get a bad module placement. This is why I was recommending smaller non-splittable blocks. Nevertheless, this is what needs to be added to `_no_split_modules` if we don't want to modify the modeling file.

I can try to fix it in accelerate, but it might require quite some time since the fix could impact all models in transformers. This model is pretty small, so it will fit on one GPU. To continue with the PR, can you try other models by adding `_no_split_modules`? This way, we can see whether this is a recurrent issue or not.
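If it helps for debugging, the placement accelerate computes can also be inspected directly with its public `infer_auto_device_map`, without loading any weights; the `max_memory` values below are arbitrary and only meant as a sketch:

```python
from accelerate import infer_auto_device_map, init_empty_weights
from diffusers import UNet2DConditionModel

# Instantiate the UNet on the meta device so no memory is actually allocated.
config = UNet2DConditionModel.load_config(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
with init_empty_weights():
    unet = UNet2DConditionModel.from_config(config)

device_map = infer_auto_device_map(
    unet,
    max_memory={0: "6GiB", 1: "10GiB"},
    no_split_module_classes=["BasicTransformerBlock", "ResnetBlock2D", "CrossAttnUpBlock2D"],
)
print(device_map)  # shows where the large, non-splittable up_blocks.0 lands
```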
I forgot to mention, but you can also provide your own `device_map` to check whether inference works for a specific placement, since the generated `device_map` is not optimal. For example, the device map below works with the `UNet2DConditionModel`. It shows that you indeed need to keep the `up_blocks` unsplit.
```python
device_map = {
    "conv_in": 0,
    "time_proj": 0,
    "time_embedding": 0,
    "add_time_proj": 0,
    "add_embedding": 0,
    "down_blocks": 0,
    "up_blocks.0": 0,
    "up_blocks.1": 1,
    "up_blocks.2": 1,
    "mid_block": 1,
    "conv_norm_out": 1,
    "conv_act": 1,
    "conv_out": 1,
}
```
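For completeness, one way to apply such a manual map is via accelerate's `dispatch_model` on an already-loaded model (a hedged sketch; the PR may also end up accepting a dict directly in `from_pretrained`):

```python
from accelerate import dispatch_model
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
# device_map is the manual placement dict defined above.
unet = dispatch_model(unet, device_map=device_map)
```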
> Nevertheless, this is what needs to be added to `_no_split_modules` if we don't want to modify the modeling file.

I think we definitely don't want to change the modeling code, following what `transformers` does.

I will try on other models and maybe even on a smaller GPU. The smallest I have access to is 16GB.
@SunMarc seems like good progress now.

Since I am trying on a machine with two 4090s, I restricted the memory so that the `device_map` takes effect:

```python
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    subfolder="unet",
    device_map="auto",
    max_memory={0: "6GiB", 1: "10GiB"},
)
print(unet.hf_device_map)
```

Worked like a charm! The device map:

```
{'conv_in': 0, 'time_proj': 0, 'time_embedding': 0, 'add_time_proj': 0, 'add_embedding': 0, 'down_blocks.0': 0, 'down_blocks.1': 0, 'down_blocks.2.attentions.0.norm': 0, 'down_blocks.2.attentions.0.proj_in': 0, 'down_blocks.2.attentions.0.transformer_blocks.0': 0, 'down_blocks.2.attentions.0.transformer_blocks.1': 0, 'down_blocks.2.attentions.0.transformer_blocks.2': 0, 'down_blocks.2.attentions.0.transformer_blocks.3': 0, 'down_blocks.2.attentions.0.transformer_blocks.4': 0, 'down_blocks.2.attentions.0.transformer_blocks.5': 0, 'down_blocks.2.attentions.0.transformer_blocks.6': 0, 'down_blocks.2.attentions.0.transformer_blocks.7': 0, 'down_blocks.2.attentions.0.transformer_blocks.8': 0, 'down_blocks.2.attentions.0.transformer_blocks.9': 1, 'down_blocks.2.attentions.0.proj_out': 1, 'down_blocks.2.attentions.1': 1, 'down_blocks.2.resnets': 1, 'up_blocks': 1, 'mid_block': 1, 'conv_norm_out': 1, 'conv_act': 1, 'conv_out': 1}
```
I have also added two tests closely following this and this. I have tested them too with the following:

```bash
RUN_SLOW=1 pytest tests/models/test_models_unet_2d_condition.py -k "offload"
```
I think we can add docs after we ship this feature to pipelines because that provides fuller context.

Meanwhile, could you go through the PR once in detail and let me know your thoughts? Once that's done, I will add `_no_split_modules` to other models and mark the PR ready for review.

Cc: @DN6 for awareness.
@SunMarc we don't run multi-gpu tests yet because this hasn't been a strong case for us.
> @SunMarc we don't run multi-gpu tests yet because this hasn't been a strong case for us.

Makes sense. But can we still have them and skip them in the CI? They are useful to check that we did the splitting (`_no_split_modules`) correctly and are able to run the model.
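For instance, such a test could be gated roughly like this (a sketch only; the decorator is spelled out here rather than taken from the diffusers test utilities):

```python
import unittest

import torch


def require_torch_multi_gpu(test_case):
    """Skip the decorated test unless more than one CUDA device is visible."""
    return unittest.skipUnless(torch.cuda.device_count() > 1, "test requires multiple GPUs")(test_case)


class UNetDeviceMapTests(unittest.TestCase):
    @require_torch_multi_gpu
    def test_model_parallelism(self):
        # Placeholder: load the model with device_map="auto" and assert that
        # unet.hf_device_map places modules on more than one device.
        ...
```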
Good idea. I will add it. Apart from the changes you requested, is there anything else you would like me to change as far as the core design goes?
No, the core design looks very good! It is similar to `transformers`, and `device_map` is working well there.
@SunMarc I added the multi-GPU parallelism test and also `test_disk_offload_without_safetensors`. Some notes:

- `model_split_percents = [0.5, 0.3, 0.4]` is the one that seems to work for both multi-GPU and single-GPU environments for the UNet under consideration. The size of the UNet is definitely small.
- I had to pass an `offload_folder` to `test_disk_offload_with_safetensors` to make it work with the `model_split_percents` for the given UNet.

Let me know your thoughts.
> `model_split_percents = [0.5, 0.3, 0.4]` is the one that seems to work for both multi-GPU and single-GPU environments for the UNet under consideration. The size of the UNet is definitely small.

Makes sense, the model is small and the non-splittable modules are big.
> I had to pass an `offload_folder` to `test_disk_offload_with_safetensors` to make it work with the `model_split_percents` for the given UNet.

You can use disk offload without having to pass `offload_folder` when using the safetensors format. Check this PR in transformers. This can be implemented in a follow-up PR since it is not essential. LMK if you want my help on that.
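For reference, this is roughly what the current behavior looks like with an explicit offload folder (a sketch; the `max_memory` limits are arbitrary and just force part of the weights off the GPU):

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    subfolder="unet",
    device_map="auto",
    max_memory={0: "2GiB", "cpu": "3GiB"},  # leave too little room on purpose
    offload_folder="./unet_offload",        # currently required for the disk part
)
print(unet.hf_device_map)  # some modules end up mapped to "disk"
```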
> You can use disk offload without having to pass `offload_folder` when using the safetensors format. Check this PR in transformers. This can be implemented in a follow-up PR since it is not essential. LMK if you want my help on that.

@SunMarc thanks! I think it would be better to have it in a follow-up PR. Would appreciate your help with that :)
@patrickvonplaten this is ready for a review now. I propose to add the docs after we ship device map support to pipelines to have more context. Let me know what you think.
@SunMarc feel free to take another look as well (which I would appreciate since this is an important PR). Really, thank you for all your help thus far!
How can we use device_map="auto" for inference here? Can it be used when loading a pipeline?
As stated in the PR description and the internal message, we need to add support to models first. Once this is merged, we will add support for pipelines accordingly. Passing device_map="auto" to a model should work, given its `_no_split_modules` has been set, just like in `transformers`. Let me know if that's clear.
Hey, I noticed that you are working on something I'm interested in. I'm looking for an elegant way to execute a pipeline on multiple GPUs. But can this feature only be used through device_map="auto"? I think the auto device map is derived from the model parameters, but what about the model input? For the SD pipeline, the generation resolution also significantly impacts memory usage. So letting the user set manually which GPU every model uses would be a better solution, because then I can test every different GPU setting.
> So letting the user set manually which GPU every model uses would be a better solution, because then I can test every different GPU setting.

You will be able to set the max memory usage for each GPU (e.g. `max_memory={0: "6GiB", 1: "10GiB"}`). This way you can make sure to leave enough space for the model input.
I'm not just talking about the model input but about the memory increase it brings. For example, when I generate 512x512 images using the text2img pipeline, the memory cost is much lower than when generating 1024x1024. And the memory costs of the individual models like the VAE, UNet, and ControlNet are different. I tested this case: I loaded two ControlNets and a full img2img pipeline on two 2080 Ti cards (11GB). If I put the UNet and VAE on GPU 0 and the rest of the models on GPU 1, it goes OOM on GPU 0 when generating 1024x1024; but if I put the UNet and the ControlNets on GPU 0 and the rest of the models on GPU 1, I can generate 1024x1024.
But the ControlNet model doesn't have `_no_split_modules` yet. Let's maybe revisit your use case once we add support for `device_map` to the pipelines.
The memory specification also varies a bit from how it's done in the language modeling world. For example, the memory required for generating 512x512 images will differ from that for generating 1024x1024 images, naturally. So you will need to take that into consideration. @SunMarc am I thinking in the right direction?
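As a rough back-of-the-envelope illustration (assuming the standard 8x VAE downsampling used by SD/SDXL):

```python
# The UNet's latent input grows quadratically with the target resolution,
# and so do the activation buffers derived from it.
for res in (512, 1024):
    latent = res // 8  # SD/SDXL VAEs downsample by a factor of 8
    scale = (latent * latent) / (64 * 64)
    print(f"{res}x{res} image -> {latent}x{latent} latent ({scale:.0f}x the 512x512 elements)")
```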
> The memory specification also varies a bit from how it's done in the language modeling world. For example, the memory required for generating 512x512 images will differ from that for generating 1024x1024 images, naturally. So you will need to take that into consideration.
That is what I'm thinking about. In my use case, you actually can find a model placement policy that generates 1024x1024 images on two 2080 Ti (12GB) cards, but device_map="auto" may go OOM.
That could be because it's not input-aware. In those cases, handcrafting the memory map is better.
What do you mean by memory map? How can I ensure my use case won't go OOM through a memory map?
The device map, where you can specify which device gets which parts of the model and how much memory each device should use for the split.

There is no single answer to the other question, as it requires analysing the memory consumption w.r.t. the inputs (with resolution scaling, it can grow more drastically than it does with language models) and then crafting a device map that works for you.

As mentioned, support for device maps in pipelines is not there yet, so I cannot give you more concrete guidelines. But we will be sure to consider these things and document them clearly.
Thank you for your patient explanation. I will definitely try it when it's done.
I'm still not sure whether the way we support device_map here is the right way to do so. Instead of splitting the UNet over multiple devices, it would be much better to move each component to one device, e.g. the text_encoder on device 0, the unet on device 1, the vae on device 0 again, etc. IMO we should first try to map different components to different devices before splitting one component over multiple devices.

What exactly is the use case for splitting the UNet over multiple devices (and how should the text_encoder and text_encoder_2 then be split)?