
[Core] add auto `device_map` support to pipelines

Open sayakpaul opened this issue 1 year ago • 37 comments

What does this PR do?

As per what's discussed in https://github.com/huggingface/diffusers/pull/6396, this PR adds support for "balanced" device map (and other variants) in the pipelines. It's NOT complete yet.

TODOs

  • [x] Catch errors for unexpected configurations
  • [ ] Docs
  • [x] Tests

Testing:

from diffusers import DiffusionPipeline
import argparse
import torch

def run_pipeline(args):
    if args.do_device_map:
        pipeline = DiffusionPipeline.from_pretrained(
            args.ckpt_id,
            variant="fp16",
            torch_dtype=torch.float16,
            device_map="balanced",
        )
        if hasattr(pipeline, "safety_checker"):
            pipeline.safety_checker = None

    else:
        pipeline = DiffusionPipeline.from_pretrained(
            args.ckpt_id,
            variant="fp16",
            torch_dtype=torch.float16,
        )
        if hasattr(pipeline, "safety_checker"):
            pipeline.safety_checker = None
        
        pipeline = pipeline.to("cuda")

    image = pipeline(
        "picture of a dog", num_inference_steps=args.num_inference_steps, generator=torch.manual_seed(0)
    ).images[0]
    image.save(f"resultant_image_{args.do_device_map}.png")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ckpt_id", default="runwayml/stable-diffusion-v1-5", choices=["runwayml/stable-diffusion-v1-5", "stabilityai/stable-diffusion-xl-base-1.0"])
    parser.add_argument("--num_inference_steps", type=int, default=5)
    parser.add_argument("--do_device_map", action="store_true")
    args = parser.parse_args()
    run_pipeline(args)

Tested on the DGX (with accelerate installed from source).

CUDA_VISIBLE_DEVICES=1,2 python test_device_map_pipelines.py  --num_inference_steps=50

VAE: tensor([0.2964, 0.2983, 0.3008, 0.2917, 0.3213, 0.3174, 0.3298, 0.3298, 0.2352,
        0.2367, 0.2539, 0.2510], device='cuda:0', dtype=torch.float16)
CUDA_VISIBLE_DEVICES=1,2 python test_device_map_pipelines.py  --num_inference_steps=50 --do_device_map

VAE: tensor([0.2964, 0.2983, 0.3008, 0.2917, 0.3213, 0.3174, 0.3298, 0.3298, 0.2352,
        0.2367, 0.2539, 0.2510], device='cuda:0', dtype=torch.float16)

We can see that the outputs match when using device mapping.

Cc: @yiyixuxu for visibility.

sayakpaul avatar Feb 05 '24 11:02 sayakpaul

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc some more progress here.

I was able to determine a dictionary mapping the model components to the available GPU devices. Example:

{'unet': 0, 'text_encoder': 1, 'vae': 1}

Under these conditions:

{'unet': 27653351728, 'text_encoder': 3481712032, 'vae': 2594036992}
{0: 24978194432, 1: 24978194432}

(The first dictionary maps each component to its estimated size in bytes; the second maps each GPU index to its available memory in bytes.)

This device map is created here BEFORE the actual models are loaded.

This is because load_sub_model() passes a device_map. Inside that method, we determine the loading method for a given model, which will always resolve to from_pretrained().
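For context, the balanced assignment essentially boils down to a greedy placement: the largest components go first, each onto the GPU with the most remaining memory. A minimal sketch (a hypothetical helper, not the actual _assign_components_to_devices implementation), using the numbers above:

def assign_components_to_devices(component_sizes, gpu_memory):
    # component_sizes: {"unet": bytes, ...}; gpu_memory: {gpu_index: available bytes}
    remaining = dict(gpu_memory)
    device_map = {}
    # Place the largest components first, each on the GPU with the most free memory.
    for name, size in sorted(component_sizes.items(), key=lambda kv: kv[1], reverse=True):
        device = max(remaining, key=remaining.get)
        device_map[name] = device
        remaining[device] -= size
    return device_map

print(assign_components_to_devices(
    {"unet": 27653351728, "text_encoder": 3481712032, "vae": 2594036992},
    {0: 24978194432, 1: 24978194432},
))
# {'unet': 0, 'text_encoder': 1, 'vae': 1}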

From https://github.com/huggingface/diffusers/pull/6396#issuecomment-1919988030:

Since each model is loaded on only one device, we don't add hooks by default. You need to set force_hooks=True in load_checkpoint_and_dispatch. By doing that, we will add hooks that move the data to the correct device when performing inference.

I think I can still pass a boolean indicator within load_sub_model() indicating that the force_hooks argument should be set to True. But how do we handle the text encoder's case here? How can we let transformers know that it should set force_hooks to True from the diffusers codebase?

Anyway the following seems to be working:

from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    variant="fp16",
    torch_dtype=torch.float16,
    device_map="auto",
    safety_checker=None,
)

for name, component in pipeline.components.items():
    if isinstance(component, torch.nn.Module):
        print(name, component.device)


_ = pipeline("picture of a dog", num_inference_steps=50)

Prints:

vae cuda:1
text_encoder cuda:1
unet cuda:0

But it also throws:

You shouldn't move a model when it is dispatched on multiple devices.

However, it also produces:

/home/sayak/diffusers/src/diffusers/image_processor.py:90: RuntimeWarning: invalid value encountered in cast
  images = (images * 255).round().astype("uint8")

Could this be related to the device placement?

It is not the case when we do:

from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    variant="fp16",
    torch_dtype=torch.float16,
    safety_checker=None,
).to("cuda")

_ = pipeline("picture of a dog", num_inference_steps=50)
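As a quick way to check whether the cast warning in the device-mapped run comes from NaNs, the decoded images can be inspected before the uint8 cast. A sketch (output_type="np" returns the images as a float NumPy array):

import numpy as np
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    variant="fp16",
    torch_dtype=torch.float16,
    device_map="auto",
    safety_checker=None,
)

# Request the float array before the uint8 cast so any NaNs are still visible.
images = pipeline("picture of a dog", num_inference_steps=50, output_type="np").images
print("NaNs in decoded images:", np.isnan(images).any())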

sayakpaul avatar Feb 09 '24 11:02 sayakpaul

Been trying to debug the cause of the black image results stemming from device_map="auto". Here are some findings.

Without any device_map, we have (five denoising steps):

Prompt embeds: tensor([[-3.8843e-01,  2.2949e-02, -5.2338e-02, -1.8420e-01],
        [-3.7183e-01, -1.4492e+00, -3.3936e-01, -1.0754e-01],
        [-5.1074e-01, -1.4629e+00, -2.9272e-01,  1.2255e-03],
        [-5.5518e-01, -1.4248e+00, -2.8711e-01,  6.1646e-02]], device='cuda:0',
       dtype=torch.float16)

Initial latents: tensor([-0.9014,  0.1541,  0.2152, -0.6416,  1.0215, -0.3105, -1.4922,  0.0122,
        -0.9941,  0.6323,  0.5259,  0.1608,  0.9238, -1.2178,  0.4255, -1.7715],
       device='cuda:0', dtype=torch.float16)

UNet predictions 0: tensor([-0.9072,  0.1572,  0.2156, -0.6226,  1.0273, -0.2859, -1.4678,  0.0511,
        -1.0010,  0.6211,  0.4883,  0.1566,  0.9639, -1.2197,  0.4321, -1.7646],
       device='cuda:0', dtype=torch.float16)
UNet predictions 1: tensor([-0.9502,  0.1558,  0.1925, -0.6182,  0.9028, -0.2756, -1.4424,  0.0494,
        -1.0000,  0.5996,  0.4292,  0.1602,  1.0068, -1.1729,  0.4441, -1.6826],
       device='cuda:0', dtype=torch.float16)
UNet predictions 2: tensor([-0.9478,  0.1094,  0.1261, -0.6631,  0.9561, -0.3069, -1.5049,  0.0400,
        -0.9795,  0.6064,  0.4529,  0.1589,  1.0205, -1.1543,  0.4565, -1.6758],
       device='cuda:0', dtype=torch.float16)
UNet predictions 3: tensor([-0.8262,  0.1978,  0.0524, -0.5957,  0.9131, -0.1609, -1.4922,  0.1169,
        -1.0059,  0.6895,  0.3044,  0.2878,  1.0361, -0.9067,  0.3342, -1.4824],
       device='cuda:0', dtype=torch.float16)
UNet predictions 4: tensor([-1.0234,  0.4153, -0.0172, -0.6655,  0.7754, -0.0658, -1.7363,  0.0820,
        -1.3252,  1.0186,  0.2703,  0.2452,  0.5400, -0.4561, -0.0053, -1.4590],
       device='cuda:0', dtype=torch.float16)
UNet predictions 5: tensor([-0.3188,  0.1713, -0.0659, -0.3005,  0.2207, -0.0583, -0.3931,  0.0102,
        -0.2925,  0.0295, -0.4709,  0.0939,  0.2411, -0.3926,  0.0777, -0.2218],
       device='cuda:0', dtype=torch.float16)


VAE: tensor([ 0.0027,  0.0687,  0.1221,  0.1288, -0.0571,  0.0011,  0.0557,  0.0393,
        -0.1168, -0.0524,  0.0131,  0.0068], device='cuda:0',
       dtype=torch.float16)

With device_map="auto":

Prompt embeds: tensor([[-3.8843e-01,  2.2949e-02, -5.2338e-02, -1.8420e-01],
        [-3.7183e-01, -1.4492e+00, -3.3936e-01, -1.0754e-01],
        [-5.1074e-01, -1.4629e+00, -2.9272e-01,  1.2255e-03],
        [-5.5518e-01, -1.4248e+00, -2.8711e-01,  6.1646e-02]], device='cuda:1',
       dtype=torch.float16)

Initial latents: tensor([-0.9014,  0.1541,  0.2152, -0.6416,  1.0215, -0.3105, -1.4922,  0.0122,
        -0.9941,  0.6323,  0.5259,  0.1608,  0.9238, -1.2178,  0.4255, -1.7715],
       device='cuda:1', dtype=torch.float16)

UNet predictions 0: tensor([-0.3884, -0.2473, -0.2957, -0.0774, -0.6709,  1.2314, -0.7563,  0.9395,
        -0.7852, -0.3325, -1.0791, -0.0468, -0.4641, -1.5430, -0.6104, -1.1377],
       device='cuda:1', dtype=torch.float16)
UNet predictions 1: tensor([-0.9014,  0.1541,  0.2152, -0.6416,  1.0215, -0.3105, -1.4922,  0.0122,
        -0.9941,  0.6323,  0.5259,  0.1608,  0.9238, -1.2178,  0.4255, -1.7715],
       device='cuda:1', dtype=torch.float16)
UNet predictions 2: tensor([  0.8125,  -7.2891,   5.1875, -12.3125,   3.8184,  10.3203,   1.4805,
          5.5859,  -6.2812,  11.1172,  -2.3516,  12.0703,   3.5977, -18.9219,
          3.1055, -23.1250], device='cuda:1', dtype=torch.float16)
UNet predictions 3: tensor([ -0.0234,  -3.5703,   2.6973,  -6.4609,   2.3984,   5.0117,   0.0273,
          2.7969,  -3.6152,   5.8594,  -0.9258,   6.1094,   2.2402, -10.0469,
          1.7559, -12.4062], device='cuda:1', dtype=torch.float16)
UNet predictions 4: tensor([-1.7969,  4.6445, -2.8281,  6.5234, -0.8281, -6.7031, -3.0605, -3.3594,
         2.3633, -5.8125,  2.1855, -7.0938, -0.8398,  9.7031, -1.2617, 11.4375],
       device='cuda:1', dtype=torch.float16)
UNet predictions 5: tensor([-1.3057,  3.2754, -1.9756,  4.5508, -0.5312, -4.7227, -2.2227, -2.3555,
         1.6113, -4.0469,  1.5605, -4.9844, -0.5464,  6.7656, -0.8643,  7.9375],
       device='cuda:1', dtype=torch.float16)

VAE: tensor([ 0.0008,  0.0041, -0.0304, -0.0286, -0.0887, -0.1207, -0.1835, -0.2190,
        -0.2219, -0.2612, -0.3252, -0.3804], device='cuda:1',
       dtype=torch.float16)

We can clearly see that the UNet predictions start to differ and, in general, have a much higher norm than in the case without device_map. Note that the prompt_embeds and the initial latents don't change (as is evident from the outputs).
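For reference, a minimal sketch of how intermediate UNet outputs like the ones above can be captured, using a plain PyTorch forward hook (the .sample/tuple handling is an assumption about the UNet's return type):

import torch

def register_unet_debug_hook(pipeline):
    # Print the first few values of the UNet output at every denoising step.
    def debug_hook(module, inputs, output):
        sample = getattr(output, "sample", output)
        if isinstance(sample, tuple):
            sample = sample[0]
        print("UNet output:", sample.flatten()[:16])
    return pipeline.unet.register_forward_hook(debug_hook)

# handle = register_unet_debug_hook(pipeline)
# _ = pipeline("picture of a dog", num_inference_steps=5, generator=torch.manual_seed(0))
# handle.remove()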

sayakpaul avatar Feb 11 '24 03:02 sayakpaul

We can clearly see that the UNet predictions start to differ and, in general, have a much higher norm than in the case without device_map. Note that the prompt_embeds and the initial latents don't change (as is evident from the outputs).

This is strange. In my tests, the prompt_embeds don't change but the initial latents change every time, whether I'm using device_map or not. Can you try with device_map but only on one device (you'll need to remove the error that you are raising)? If we indeed get the same results, the issue is probably due to the data being moved across the GPUs.

You shouldn't move a model when it is dispatched on multiple devices.

This warning should be fixed in this PR

/home/sayak/diffusers/src/diffusers/image_processor.py:90: RuntimeWarning: invalid value encountered in cast images = (images * 255).round().astype("uint8")

Oh that's strange, I don't have this issue

I think I can still pass a boolean indicator within load_sub_model() indicating that the force_hooks argument should be set to True. But how do we handle the text encoder's case here? How can we let transformers know that it should set force_hooks to True from the diffusers codebase?

The simplest solution would be to add this arg in the from_pretrained method of transformers in a PR.

I'll try to dig deeper into why I'm not able to reproduce the results. Thanks again for your work!

SunMarc avatar Feb 13 '24 21:02 SunMarc

@ArthurZucker need some help regarding https://github.com/huggingface/diffusers/pull/6857#issuecomment-1935795031:

I think I can still pass a boolean indicator within load_sub_model() indicating that the force_hooks argument should be set to True. But how do we handle the text encoder's case here? How can we let transformers know that it should set force_hooks to True from the diffusers codebase?

sayakpaul avatar Feb 14 '24 04:02 sayakpaul

This is strange. In my tests, the prompt_embeds don't change but the initial latents change every time, whether I'm using device_map or not. Can you try with device_map but only on one device (you'll need to remove the error that you are raising)? If we indeed get the same results, the issue is probably due to the data being moved across the GPUs.

Your hunch was correct. I used a device map but with a single device, and the intermediate values matched. They did not match when using multiple GPUs. Different results due to data movement may be acceptable to an extent, as long as they're not leading to black images as in this case.

Oh that's strange, I don't have this issue

It's a bit random for a smaller number of steps. When you bump num_inference_steps up to 50, you should see it.

Let me know.

sayakpaul avatar Feb 14 '24 04:02 sayakpaul

Your hunch was correct. I used a device map but with a single device, and the intermediate values matched. They did not match when using multiple GPUs. Different results due to data movement may be acceptable to an extent, as long as they're not leading to black images as in this case.

On my side, even without device_map, I get a different latent space each time. Can you confirm that this is not the case on your side?

SunMarc avatar Feb 14 '24 14:02 SunMarc

On my side, even without device_map, I get a different latent space each time. Can you confirm that this is not the case on your side?

I can confirm that it is NOT the case on my end. Let me send over my testing script:

from diffusers import DiffusionPipeline
import argparse
import torch

def run_pipeline(args):
    if args.do_device_map:
        pipeline = DiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5",
            variant="fp16",
            torch_dtype=torch.float16,
            device_map="auto",
            safety_checker=None,
        )

        for name, component in pipeline.components.items():
            if isinstance(component, torch.nn.Module):
                print(name, component.hf_device_map)
    else:
        pipeline = DiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5",
            variant="fp16",
            torch_dtype=torch.float16,
            safety_checker=None,
        ).to("cuda")

    _ = pipeline("picture of a dog", num_inference_steps=args.num_inference_steps, generator=torch.manual_seed(0))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_inference_steps", type=int, default=5)
    parser.add_argument("--do_device_map", action="store_true")
    args = parser.parse_args()
    run_pipeline(args)

This should probably work because, in the previous examples, there was no generator controlling the randomness. LMK.

sayakpaul avatar Feb 14 '24 14:02 sayakpaul

@SunMarc on my local clone of transformers I did pass force_hooks=True here:

https://github.com/huggingface/transformers/blob/ce4fff0be7f6464d713f7ac3e0bbaafbc6959ae5/src/transformers/modeling_utils.py#L3558

But it still didn't help prevent the NaN issues described above. Would appreciate some guidance.

sayakpaul avatar Feb 18 '24 11:02 sayakpaul

@SunMarc a gentle ping here.

sayakpaul avatar Feb 20 '24 03:02 sayakpaul

Sorry for the late reply, should I have a look?

ArthurZucker avatar Feb 20 '24 03:02 ArthurZucker

@ArthurZucker yes please: https://github.com/huggingface/diffusers/pull/6857#issuecomment-1943072097

sayakpaul avatar Feb 20 '24 14:02 sayakpaul

Alright, it's a bit stretched out; sorry to ask you this, but could you reformulate, in terms of transformers, what you want to do? Pass a model that is placed on HF devices by diffusers?

ArthurZucker avatar Feb 23 '24 08:02 ArthurZucker

Sure, let me retry.

Let me set the context as briefly as possible.

Any modern diffusion model is not just about a single model but rather a collection of different models chained together with a specific computation graph. We define that graph in the form of a "Pipeline" in diffusers. More specifically, all such pipelines in the library inherit from the DiffusionPipeline class.

The DiffusionPipeline class loads all the underlying models such as the UNet, text encoder, VAE, etc. For handling all things text encoder, we rely on `transformers`.

load_sub_model() is the method that is responsible for loading the individual models involved in a pipeline with their checkpoints: https://github.com/huggingface/diffusers/blob/bb1b76d3bf9ef78a827086d1b9449975237ecbac/src/diffusers/pipelines/pipeline_utils.py#L419.

Now, in the context of this PR, we need to set force_hooks=True while calling the load_checkpoint_and_dispatch() method of accelerate (more context is here in the last point). We can easily do so here for the models that are implemented in diffusers:

https://github.com/huggingface/diffusers/blob/bb1b76d3bf9ef78a827086d1b9449975237ecbac/src/diffusers/models/modeling_utils.py#L690C36-L690C64
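For the diffusers-side models, that would amount to roughly the following. A sketch only, assuming the force_hooks argument is available in the installed accelerate version; the checkpoint path is a hypothetical local path:

import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from diffusers import UNet2DConditionModel

# Instantiate the model on the meta device, then dispatch the checkpoint with
# force_hooks=True so a device-alignment hook is attached even for a single device.
with init_empty_weights():
    config = UNet2DConditionModel.load_config("runwayml/stable-diffusion-v1-5", subfolder="unet")
    unet = UNet2DConditionModel.from_config(config)

unet = load_checkpoint_and_dispatch(
    unet,
    checkpoint="/path/to/unet/diffusion_pytorch_model.bin",  # hypothetical local path
    device_map={"": 0},  # whole UNet on GPU 0
    force_hooks=True,
)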

But how do we handle the text encoder's case here, as it comes from transformers?

I hope you now have the entire context and the clarity needed here. If not, let me know.

sayakpaul avatar Feb 23 '24 12:02 sayakpaul

Sorry for the delay @sayakpaul, I just tested the PR with the example you provided with num_inference_steps=50 and with/without --do_device_map, and I get the same results (which is a good thing, I guess). You can find the output of my terminal here. I'm using the main branch of transformers and accelerate. Let's try to find out why it is not working on your end, or maybe I did something wrong again. Note that force_hooks is only useful when the text_encoder is not on the same device as the unet since we need to move the output of the unet to the text_encoder model device.

SunMarc avatar Feb 26 '24 21:02 SunMarc

Thanks @SunMarc!

Note that force_hooks is only useful when the text_encoder is not on the same device as the unet since we need to move the output of the unet to the text_encoder model device.

I think we cannot assume that it won't be the case. So, it's best to have support for that, no?

just tested the PR with the example you provided with num_inference_steps=50 and with/without --do_device_map, and I get the same results (which is a good thing, I guess).

Definitely a good thing! However, I still get NaNs with the device map enabled on audace (which has two 4090s) with from-source installations of transformers and accelerate :( Could this be dependent on the number of devices we're testing on (and the size of each)? For example, you are using three while I am using two.

sayakpaul avatar Feb 27 '24 08:02 sayakpaul

I think we cannot assume that it won't be the case. So, it's best to have support for that, no?

Yes, of course! I was explaining why it worked on your side without having to set force_hooks=True.

Definitely a good thing! However, I still get NaNs with the device map enabled on audace (which has two 4090s) with from-source installations of transformers and accelerate :( Could this be dependent on the number of devices we're testing on (and the size of each)? For example, you are using three while I am using two.

I'll do more tests on the DGX and on audace to find the issue!

SunMarc avatar Feb 27 '24 15:02 SunMarc

I was able to find the issue. This is probably an NVIDIA driver issue for 4090 GPUs. See these threads: 1, 2 and 3. What happens is that with device_map, we move data across GPUs, and due to the following bug, the results get corrupted.

import torch 
tensor_cpu = torch.tensor([[1,2,3]])
print(tensor_cpu)
tensor_1 = tensor_cpu.to(1)
print(tensor_1)
tensor_1_to_0 = tensor_1.to(0)
print(tensor_1_to_0)

Output:

tensor([[1, 2, 3]])
tensor([[1, 2, 3]], device='cuda:1')
tensor([[0, 0, 0]], device='cuda:0')

Disabling P2P communication (NCCL_P2P_DISABLE=1) doesn't work either. We can try to fix it by upgrading to the latest driver, but I'm not sure that works. cc @muellerzr since I remember you faced this issue a while ago.

Nevertheless, let's continue this PR by testing it on the DGX instead, since this is a hardware issue. What could be done is to move the data back to the CPU after performing inference on each model, but that would not be optimal.

tensor_cpu = torch.tensor([[1,2,3]])
tensor_1 = tensor_cpu.to(1)
# this works
print(tensor_1.to("cpu").to(0))

SunMarc avatar Feb 27 '24 22:02 SunMarc

@SunMarc updating the CUDA drivers will solve this :) (tested on the 4090s):

No need for fancy env settings etc., just run python myscript.py:

import torch 
from accelerate.utils import send_to_device
tensor_cpu = torch.tensor([[1,2,3]])
tensor_1 = send_to_device(tensor_cpu, "cuda:0")
print(tensor_1)
tensor_1_to_0 = send_to_device(tensor_1, "cuda:1")
print(tensor_1_to_0)
(accelerate) (base) zach@workhorse:~/work$ python test.py
tensor([[1, 2, 3]], device='cuda:0')
tensor([[1, 2, 3]], device='cuda:1')

muellerzr avatar Feb 27 '24 23:02 muellerzr

@sayakpaul thanks for the context. I am guessing you are loading the transformers model through the from_pretrained API of PreTrainedModel, which does not really leave the freedom for that, so I would say some changes might be needed / manually looping over the transformers model and setting the attribute 😞

ArthurZucker avatar Feb 28 '24 00:02 ArthurZucker

I am guessing you are loading the transformers model through the from_pretrained API of PreTrainedModel, which does not really leave the freedom for that, so I would say some changes might be needed / manually looping over the transformers model and setting the attribute 😞

Oh, that's non-ideal. We try not to touch the stuff coming from transformers. @SunMarc do you have an idea on how we can do that? Pinging @yiyixuxu here to check if we're okay going via the route Arthur is suggesting to make device_map="auto" possible for diffusers pipelines.

sayakpaul avatar Feb 28 '24 01:02 sayakpaul

@SunMarc @muellerzr thanks so much!

I was able to find the issue. This is probably an NVIDIA driver issue for 4090 GPUs.

Should we add a check for the driver version at the beginning then?

Nevertheless, let's continue this PR by testing it on the DGX instead, since this is a hardware issue. What could be done is to move the data back to the CPU after performing inference on each model, but that would not be optimal.

Yes, let's continue here now that we know the bug. Yeah, moving data back is sub-optimal, so I'd rather not do that. Instead, we can catch the driver bug early and inform our users about it.
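For example, a sanity check along these lines could run before enabling a multi-GPU device map. A sketch that round-trips a tensor across two GPUs instead of parsing driver versions:

import torch

def gpu_to_gpu_transfer_ok(device_a=0, device_b=1):
    # Round-trip a known tensor across two GPUs; on the buggy 4090 driver setups
    # described above, the values come back corrupted (e.g. all zeros).
    if torch.cuda.device_count() < 2:
        return True
    reference = torch.arange(8)
    round_tripped = reference.to(device_a).to(device_b).cpu()
    return torch.equal(reference, round_tripped)

if not gpu_to_gpu_transfer_ok():
    raise RuntimeError(
        "GPU-to-GPU transfers appear to be corrupted; please update your NVIDIA "
        "driver before using a multi-GPU device_map."
    )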

What other changes would you like to see in the PR in terms of functionality? Do we need to tackle offloading separately now? If so, I would appreciate some further guidance. And how should we approach the tests here?

sayakpaul avatar Feb 28 '24 01:02 sayakpaul

I don't mind having a "feature" for that in transformers if it simplifies the way we load models and adds some value overall! 🤗

ArthurZucker avatar Feb 28 '24 02:02 ArthurZucker

Yes, "auto" device-mapped pipelines is a requested feature and will greatly impact pipeline inference in diffusers. So, would appreciate that, @ArthurZucker!

sayakpaul avatar Feb 28 '24 10:02 sayakpaul

Should we add a check for the driver version at the beginning then?

I think it makes more sense to add it in accelerate, since this is not an issue on the diffusers side! I will do a PR for that.

@SunMarc do you have an idea on how we can do that?

You can use the following function to add the hooks to the transformers model: dispatch_model(model, device_map=device_map, force_hooks=True). I think this will also be better for handling the CPU offload case.
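In other words, something roughly like this on the already-loaded text encoder. A sketch; the subfolder and the target device index are just for illustration:

import torch
from accelerate import dispatch_model
from transformers import CLIPTextModel

# Load the text encoder as usual, then attach the accelerate hooks afterwards so
# its inputs are moved to the assigned device at inference time.
text_encoder = CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="text_encoder", torch_dtype=torch.float16
)
text_encoder = dispatch_model(text_encoder, device_map={"": 1}, force_hooks=True)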

What other changes would you like to see in the PR in terms of functionality? Do we need to tackle offloading separately now?

I think that we can add the following:

  • cpu offload: To enable that, we can map the modules that do not fit on the GPUs to "cpu" and use dispatch_model(model, device_map=device_map, force_hooks=True, main_device=0) if we want to offload the model and perform inference on device 0 thanks to the hooks. You could also use cpu_offload, which is the simplified version for CPU offload only: cpu_offload(model, execution_device=0). In diffusers, this corresponds to the enable_sequential_cpu_offload strategy. If we have enough space on the GPU (depending on how we design the device_map), we could use cpu_offload_with_hook instead (see the sketch after this list). I will let you decide what would be the best solution for diffusers!
  • Other device_map allocation strategies, such as a sequential one (compared to the balanced option that we have now, we fill the first GPU to the max before moving to the next one).
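As a rough illustration of the offloading utilities mentioned in the first item above, a sketch only; the choice of components and the execution device are assumptions:

import torch
from accelerate import cpu_offload, cpu_offload_with_hook
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", variant="fp16", torch_dtype=torch.float16, safety_checker=None
)

# Sequential offload: parameters stay on the CPU and are streamed to GPU 0
# module by module during the forward pass (lowest memory use, slowest).
cpu_offload(pipeline.unet, execution_device=torch.device("cuda:0"))

# Model-level offload: the whole text encoder is moved to GPU 0 for its forward
# pass and offloaded back to the CPU afterwards via the returned hook.
text_encoder, hook = cpu_offload_with_hook(pipeline.text_encoder, execution_device=torch.device("cuda:0"))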

And how should we approach the tests here?

Since the device_map strategy used here is very simple (we are not splitting the models) and accelerate + transformers already have a lot of tests checking that dispatch_model works, I think we should focus on the following tests:

  • Test that _assign_components_to_devices and _load_empty_model work as expected. We could also add slow tests for multi-GPU + cpu_offload (if these break, we would also have issues in accelerate/transformers); a rough sketch follows below:
  • Check that each model is on the right device and that they all have hooks.
  • Test inference to see if we get the same results as when the pipeline is loaded on one device only.
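A rough sketch of what the device-placement part of such a slow test could look like, assuming accelerate's _hf_hook attribute convention and the "balanced" device_map value from the testing script at the top:

import torch
from diffusers import DiffusionPipeline

def check_balanced_device_map():
    pipeline = DiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        variant="fp16",
        torch_dtype=torch.float16,
        device_map="balanced",
        safety_checker=None,
    )
    for name, component in pipeline.components.items():
        if not isinstance(component, torch.nn.Module):
            continue
        # Every model should expose its device map and carry an accelerate hook.
        assert getattr(component, "hf_device_map", None), name
        assert hasattr(component, "_hf_hook"), name
        print(name, component.hf_device_map)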

SunMarc avatar Feb 28 '24 16:02 SunMarc

Hi @sayakpaul, I have a question about device_map. Why don't you let users pass a device_map like device_map={'unet': 'cuda:0', 'vae': 'cuda:1'}? Will there be any issues with using a device_map like this? And by the way, I pulled the pipeline-device-map-auto branch and noticed that the controlnet is not contained in the device map.

zhangvia avatar Mar 01 '24 01:03 zhangvia

Hi @sayakpaul, I have a question about device_map. Why don't you let users pass a device_map like device_map={'unet': 'cuda:0', 'vae': 'cuda:1'}? Will there be any issues with using a device_map like this? And by the way, I pulled the pipeline-device-map-auto branch and noticed that the controlnet is not contained in the device map.

We will roll out features for this gradually.

sayakpaul avatar Mar 01 '24 01:03 sayakpaul

Hi @sayakpaul, I have a question about device_map. Why don't you let users pass a device_map like device_map={'unet': 'cuda:0', 'vae': 'cuda:1'}? Will there be any issues with using a device_map like this? And by the way, I pulled the pipeline-device-map-auto branch and noticed that the controlnet is not contained in the device map.

We will roll out features for this gradually.

Got it, thank you for your great work. Also, I used the pipeline-device-map-auto branch to test the controlnet img2img pipeline and upgraded transformers and accelerate to the newest versions, but I don't know why my server crashed and rebooted automatically. I'm using a k8s pod on a compute cluster, and it's not the pod that rebooted but the whole node server :cry:

zhangvia avatar Mar 01 '24 01:03 zhangvia

@SunMarc I have now run this on the DGX, and the results are all good. I have added dispatch_model() to handle force_hooks for the models coming from Transformers as well. LMK if that is how you had envisioned it.

Before I make changes for adding offloading support, want to discuss a few more things:

We could maybe provide three variants for each of the device mapping strategies we plan to support. Let's consider the current one: "balanced" (we're calling it "auto" for now, correct?), "balanced_low_memory", "balanced_ultra_low_memory", covering the offloading scenarios you mentioned here, respectively:

cpu offload: To enable that, we can map the modules that do not fit on the GPUs to "cpu" and use dispatch_model(model, device_map=device_map, force_hooks=True, main_device=0) if we want to offload the model and perform inference on device 0 thanks to the hooks. You could also use cpu_offload, which is the simplified version for CPU offload only: cpu_offload(model, execution_device=0). In diffusers, this corresponds to the enable_sequential_cpu_offload strategy. If we have enough space on the GPU (depending on how we design the device_map), we could use cpu_offload_with_hook instead. I will let you decide what would be the best solution for diffusers!

"balanced" will be the default case, is being done currently. "balanced_low_memory" will leverage cpu_offload_with_hook(). "balanced_ultra_low_memory" will leverage cpu_offload().

I believe with proper documentation, we can make it abundantly clear to the users which one they should use. WDYT?

I think we cannot delegate the offloading functionality when using device maps to enable_model_cpu_offload() or enable_sequential_cpu_offload(). It will introduce more complexity. WDYT?

Other device_map allocation strategies, such as a sequential one (compared to the balanced option that we have now, we fill the first GPU to the max before moving to the next one).

This could be done in a future PR. I would like to first gauge community interest with a simple and reasonable strategy. WDYT?

Thanks for suggesting the test suite. Will add it after we settle on the above.

sayakpaul avatar Mar 01 '24 10:03 sayakpaul

"balanced" will be the default case, is being done currently. "balanced_low_memory" will leverage cpu_offload_with_hook(). "balanced_ultra_low_memory" will leverage cpu_offload().

I believe with proper documentation, we can make it abundantly clear to the users which one they should use. WDYT?

I think we cannot delegate the offloading functionality when using device maps to enable_model_cpu_offload() or enable_sequential_cpu_offload(). It will introduce more complexity. WDYT?

Sounds good to me! I agree with your point on enable_model_cpu_offload() and enable_sequential_cpu_offload(). It will be easier to manage if we directly use cpu_offload_with_hook() and cpu_offload(). Could you describe the strategy for balanced_low_memory and balanced_ultra_low_memory? With the balanced default case, do you intend to also perform CPU offload if the models do not fit on the GPU (we do that in transformers), or do users need to use balanced_ultra_low_memory in that case?

This could be done in a future PR. I would like to first gauge community interest with a simple and reasonable strategy. WDYT?

Thanks for suggesting the test suite. Will add it after we settle on the above.

Yes, we can add additional strategies in future PRs if the community asks for it!

SunMarc avatar Mar 01 '24 16:03 SunMarc