
[bug]: DPM Samplers Seem To Never Converge On MPS?

Open ThatParticularSteveGraham opened this issue 3 years ago • 7 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

OS

macOS

GPU

mps

VRAM

16GB

What happened?

Running the prompt 'banana sushi' with -S42 using the k_dpm_2 and k_dpm_2_a samplers at 17, 27, and 37 steps suggests that SD will never converge on a particular image.

k_dpm_2 at 17 steps 000069 c8e40498 42

k_dpm_2 at 27 steps 000070 92a53b5e 42

k_dpm_2 at 37 steps 000071 f07743d8 42

Screenshots

No response

Additional context

Seeing this on my system since the 2.0 release, including the 2.1 release candidate I cloned yesterday on 11/1. Anyone else? I can reproduce this with as many as 70 steps (making my poor M1 Pro MacBook very melty). k_heun sampler does not exhibit this behavior, BTW.

Contact Details

No response

Thanks for the report. I’ll try to reproduce the error and suggest a fix.

lstein avatar Nov 03 '22 21:11 lstein

+1 I see the same behavior.

victorca25 avatar Nov 05 '22 10:11 victorca25

@lstein Is this possibly related to the recently-discovered regression in reproducibility?

psychedelicious avatar Nov 23 '22 00:11 psychedelicious

A datapoint - I took the Automatic1111 distro for a brief spin, and DPM2 and 2a behave the same there, so I wonder if the issue has more to do with dependencies (PyTorch?) + my hardware than InvokeAI (i.e., a me problem).

I'm pretty sure I've been hearing about this for a long time, maybe since the beginning of the k_diffuser import. Someone with an MPS system should check out an old version and see if it was there.

Lincoln


lstein avatar Nov 30 '22 19:11 lstein

I think I follow. Do you mean check out an old version of InvokeAI from before you started using k_diffuser? Or check out an older version of k_diffuser? Sorry, machine learning isn't my background and this is all quite... exotic to me 😳

I looked into this and I have a fix that restores the four k_dpm* samplers currently broken on Apple Silicon. The fix is simple, but I think input from the project maintainers is needed first, which is why I am not sending a PR yet.

The root cause: PyTorch 1.12.1 has a bug with indexing on the 'mps' device when the indexed value is then passed through a math function.

>>> torch.tensor([1., 5.], device='mps').log()
tensor([0.0000, 1.6094], device='mps:0')
>>> torch.tensor([1., 5.], device='mps')[1].log()
tensor(0., device='mps:0')  # should be 1.6094

This bug is fixed in PyTorch 1.13, but that version has a more serious issue that makes InvokeAI completely unusable for me (I suspect memory management: constant swapping). Related: https://github.com/pytorch/pytorch/issues/89784

Back to 1.12.1: the bug affects InvokeAI through 'log' on a single 'sigma': https://github.com/Birch-san/k-diffusion/blob/mps/k_diffusion/sampling.py#L590 The sigmas array is a one-dimensional MPS tensor, and thus suffers from the PyTorch 1.12.1 bug.

My suggested solution: move this tensor to 'cpu'. It is small, and more importantly its dimension ('time') is unrelated to any dimension of the actual image tensors at play: there is no real reason for it to be on the GPU (or even to be a tensor, for that matter). For example, add .to('cpu') here: https://github.com/invoke-ai/InvokeAI/blob/main/ldm/models/diffusion/ksampler.py#L196
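The idea above can be sketched roughly as follows. This is a hypothetical illustration, not InvokeAI's actual API: `make_sigmas` and its toy schedule are invented names standing in for the sampler's real schedule-building code. The point is only the `.to('cpu')` at the end, which keeps the schedule off the MPS device so indexing a single sigma and calling `.log()` never hits the 1.12.1 bug.

```python
import torch

def make_sigmas(n_steps: int) -> torch.Tensor:
    # A toy geometric noise schedule standing in for the sampler's real one.
    sigmas = torch.logspace(1, -2, n_steps)
    # Proposed change: keep the schedule on 'cpu' regardless of the compute
    # device; its 'time' dimension is independent of the image tensors, and
    # each sampler step only consumes one scalar from it.
    return sigmas.to('cpu')

sigmas = make_sigmas(10)
# Safe: CPU indexing followed by log is unaffected by the MPS bug.
print(sigmas[1].log())
```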

The complication: the 'to_d' function used in several samplers treats a scalar 'sigma' value as a tensor, which causes an error for mixed GPU/CPU tensor computation: https://github.com/Birch-san/k-diffusion/blob/mps/k_diffusion/sampling.py#L48 (I don't understand why the dimension-adding before division is needed; can someone please elucidate?) The workaround there is to move the single sigma back to 'mps' by adding .to('mps').
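A minimal sketch of that workaround, assuming the shapes involved; `append_dims` mirrors the k-diffusion utility of the same name, and the `sigma.to(x.device)` line is the proposed device fix (written generically here rather than hard-coding 'mps'):

```python
import torch

def append_dims(t: torch.Tensor, target_dims: int) -> torch.Tensor:
    # Add trailing singleton dims so t broadcasts against a target tensor.
    return t[(...,) + (None,) * (target_dims - t.ndim)]

def to_d(x: torch.Tensor, sigma: torch.Tensor, denoised: torch.Tensor) -> torch.Tensor:
    # Workaround: if sigma now lives on the CPU while x is on 'mps',
    # move it to x's device before dividing to avoid a mixed-device error.
    sigma = sigma.to(x.device)
    return (x - denoised) / append_dims(sigma, x.ndim)
```

As for why the dimension-adding is needed: dividing by a 0-d scalar would broadcast fine on its own, but presumably the code also supports a per-batch sigma of shape (batch,), and that 1-d tensor cannot broadcast against a (batch, C, H, W) image without trailing singleton dims.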

(PS: There's already an MPS workaround in 'to_d', very possibly related: https://github.com/Birch-san/k-diffusion/blob/mps/k_diffusion/utils.py#L46 )

Hope this helps!

mebelz avatar Dec 14 '22 07:12 mebelz