StreamDiffusion
No MPS support right?
Just to be clear, this repo is for CUDA-enabled devices only, correct? On initial testing, MPS doesn't seem to work.
Yes, that is correct.
MPS is not supported. However, if we can further speed up the process using MPS, we will try it.
If you know anything about it, we would appreciate your advice.
In case someone is wondering where to start or wants to try the project on their Mac.
To run image to image or text to image from the readme example without acceleration:
pipe.enable_xformers_memory_efficient_attention() # <-- NADA, remove/comment this
and pipe the model to "mps":
pipe = StableDiffusionPipeline.from_pretrained("KBlueLeaf/kohaku-v2.1").to(
    device=torch.device("mps"),
    dtype=torch.float16,
)
I'm not sure about xformers (I'm not an expert), but check the issue, as it might not be needed.
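Instead of hard-coding "mps", the device could be picked at runtime. The following is only a sketch: `choose_device` and `detect_device` are hypothetical helper names, not part of StreamDiffusion, and the CUDA-then-MPS-then-CPU preference order is an assumption.

```python
def choose_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name: CUDA first, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch which backends are usable; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is absent on older PyTorch builds, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return choose_device(torch.cuda.is_available(), mps_ok)
```

The result could then be passed as `torch.device(detect_device())` in the `.to(...)` call above.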
Had to modify the StreamDiffusion class's __call__ method in the pipeline to conditionally run the CUDA event wrapping...
It is somewhere in .../StreamDiffusion/venv/lib/python3.xx/site-packages/streamdiffusion/pipeline.py if installed into a venv via pip install . from the repo root...
@torch.no_grad()
# condition hack event sync/track for non-cuda devices, RIP profiling etc
def __call__(
    self, x: Union[torch.Tensor, PIL.Image.Image, np.ndarray] = None
) -> torch.Tensor:
    if self.device == "cuda":
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
    if x is not None:
        x = self.image_processor.preprocess(x, self.height, self.width).to(
            device=self.device, dtype=self.dtype
        )
        if self.similar_image_filter:
            x = self.similar_filter(x)
            if x is None:
                time.sleep(self.inference_time_ema)
                return self.prev_image_result
        x_t_latent = self.encode_image(x)
    else:
        # TODO: check the dimension of x_t_latent
        x_t_latent = torch.randn((1, 4, self.latent_height, self.latent_width)).to(
            device=self.device, dtype=self.dtype
        )
    x_0_pred_out = self.predict_x0_batch(x_t_latent)
    x_output = self.decode_image(x_0_pred_out).detach().clone()
    self.prev_image_result = x_output
    if self.device == "cuda":
        end.record()
        torch.cuda.synchronize()
        inference_time = start.elapsed_time(end) / 1000
        self.inference_time_ema = 0.9 * self.inference_time_ema + 0.1 * inference_time
    return x_output
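The hack above simply skips timing on non-CUDA devices. A device-agnostic alternative might use a wall clock plus an optional synchronize callback. This is a sketch under assumptions: `timed_call` and `update_ema` are hypothetical helpers, not StreamDiffusion API. On CUDA one would pass `torch.cuda.synchronize` as `sync`; on MPS, `torch.mps.synchronize` may work if your PyTorch build provides it.

```python
import time
from typing import Any, Callable, Optional, Tuple


def timed_call(
    fn: Callable[[], Any],
    sync: Optional[Callable[[], None]] = None,
) -> Tuple[Any, float]:
    """Run fn and return (result, elapsed seconds) measured with a wall clock.

    If sync is given, it is called before the clock starts and again before
    it stops, so queued asynchronous GPU work is included in the measurement.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()
    return result, time.perf_counter() - start


def update_ema(previous: float, sample: float, weight: float = 0.1) -> float:
    """Same smoothing as the snippet above: 0.9 * previous + 0.1 * sample."""
    return (1.0 - weight) * previous + weight * sample
```

Wall-clock timing is coarser than CUDA events, but it keeps `inference_time_ema` updated on every device instead of leaving it frozen.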
@leezenn Thanks for the suggestion. How did you install the streamdiffusion library? I guess in installation Step 3, we need to remove [tensorrt], right? Do we need to do extra steps?
@ifsheldon I've installed it via pip install . (git+https://github.com/cumulo-autumn/StreamDiffusion.git@main#egg=streamdiffusion[tensorrt] didn't work, AFAIR).
Here are all the steps I performed at the project root; you can copy and execute this shell script (outdated, read to the very end first):
python -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install --pre torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cpu
pip install wheel
pip install xformers
pip install accelerate
pip install .
deactivate
Step 3, we need to remove [tensorrt] right?
I'm not sure, though. Do you? (UPD) Yeah, as it doesn't work. No Nvidia, RIP.
As for the demo server, I have changed it to sfast in the config at .../StreamDiffusion/demo/realtime-txt2img/server/config.py:
# ...
device: torch.device = torch.device("mps")
# ...
acceleration: Literal["none", "xformers", "tensorrt"] = "sfast"
# ...
and run:
source venv/bin/activate
cd demo/realtime-txt2img/
pip install -r requirements.txt
cd view && npm install && npm run build && cd ..
cd server && python main.py
deactivate
cd ../../../
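For the acceleration setting in config.py above, a small guard could fall back to "none" when a CUDA-only mode is requested on another device. This is a sketch only: `resolve_acceleration` is a hypothetical helper, and treating "xformers" and "tensorrt" as CUDA-only is an assumption based on this thread, not something checked against the libraries.

```python
# Assumption for this sketch: these modes need an NVIDIA GPU.
CUDA_ONLY_MODES = {"xformers", "tensorrt"}


def resolve_acceleration(device_type: str, requested: str) -> str:
    """Return the acceleration mode to use on the given device type.

    CUDA-only modes fall back to "none" on any other device; every other
    value (including ones this sketch does not know) passes through unchanged.
    """
    if requested in CUDA_ONLY_MODES and device_type != "cuda":
        return "none"
    return requested
```

With this, `resolve_acceleration("mps", "tensorrt")` yields "none", while a value such as "sfast" is returned as-is.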
Note:
~~I had to install xformers with wheel preinstalled (installation fails without it; check the installation steps above) in order for it to work. OR it was accelerate, I don't remember at this point. 😮‍💨~~
Never mind, I just:
- (Re)installed everything without xformers and (optionally) accelerate:
python -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install --pre torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cpu
# pip install wheel
# pip install xformers
# pip install accelerate
pip install .
deactivate
- Then made a conditional switch in my comment above. That's it.
For PyTorch, I used the Apple guide.
@leezenn Thanks a lot! I've successfully run it. But I wonder if you can run it with sfast? I don't know what it is. I cannot find it anywhere, in the code or on PyPI.
from sfast.compilers.stable_diffusion_pipeline_compiler import CompilationConfig, compile
this seems to import something from nowhere.
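One quick way to confirm whether a module such as sfast is actually installed, without importing it for real, is the standard library's `importlib.util.find_spec`. A minimal sketch (`is_importable` is a hypothetical helper name):

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the named module in this environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Dotted names raise ModuleNotFoundError when the parent package is missing.
        return False
```

Calling `is_importable("sfast")` inside the activated venv shows whether the import in the docstring could ever succeed there.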
@ifsheldon Sorry for the delay.
from sfast.compilers.stable_diffusion_pipeline_compiler import CompilationConfig, compile this seems to import something from nowhere.
I know right? :)
I'm glad you're confused too as I am.
I haven't dug into this much, but yeah, the repo is missing the compiler part. A quick search on GitHub gives me this from this repo. I didn't spend time investigating what it does. I just used the tip from the docstring that mentions it to try it out. It just silently runs (it may log at another level, I don't know), and then I saw an inconsistency with the docstrings. So... this is not well cooked (yet?). I just left it be. I wasn't particularly patient with it, I'm sorry.
The project seems promising, though. ❤️
@leezenn @ifsheldon I'm not sure if I set everything up correctly, but I at least got it working after following your conversation.
I was wondering what kind of speed you are getting from this? Running the txt2img demo, for me takes around 5-10 seconds till it starts producing images, then it shoots out images every 1-2 seconds and then again takes approx 10 seconds after new input.
I'm on an M3 Pro with 36 GB. Expecting real-time generation will just stay a faraway dream, I guess?
@odonald
for me takes around 5-10 seconds till it starts producing images
Most likely due to the so-called warmup runs.
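To separate warmup cost from steady-state speed, one could discard the first few runs before timing. This is a generic sketch, not the demo's own code: `benchmark` and its `warmup`/`runs` parameters are hypothetical, and for GPU work the `step` callable would also need to synchronize the device so that asynchronous work is counted.

```python
import time
from typing import Callable, List


def benchmark(step: Callable[[], object], warmup: int = 3, runs: int = 10) -> List[float]:
    """Call step() warmup times untimed, then return per-call timings in seconds."""
    for _ in range(warmup):
        step()  # results and timings of warmup runs are discarded
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        step()
        timings.append(time.perf_counter() - start)
    return timings
```

Comparing the average of the returned timings with the delay before the first image gives a rough idea of how much of the initial 5-10 seconds is warmup.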
then it shoots out images every 1-2 seconds and then again takes approx 10 seconds after new input
Similar effect on my M1 Pro. As far as I remember, it was running on the GPU but had some problems with uRAM (unified memory), possibly even a memory leak. So it started to hit SSD swap and crawl instead of run... I didn't investigate any further after a couple of runs, so don't take my word for it; these are surface-level assumptions.
Don't think your dream is far away, though. I saw some projects that were redesigned for the Apple Silicon machines, somewhere along the lines of ggerganov et al. with their special tools. Something like this project...
So you can try to adapt the current project using it, or wait until someone else does. If the maintainers keep improving this project, I believe somebody will eventually add proper MPS support, unless there is a better alternative.
@leezenn did you PR the hack? Tbh works perfectly on my M1 Pro, no performance decrease or aforementioned issues of lag/delay.
I've created this gist to help guide the setup and running the demos.
@leezenn did you PR the hack? Tbh works perfectly on my M1 Pro, no performance decrease or aforementioned issues of lag/delay.
I did not. Maybe someone else did.