StreamDiffusion
Demo img2img webcam browser
Demo Notes
- Frontend in Svelte
- Backend: FastAPI / WebSocket / MJPEG stream
- Wrapper is copied and modified here to accept a prompt for img2img and an engine_dir, allowing me to specify the directory and reuse the compiled model in the Docker environment.
All the StreamDiffusion code is in img2img.py. Please feel free to add any speedup suggestions.
I'm using t_index_list=[35, 45]. Is there a way to provide a strength on a 0-1 scale?
We will add a noise scheduler function to support parametric controls such as a noise strength of 0.0-1.0 and num_denoising_steps of 1-50. We will also review the PR soon!
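As a point of reference, a 0-1 strength could be mapped onto t_index_list roughly like this (a minimal sketch; the helper name and the linear mapping are assumptions, not part of StreamDiffusion, and it assumes the timestep list is ordered from most to least noisy over 50 inference steps):

```python
def strength_to_t_index(strength: float, num_inference_steps: int = 50) -> int:
    # Hypothetical mapping: strength=1.0 -> index 0 (start from pure noise),
    # small strength -> a late index (stay close to the input image).
    strength = min(max(strength, 0.0), 1.0)
    return min(round((1.0 - strength) * num_inference_steps), num_inference_steps - 1)

# With this mapping, strengths of roughly 0.3 and 0.1 give t_index_list=[35, 45]
t_index_list = [strength_to_t_index(0.3), strength_to_t_index(0.1)]
```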
Hi @GradientSurfer, thanks for the detailed response! I really appreciate the feedback, and I'm happy to address some of your points on this PR. If you're interested in collaborating, please send edits and commits. Do you have PR edit access?
Looks cool, nice work @radames! I've been tinkering with a strikingly similar set of changes, but using a canvas with drawing tools instead of webcam input.
If you and the team don't mind unsolicited feedback, I'll leave a review and share a few suggestions/thoughts that I hope you find helpful:
- Batch inference: It appears image frames are processed one at a time in this demo, but batching multiple frames together for higher throughput (and FPS) should result in a smoother experience (at the expense of increased latency).
Addressing number 2 here, we can try a batching approach!
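To make the batching point concrete, a minimal sketch (preprocess and stream are placeholder names for the demo's image-to-tensor helper and pipeline, not its real API) would stack the queued frames and run one denoising pass:

```python
import torch

def infer_batch(frames, preprocess, stream):
    # frames: the webcam frames collected since the last inference call
    batch = torch.stack([preprocess(f) for f in frames])  # (N, C, H, W)
    with torch.no_grad():
        outputs = stream(batch)  # one forward pass for the whole batch
    return list(outputs)  # generated frames, in input order
```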
- Circular buffer & continuous streaming: It looks like the server requests the client to send a frame. Instead of this request/response cycle, the client could continuously stream image frames to the server, which would maintain a circular buffer that can then be used to perform batch inference. Notably, examples/screen/main.py uses that approach.
Ohh yes, that makes a lot of sense. In my original demo with LCM, I did use an async queue(), but at the time inference was slow and the result was a lagged video, so I decided to switch to the ping/pong approach.
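The circular-buffer pattern on the server side could look roughly like this (sketch only; the buffer size, batch size, and function names are assumptions, and the actual inference call is elided):

```python
import asyncio
from collections import deque

FRAME_BUFFER = deque(maxlen=16)  # circular: old frames are dropped automatically

async def receive_frames(websocket):
    # The client streams frames continuously; we only append, never block the client.
    while True:
        FRAME_BUFFER.append(await websocket.receive_bytes())

async def inference_loop(batch_size: int = 4):
    while True:
        if len(FRAME_BUFFER) >= batch_size:
            frames = [FRAME_BUFFER.popleft() for _ in range(batch_size)]
            # ...run batched img2img on `frames` and broadcast the results...
        else:
            await asyncio.sleep(0.005)  # yield until enough frames have arrived
```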
- No separate endpoint/stream for returning generated images: The generated image could be returned to the client via the websocket connection, instead of via a separate API endpoint. This could be a minor code simplification, and notably would sidestep the linked chromium bug (so we could avoid sending frames twice to every browser that isn't Firefox).

Yes, that's a great point. The MJPEG stream seems a bit awkward and is buggy on Chrome. Ideally it would be WebRTC, but I was aiming for performance and simplicity. Open-socket JPEG streaming looked faster to me than sending blobs over the websocket, which needs extra decoding to get the bytes into the <img>. However, this demo seems very fast and it's doing blobs over websockets -> <img>: https://www.fal.ai/camera
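For comparison, returning the generated frame over the same websocket could look roughly like the following FastAPI sketch (run_img2img is a placeholder, not the demo's actual function):

```python
import io
from fastapi import FastAPI, WebSocket

app = FastAPI()

def run_img2img(frame_bytes: bytes):
    # Placeholder for the StreamDiffusion img2img call; should return a PIL.Image.
    raise NotImplementedError

@app.websocket("/ws")
async def ws_endpoint(websocket: WebSocket):
    await websocket.accept()
    while True:
        frame_bytes = await websocket.receive_bytes()  # webcam frame from the client
        image = run_img2img(frame_bytes)
        buf = io.BytesIO()
        image.save(buf, format="JPEG")
        await websocket.send_bytes(buf.getvalue())     # client draws the blob; no MJPEG endpoint needed
```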
- Return raw pixels in RGBA format: Generated images can be returned to the client in raw RGBA pixel format and then written directly to the canvas. This may be a relatively minor optimization, but it avoids the overhead of transforming to and from JPEG format and any associated lossy compression.
Yes, you're right; however, I use the canvas to normalize the webcam image, cropping it to the desired size. This could be done on the backend instead, whichever is faster.
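If the raw RGBA route is taken, the server-side conversion is small (sketch only, using Pillow); the client could then wrap the received bytes in an ImageData and put it on the canvas without any JPEG decode:

```python
from PIL import Image

def to_rgba_bytes(image: Image.Image) -> bytes:
    # Raw, uncompressed pixels: width * height * 4 bytes, row-major RGBA order.
    return image.convert("RGBA").tobytes()
```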
- Integrate wrapper.py modifications: The modifications to accept a prompt for img2img and to support engine_dir are great and seem well contained; those ought to be integrated into the canonical wrapper.py so there is no unnecessary duplication of code or maintenance burden.

Done in PR #66, and when it's merged I can update it here.
Perhaps these ideas could be addressed here or in future PRs (or not at all); either way, I'd be happy to discuss or collaborate further on the details - feel free to reach out.
@radames I do not have PR edit access here, @cumulo-autumn perhaps you would consider granting collaborator access?
Hi @cumulo-autumn, I think it's good now. I've fixed a couple of uncaught exceptions. One important note: while the server and the client were designed to accept multiple queued connections, the wrapper and StreamDiffusionWrapper are not working well in that regard, i.e. the buffer used by stream.stream is shared, so if you open multiple browser tabs and switch the prompt and webcams, you'll notice images leaking across tabs. For instance, when using the diffusers pipe(...) it's possible to queue calls, as long as the inference is quick; example here.

ps. please pull and test again on Windows if you can
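One possible direction for isolating clients (purely a sketch of the idea, not how the wrapper currently works) would be per-connection result queues keyed by a connection id, so frames generated for one tab are never delivered to another:

```python
import asyncio
import uuid

# Map connection id -> queue of results destined for that client only.
RESULT_QUEUES: dict[str, asyncio.Queue] = {}

def register_connection() -> str:
    user_id = uuid.uuid4().hex
    RESULT_QUEUES[user_id] = asyncio.Queue()
    return user_id

async def deliver(user_id: str, frame: bytes):
    # Only the owning connection's queue receives its generated frame.
    await RESULT_QUEUES[user_id].put(frame)
```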
Hi @GradientSurfer . Thank you for your valuable PR submissions in the past, and for your many meaningful suggestions this time as well! Regarding PR edit access, currently, we are keeping it within a group of acquaintances, so please allow us to hold off on adding new PR edit access for now. However, we are very much open to more discussions and PRs in the future, so we definitely want to continue doing those! (I apologize for the late response this time, as it has been a busy end-of-year period. Also, I really appreciate your prompt and valuable feedback on this PR.) We will consider our policy on adding new PR edit access in the future!
Hi @radames. Thank you for the update! It works perfectly in my environment too! I am going to merge it.