Option to run live preview in a CUDA stream for better performance.
This adds a new CLI flag, `--preview-stream`, which attempts to use a dedicated CUDA stream to run whatever live preview method is in use and to transfer the preview image back to the host. Running the preview model in parallel helps slightly by covering tail effects and gaps between kernel executions on the main thread, but the main advantage is that a separate stream is the only safe way to transfer tensors from device to host asynchronously, which is exactly what a live preview wants. With the flag enabled on a device that supports it, live previews have essentially no measurable performance impact.
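The pattern looks roughly like the following sketch (assumed names, not the PR's actual code): enqueue the device-to-host copy on a side stream into pinned host memory, and record an event marking when the host buffer becomes valid.

```python
import torch

preview_stream = torch.cuda.Stream()  # dedicated side stream for previews

def begin_preview_copy(preview_gpu: torch.Tensor):
    # Order the side stream after whatever produced the preview tensor.
    ready = torch.cuda.Event()
    ready.record(torch.cuda.current_stream())
    preview_stream.wait_event(ready)
    # Tell the caching allocator this tensor is in use on the side stream,
    # so its memory isn't reused before the copy completes.
    preview_gpu.record_stream(preview_stream)
    with torch.cuda.stream(preview_stream):
        # non_blocking copies are only truly asynchronous into pinned memory.
        host = torch.empty(preview_gpu.shape, dtype=preview_gpu.dtype,
                           device="cpu", pin_memory=True)
        host.copy_(preview_gpu, non_blocking=True)
    done = torch.cuda.Event()
    done.record(preview_stream)
    return host, done  # host is not valid until `done` has completed
```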
I also corrected a minor oversight in the `preview_to_image` function: it was not rounding before casting to uint8. PyTorch's default behavior is to truncate floats, which effectively makes everything half a shade darker than it should be; notably, you'll almost never see 255 show up. The correct thing for image processing is to round to the nearest integer, and that's fixed now. Other than that, nothing behaves differently unless you enable the flag.
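For illustration, the difference on a few made-up sample values:

```python
import torch

x = torch.tensor([254.9, 127.5, 0.6])
print(x.to(torch.uint8))          # tensor([254, 127, 0]) - truncation, half a shade dark
print(x.round().to(torch.uint8))  # tensor([255, 128, 1]) - round to nearest, correct
```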
If the flag is enabled, the live preview update runs on a daemon thread, because we need a separate thread so that execution on the default stream can carry on after we move the tensor. Probably not the cleanest way to do it, but I have tested it and it works. To avoid issues, no thread is spawned unless we're actually trying to use a CUDA stream.
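The hand-off is roughly the following (hedged sketch reusing `begin_preview_copy` from above; names are illustrative): the worker thread blocks until the copy finishes, which stalls only that thread while the main thread keeps enqueueing kernels on the default stream.

```python
import threading

def update_preview_async(preview_gpu, callback):
    host, done = begin_preview_copy(preview_gpu)

    def worker():
        done.synchronize()  # wait for the D2H copy off the main thread
        callback(host)      # safe: the copy into the host buffer is complete

    threading.Thread(target=worker, daemon=True).start()
```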
Nsight Systems report from before the change (the preview is on a separate stream here, but not on its own thread, so it doesn't make a difference yet; this trace also includes other changes that remove blocking calls, which I haven't finished and PR'd yet):
After breaking it off into a thread (the timings differ because this run used a different batch size, but that doesn't change the general sequence and overlap): note that the memcpy is no longer able to block the main thread, and areas where both streams overlap show more consistent 100% GPU utilization.
I'm currently finding that this breaks interrupting because of the extra thread; I'll have to fix that before this is merged.
What do you think of this: https://github.com/comfyanonymous/ComfyUI/pull/6124
There are still a few issues I have to chase down, but I can already tell that code isn't safe, because it has no synchronization point. It does a very good job of delaying the use of the CPU tensor, and I may be able to adapt some of that, but the author's reports of occasional incorrect results are a dead giveaway that it's malfunctioning the same way my earlier attempts at this did. GPU-to-CPU transfers always need a sync to be safe, and doing the transfer on a separate thread and stream is pretty much the only way to get that without blocking the default stream at some point.
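A minimal, hypothetical example of the hazard (not the linked PR's code):

```python
import torch

gpu = torch.rand(3, 512, 512, device="cuda")
host = torch.empty(gpu.shape, dtype=gpu.dtype, device="cpu", pin_memory=True)

host.copy_(gpu, non_blocking=True)
# UNSAFE here: the copy may still be in flight, so reading `host` can yield
# partially written or stale data -- the "occasional incorrect results".

# Safe: insert a synchronization point before touching the CPU tensor.
done = torch.cuda.Event()
done.record(torch.cuda.current_stream())
done.synchronize()  # blocks the calling thread; do it on a helper thread
safe_copy = host.clone()
```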