VRAM is not freed on errors

Open · rofoto opened this issue 10 months ago · 3 comments

When using musicgen the process completes and all files are created, but I have blocked outbound network traffic, which causes an error when (I assume) musicgen tries to send out telemetry. If you enable multi-band diffusion after this, there is a chance that there will not be enough VRAM for it.

I'm not sure this is intended behavior, since cleanup seems to be delayed between runs.

This error does not affect the outputs, but it puts the GPU in a state where VRAM is not freed, forcing a restart.

This is not the only error that leaves the GPU in this state. It appears that pretty much any error, including but not limited to torch.cuda.OutOfMemoryError and errors when trying to download models, leaves VRAM allocated.
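
To illustrate the failure mode described above (a hypothetical sketch, not the webui's actual code): if loaded weights sit in a module-level cache and the generation step raises, nothing drops the reference, so the allocation stays on the GPU for the life of the process.

    import torch

    _cache = {}  # hypothetical module-level cache; it outlives any exception below

    def load_weights():
        # Stand-in for loading real model weights: grab ~1 GiB of VRAM.
        return torch.empty(256 * 1024 * 1024, dtype=torch.float32, device="cuda")

    def generate():
        _cache.setdefault("model", load_weights())  # weights now pinned in VRAM
        raise RuntimeError("simulated failure")     # any error during generation

    try:
        generate()
    except RuntimeError:
        pass

    # The exception was handled, but _cache still references the weights,
    # so the ~1 GiB stays allocated until the process restarts.
    print(torch.cuda.memory_allocated() / 2**20, "MiB still allocated")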

rofoto · Apr 10 '24 23:04

Thanks for the tip; the models should definitely free memory on failure. I think some of them would free it on next load, but that's not ideal.
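
Roughly the kind of cleanup being discussed (a minimal sketch; load_model and generate here are placeholder callables, not this repo's actual API):

    import gc
    import torch

    def run_with_cleanup(load_model, generate, **params):
        """Run one generation and release VRAM even if loading or generation fails."""
        model = None
        try:
            model = load_model()              # may raise (download / network errors)
            return generate(model, **params)  # may raise torch.cuda.OutOfMemoryError
        finally:
            # Drop the reference and return cached blocks to the driver so a
            # failed run does not keep VRAM pinned until the next load.
            del model
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()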

As for the issue itself, do you see any error messages in the console? I don't think musicgen should be doing telemetry, and I have disabled Gradio telemetry. (By the way, I have almost no idea what people are actually using.)

rsxdalv · Apr 11 '24 01:04

This is the only other error I am seeing:

"tts-6.0_webui\installer_files\env\lib\asyncio\proactor_events.py", line 165, in _call_connection_lost
    self._sock.shutdown(socket.SHUT_RDWR)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host. 

That's why I thought telemetry was the potential issue.

Regarding "I think some of them would free it on next load, but that's not ideal": after looking, it appears that in some situations the VRAM is freed on the next run, but ideally it would be cleared at the end of generation, just in case.
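
For what it's worth, one way to check this between runs (a sketch assuming PyTorch with CUDA; these counters only cover this process's allocations, not other apps'):

    import torch

    def report_vram(tag: str) -> None:
        # Bytes held by live tensors vs. bytes kept by PyTorch's caching allocator.
        allocated = torch.cuda.memory_allocated() / 2**20
        reserved = torch.cuda.memory_reserved() / 2**20
        print(f"[{tag}] allocated: {allocated:.0f} MiB, reserved: {reserved:.0f} MiB")

    # e.g. compare report_vram("after failed run") with report_vram("after next run")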

rofoto · Apr 12 '24 05:04

That error is because you still have a frontend and a server; it generally comes from Gradio. Although it's not impossible that this could come from telemetry at some point, it's a different situation.

rsxdalv · Apr 12 '24 06:04