so-vits-svc-fork

Error when running multiple CUDA GPUs

Open rvega20 opened this issue 1 year ago • 4 comments

Hello, I've been trying to run svc with 4 A6000 GPUs and am currently receiving the error below. When I check what is running on port 6006, it is svc train -t. I am also able to get it to run if only 1 GPU is used.

INFO: Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
[18:49:33] INFO [18:49:33] Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4 distributed.py:244
[18:49:37] INFO [18:49:37] Created a temporary directory at /tmp/tmpwx1t3esk instantiator.py:21
INFO [18:49:37] Writing /tmp/tmpwx1t3esk/_remote_module_non_scriptable.py instantiator.py:76
[18:49:37] INFO [18:49:37] Created a temporary directory at /tmp/tmpd24vrta_ instantiator.py:21
INFO [18:49:37] Writing /tmp/tmpd24vrta_/_remote_module_non_scriptable.py instantiator.py:76
[18:49:37] INFO [18:49:37] Created a temporary directory at /tmp/tmpphd252jb instantiator.py:21
INFO [18:49:37] Writing /tmp/tmpphd252jb/_remote_module_non_scriptable.py instantiator.py:76
INFO [18:49:37] Server binary (from Python package v0.7.0): server_ingester.py:290 /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
INFO [18:49:37] Server binary (from Python package v0.7.0): server_ingester.py:290 /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
[18:49:38] INFO [18:49:38] Server binary (from Python package v0.7.0): server_ingester.py:290 /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
Address already in use
Port 6006 is in use by another program. Either identify and stop that program, or start the server with a different port.
Address already in use
Port 6006 is in use by another program. Either identify and stop that program, or start the server with a different port.
Address already in use
Port 6006 is in use by another program. Either identify and stop that program, or start the server with a different port.

rvega20, May 10 '23 19:05

I was able to fix this issue by running svc train (without -t). It appears that svc train -t was trying to start the tensorboard_data_server once for each GPU, using the same port every time.
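The diagnosis above (one TensorBoard server per GPU process, all on port 6006) suggests a guard that lets only the rank-0 process start the server. The following is a minimal sketch of that idea, not the project's actual code: the helper names `global_rank` and `should_start_tensorboard` are hypothetical, and it assumes the launcher exports `RANK` or `LOCAL_RANK` to each worker process, which this thread does not confirm.

```python
import os


def global_rank(env=None):
    """Best-effort rank of this process; 0 if no launcher variable is set.

    Assumption: the distributed launcher exports RANK (global) or
    LOCAL_RANK (per node). LOCAL_RANK only equals the global rank on a
    single machine, which matches the single-node setups in this thread.
    """
    env = os.environ if env is None else env
    for name in ("RANK", "LOCAL_RANK"):
        if name in env:
            return int(env[name])
    return 0


def should_start_tensorboard(requested, env=None):
    """True only when TensorBoard was requested and this is the rank-0 process."""
    return bool(requested) and global_rank(env) == 0


if __name__ == "__main__":
    # With 4 workers, only the process whose rank is 0 prints True here.
    print(should_start_tensorboard(True))
```

With a guard like this, the three non-zero ranks would skip the launch, so nothing else competes for port 6006.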

rvega20, May 10 '23 20:05

I'm running into the same error using four 24GB A10G GPUs. With only one A10G it worked just fine. However, with four of them I'm getting a similar error when executing "svc train -t" (see image).

[Image: Bildschirmfoto 2023-05-12 um 10 11 58]

I tried executing "svc train" but it didn't help.

Did you do anything else differently?

Thanks a lot!

TipsTricksMore, May 12 '23 08:05

Please reopen this. This is not fixed, and without TensorBoard a lot of monitoring functionality is lost.

dogtopus, Nov 29 '23 04:11

You can just call TensorBoard manually:

tensorboard --logdir logs/44k

34j, Dec 04 '23 09:12
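If port 6006 is still held by an earlier process when TensorBoard is started by hand, the manual command can be pointed at another port with TensorBoard's `--port` flag. Below is a rough sketch of choosing one automatically. The helpers `port_is_free` and `tensorboard_command` are hypothetical names, the check assumes TensorBoard binds to the IPv4 loopback address, and the `logs/44k` directory is taken from the command in this thread.

```python
import socket
import subprocess


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except (OSError, OverflowError):
            return False
    return True


def tensorboard_command(logdir="logs/44k", default_port=6006, max_tries=10):
    """Build a tensorboard command using the first free port at or after default_port."""
    for port in range(default_port, default_port + max_tries):
        if port_is_free(port):
            return ["tensorboard", "--logdir", logdir, "--port", str(port)]
    raise RuntimeError("no free port found in the scanned range")


if __name__ == "__main__":
    cmd = tensorboard_command()
    print("Running:", " ".join(cmd))
    # Blocks until TensorBoard is stopped (Ctrl+C); requires tensorboard on PATH.
    subprocess.run(cmd, check=False)
```

There is a small window between the check and the launch in which another program could take the port, so this is a convenience rather than a guarantee.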