so-vits-svc-fork
Error when running multiple CUDA GPUs
Hello, I've been trying to run svc with 4 A6000 GPUs and am currently receiving the error below. When I check what is running on port 6006, it is svc train -t. Training does run if only 1 GPU is used.
INFO: Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
[18:49:33] INFO [18:49:33] Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4 distributed.py:244
[18:49:37] INFO [18:49:37] Created a temporary directory at /tmp/tmpwx1t3esk instantiator.py:21
INFO [18:49:37] Writing /tmp/tmpwx1t3esk/_remote_module_non_scriptable.py instantiator.py:76
[18:49:37] INFO [18:49:37] Created a temporary directory at /tmp/tmpd24vrta_ instantiator.py:21
INFO [18:49:37] Writing /tmp/tmpd24vrta_/_remote_module_non_scriptable.py instantiator.py:76
[18:49:37] INFO [18:49:37] Created a temporary directory at /tmp/tmpphd252jb instantiator.py:21
INFO [18:49:37] Writing /tmp/tmpphd252jb/_remote_module_non_scriptable.py instantiator.py:76
INFO [18:49:37] Server binary (from Python package v0.7.0): server_ingester.py:290
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
INFO [18:49:37] Server binary (from Python package v0.7.0): server_ingester.py:290
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
[18:49:38] INFO [18:49:38] Server binary (from Python package v0.7.0): server_ingester.py:290
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /home/venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
Address already in use
Port 6006 is in use by another program. Either identify and stop that program, or start the server with a different port.
Address already in use
Port 6006 is in use by another program. Either identify and stop that program, or start the server with a different port.
Address already in use
Port 6006 is in use by another program. Either identify and stop that program, or start the server with a different port.
I was able to fix this issue by running svc train instead of svc train -t. It appears that svc train -t was trying to start the tensorboard_data_server once per GPU, using the same port every time.
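For anyone wondering what a guard against this could look like, here is a minimal sketch of starting a side service from only one of the per-GPU processes. It assumes the launcher exports `LOCAL_RANK` to each worker (torchrun does this); `is_rank_zero` and `maybe_start_tensorboard` are hypothetical helper names, not functions from so-vits-svc-fork, and this is not a description of how the project actually handles `-t`.

```python
import os


def is_rank_zero(env=None):
    """Return True when this process is the first worker on the machine.

    Assumption: the launcher sets LOCAL_RANK for every worker. If the
    variable is missing we treat the process as rank 0 (single-GPU case).
    """
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", "0")) == 0


def maybe_start_tensorboard(start_fn, env=None):
    """Call start_fn() only on rank 0 so port 6006 is bound once, not once per GPU.

    start_fn is any callable that launches TensorBoard (hypothetical here).
    Returns True if start_fn was called, False if it was skipped.
    """
    if is_rank_zero(env):
        start_fn()
        return True
    return False
```

With four workers, only the one with `LOCAL_RANK=0` would call `start_fn`, so the other three would never try to bind the same port.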
I'm running into the same error using four 24GB A10G... With only one A10G it worked just fine... However, with four of them I'm getting a similar error when executing "svc train -t" (see image)
I tried executing "svc train" but it didn't help....
Did you do anything else different?
Thanks a lot!
Please reopen this. This is not fixed, and without TensorBoard a lot of monitoring functionality is lost.
You can just call tensorboard manually
tensorboard --logdir logs/44k
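If port 6006 is still held by a leftover process, the manual command can be pointed at another port with TensorBoard's `--port` flag. Below is a rough sketch that probes for a free port and builds the command line. The helper names (`port_is_free`, `first_free_port`, `tensorboard_command`) are made up for illustration, and the `logs/44k` default is just the path from the command above.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if we can bind to (host, port), i.e. nothing else holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True


def first_free_port(start=6006, attempts=20):
    """Scan upward from `start` and return the first port that can be bound."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    raise RuntimeError(f"no free port in {start}..{start + attempts - 1}")


def tensorboard_command(logdir="logs/44k", start_port=6006):
    """Build the argv list for a manual TensorBoard launch on a free port."""
    port = first_free_port(start_port)
    return ["tensorboard", "--logdir", logdir, "--port", str(port)]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(tensorboard_command()))
```

Note that another process could still grab the port between the check and the actual launch, so treat this as a convenience rather than a guarantee.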