Error during v2 training with multiple GPUs

Open sugarkwork opened this issue 1 year ago • 1 comments

Training on a multi-GPU machine stopped with an error.

The training settings were as follows.

Model version: v2
Target sampling rate: 40k
f0 Model: Yes
Using phone embedder: contentvec
Embedding channels: 768
Embedding output layer: 12
GPU ID: 0, 1
Number of CPU processes: 8
Normalize audio volume when preprocess: Yes
Pitch extraction algorithm: harvest
Batch size: 14
Number of epochs: 40
Save every epoch: 10
Cache batch: Yes
FP16: Yes

The following error occurred.

2023-05-23 11:39:50 | INFO | torch.nn.parallel.distributed | Reducer buckets have been rebuilt in this iteration.
2023-05-23 11:39:50 | INFO | torch.nn.parallel.distributed | Reducer buckets have been rebuilt in this iteration.
  0%|▏ | 1/440 [00:23<2:53:03, 23.65s/it, epoch=1]
  0%|▏ | 1/440 [00:23<2:53:29, 23.71s/it, epoch=1]
Traceback (most recent call last):
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\gradio\routes.py", line 412, in run_predict
    output = await app.get_blocks().process_api(
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\gradio\blocks.py", line 1299, in process_api
    result = await self.call_function(
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\gradio\blocks.py", line 1035, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\gradio\utils.py", line 491, in async_iteration
    return next(iterator)
  File "F:\ai\vc\rvc-webui\modules\tabs\training.py", line 221, in train_all
    train_model(
  File "F:\ai\vc\rvc-webui\lib\rvc\train.py", line 264, in train_model
    mp.spawn(
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 197, in start_processes
    while not context.join():
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 69, in _wrap
    fn(i, *args)
  File "F:\ai\vc\rvc-webui\lib\rvc\train.py", line 664, in training_runner
    loss_mel = F.l1_loss(y_mel, y_hat_mel) * config.train.c_mel
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\torch\nn\functional.py", line 3264, in l1_loss
    return torch._C._nn.l1_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu!
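The traceback above only shows that, in process 1, one of `y_mel` and `y_hat_mel` was on `cuda:1` while the other was still on `cpu`. It does not show where the tensor was left behind. As a minimal sketch of a defensive guard, assuming the batch is a nested structure of tensors, a small helper can move everything to the rank's device before the loss is computed. The name `move_to_device` is hypothetical and not part of rvc-webui. It relies only on objects exposing a `.to(device)` method, which `torch.Tensor` does.

```python
from typing import Any


def move_to_device(obj: Any, device: Any) -> Any:
    """Recursively move tensor-like objects in nested containers to `device`.

    Hypothetical helper: anything with a callable `.to` attribute (for example
    a torch.Tensor) is moved; dicts, lists and tuples are rebuilt with their
    elements moved; every other value is returned unchanged.
    """
    if callable(getattr(obj, "to", None)):
        return obj.to(device)
    if isinstance(obj, dict):
        return {key: move_to_device(value, device) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Rebuild the same container type (list or plain tuple).
        return type(obj)(move_to_device(value, device) for value in obj)
    return obj
```

In a training loop this might be called as `y_mel, y_hat_mel = move_to_device((y_mel, y_hat_mel), f"cuda:{rank}")`, where `rank` is assumed to be the per-process GPU index. This would hide the symptom, not explain why one tensor stayed on the CPU.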

The machine has an AMD Ryzen 5800X and two GPUs. Using both GPUs (0, 1) produces the error, but using only one (0) lets training complete without errors.

Also, using only GPU 1 (1) produces a different error.

100%|██████████| 423/423 [00:00<00:00, 16923.16it/s]
GPU 1 is not available
Traceback (most recent call last):
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\gradio\routes.py", line 412, in run_predict
    output = await app.get_blocks().process_api(
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\gradio\blocks.py", line 1299, in process_api
    result = await self.call_function(
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\gradio\blocks.py", line 1035, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\gradio\utils.py", line 491, in async_iteration
    return next(iterator)
  File "F:\ai\vc\rvc-webui\modules\tabs\training.py", line 210, in train_all
    create_dataset_meta(training_dir, f0)
  File "F:\ai\vc\rvc-webui\lib\rvc\train.py", line 112, in create_dataset_meta
    names = set(list_data(gt_wavs_dir)) & set(list_data(co256_dir))
  File "F:\ai\vc\rvc-webui\lib\rvc\train.py", line 106, in list_data
    for subdir in os.listdir(dir):
  File "F:\ai\vc\rvc-webui\webui.py", line 10, in listdir4mac
    return [file for file in _list_dir(path) if not file.startswith(".")]
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'F:\ai\vc\rvc-webui\models\training\models\test_v2_40k_cont_768_12_harv_14_30\3_feature256'
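In the log above, "GPU 1 is not available" is printed first, and the run only fails later when `create_dataset_meta` lists a feature directory that does not exist. One plausible reading, which the log does not confirm, is that an earlier step was skipped for the unavailable GPU and never wrote its output. A fail-fast preflight check would surface both problems before training starts. The sketch below is hypothetical: `preflight_problems` is not an rvc-webui function, and `device_count` is assumed to be supplied by the caller, for example from `torch.cuda.device_count()`.

```python
import os
from typing import List, Sequence


def preflight_problems(
    gpu_ids: Sequence[int],
    device_count: int,
    required_dirs: Sequence[str],
) -> List[str]:
    """Collect readable problems up front instead of failing deep in training.

    Hypothetical helper: returns an empty list when every requested GPU index
    is valid and every required directory exists.
    """
    problems: List[str] = []
    for gpu_id in gpu_ids:
        # A usable device index lies in the range [0, device_count).
        if not 0 <= gpu_id < device_count:
            problems.append(f"GPU {gpu_id} is not available")
    for path in required_dirs:
        if not os.path.isdir(path):
            problems.append(f"missing directory: {path}")
    return problems
```

A caller could show the returned list in the UI and stop when it is non-empty, which would replace the `FileNotFoundError` with a message that points at the earlier failure.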

Also, once an error had occurred, setting the GPU to 0 only still produced yet another error. After restarting the webui, I was able to train with GPU 0 only.

100%|██████████| 423/423 [00:14<00:00, 29.44it/s]
train_all: emb_name: contentvec
Traceback (most recent call last):
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\gradio\routes.py", line 412, in run_predict
    output = await app.get_blocks().process_api(
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\gradio\blocks.py", line 1299, in process_api
    result = await self.call_function(
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\gradio\blocks.py", line 1035, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\gradio\utils.py", line 491, in async_iteration
    return next(iterator)
  File "F:\ai\vc\rvc-webui\modules\tabs\training.py", line 221, in train_all
    train_model(
  File "F:\ai\vc\rvc-webui\lib\rvc\train.py", line 243, in train_model
    training_runner(
  File "F:\ai\vc\rvc-webui\lib\rvc\train.py", line 342, in training_runner
    dist.init_process_group(
  File "F:\ai\vc\rvc-webui\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 853, in init_process_group
    raise RuntimeError("trying to initialize the default process group " "twice!")
RuntimeError: trying to initialize the default process group twice!
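In this traceback `training_runner` is called directly from `train_model` and not through `mp.spawn`, so `dist.init_process_group` runs inside the webui process itself. If an earlier run in the same process initialised the default process group and then failed without tearing it down, the next run would hit exactly this error, which would also fit the observation that restarting the webui clears it. That is an inference from the log, not something the source confirms. A minimal sketch of a guard, assuming `dist` is the `torch.distributed` module (which provides `is_initialized`, `destroy_process_group` and `init_process_group`); the wrapper name `init_process_group_once` is hypothetical:

```python
from typing import Any


def init_process_group_once(dist: Any, **init_kwargs: Any) -> bool:
    """Initialise the default process group, destroying a stale one first.

    `dist` is expected to be `torch.distributed` (passed in so the helper has
    no import-time dependency). `init_kwargs` are forwarded unchanged to
    `dist.init_process_group`. Returns True if a stale group was destroyed.
    """
    stale = bool(dist.is_initialized())
    if stale:
        # A previous run left its group behind; tear it down before re-init.
        dist.destroy_process_group()
    dist.init_process_group(**init_kwargs)
    return stale
```

The keyword arguments would be whatever `train.py` already passes to `dist.init_process_group`. Calling `dist.destroy_process_group()` in a `finally` block around the training loop would be the complementary fix, so that a failed run cleans up after itself.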

sugarkwork, May 23 '23 03:05