Voice-Cloning-App
TypeError: forward() missing 1 required positional argument: 'inputs' when training
I'm trying to run training and getting the following error:
Exception in thread Thread-19:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/ssd/github/BenAAndrew/Voice-Cloning-App/application/utils.py", line 66, in background_task
raise e
File "/ssd/github/BenAAndrew/Voice-Cloning-App/application/utils.py", line 62, in background_task
func(logging=logger, **kwargs)
File "/ssd/github/BenAAndrew/Voice-Cloning-App/training/train.py", line 219, in train
y, y_pred = process_batch(batch, model)
File "/ssd/github/BenAAndrew/Voice-Cloning-App/training/tacotron2_model/utils.py", line 88, in process_batch
y_pred = model(batch, mask_size=output_length_size, alignment_mask_size=input_length_size)
File "/ssd/github/BenAAndrew/Voice-Cloning-App/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/ssd/github/BenAAndrew/Voice-Cloning-App/venv/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/ssd/github/BenAAndrew/Voice-Cloning-App/venv/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/ssd/github/BenAAndrew/Voice-Cloning-App/venv/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/ssd/github/BenAAndrew/Voice-Cloning-App/venv/lib/python3.7/site-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in replica 6 on device 6.
Original Traceback (most recent call last):
File "/ssd/github/BenAAndrew/Voice-Cloning-App/venv/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/ssd/github/BenAAndrew/Voice-Cloning-App/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'inputs'
A quick Google search suggests this might be related to https://github.com/pytorch/pytorch/issues/31460?
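If I'm reading that issue right, DataParallel copies non-tensor keyword arguments (like mask_size here) to every device, but it can only scatter the tensor inputs into as many chunks as the batch has rows; when a batch has fewer rows than there are replicas, the trailing replicas get an empty positional tuple and fail exactly like this. A minimal sketch of my understanding (my own reconstruction, not code from this repo; needs at least 2 GPUs to reproduce):

import torch
import torch.nn as nn

class Toy(nn.Module):
    def forward(self, inputs, mask_size=None):
        # Same signature shape as the Tacotron2 model: one positional
        # tensor argument plus a non-tensor keyword argument.
        return inputs.sum()

if torch.cuda.device_count() >= 2:
    model = nn.DataParallel(Toy().cuda())
    model(torch.ones(8, 4).cuda(), mask_size=10)  # fine: 8 rows scatter across replicas
    model(torch.ones(1, 4).cuda(), mask_size=10)  # TypeError: forward() missing 'inputs'
    # The 1-row batch scatters into a single chunk, but the int kwarg is
    # copied to every device, so the extra replicas get no positional args.

If that's the mechanism, a final epoch batch with fewer than 8 examples would explain why only replica 6 (and presumably 7) blows up.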
I had to reduce the batch size to 128; otherwise the first GPU runs out of memory with the dreaded "CUDA out of memory" error.
This machine has 8 x V100 16GB NVIDIA GPUs in it (I've excluded the T1000 using CUDA_VISIBLE_DEVICES; a sketch of the masking follows the table). See below:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB           On  | 00000000:0D:00.0 Off |                    0 |
| N/A   40C    P0               56W / 300W|  16147MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-16GB           On  | 00000000:0E:00.0 Off |                    0 |
| N/A   38C    P0               57W / 300W|  12529MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-16GB           On  | 00000000:14:00.0 Off |                    0 |
| N/A   36C    P0               57W / 300W|  11233MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-16GB           On  | 00000000:15:00.0 Off |                    0 |
| N/A   39C    P0               57W / 300W|  10655MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA T1000 8GB               On  | 00000000:82:00.0 Off |                  N/A |
| 35%   33C    P8               N/A /  50W|      6MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2-16GB           On  | 00000000:8B:00.0 Off |                    0 |
| N/A   37C    P0               40W / 300W|      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2-16GB           On  | 00000000:8C:00.0 Off |                    0 |
| N/A   34C    P0               41W / 300W|      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2-16GB           On  | 00000000:8F:00.0 Off |                    0 |
| N/A   36C    P0               38W / 300W|      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   8  Tesla V100-SXM2-16GB           On  | 00000000:90:00.0 Off |                    0 |
| N/A   37C    P0               40W / 300W|      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
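For reference, the T1000 masking is along these lines (a sketch, set before torch is imported; CUDA_DEVICE_ORDER is my assumption to keep CUDA's numbering in sync with the nvidia-smi indices above, where the T1000 is device 4):

import os

# Number devices in PCI bus order so the indices match nvidia-smi
# (an assumption; the default "fastest first" ordering can differ).
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# Hide the T1000 (device 4 above); must be set before CUDA is initialised.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,5,6,7,8"

import torch
print(torch.cuda.device_count())  # 8 -> only the V100s are visible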
I tried limiting the number of GPUs to 4 with a batch size of 64, and it seems to work. Output below:
Mon May 29 02:17:19 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB           On  | 00000000:0D:00.0 Off |                    0 |
| N/A   43C    P0               75W / 300W|  11675MiB / 16384MiB |     18%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-16GB           On  | 00000000:0E:00.0 Off |                    0 |
| N/A   40C    P0               74W / 300W|   7899MiB / 16384MiB |     17%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-16GB           On  | 00000000:14:00.0 Off |                    0 |
| N/A   38C    P0               72W / 300W|   7263MiB / 16384MiB |     17%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-16GB           On  | 00000000:15:00.0 Off |                    0 |
| N/A   42C    P0               75W / 300W|   6923MiB / 16384MiB |     15%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA T1000 8GB               On  | 00000000:82:00.0 Off |                  N/A |
| 35%   32C    P8               N/A /  50W|      6MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2-16GB           On  | 00000000:8B:00.0 Off |                    0 |
| N/A   37C    P0               40W / 300W|      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2-16GB           On  | 00000000:8C:00.0 Off |                    0 |
| N/A   34C    P0               41W / 300W|      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2-16GB           On  | 00000000:8F:00.0 Off |                    0 |
| N/A   36C    P0               38W / 300W|      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   8  Tesla V100-SXM2-16GB           On  | 00000000:90:00.0 Off |                    0 |
| N/A   37C    P0               40W / 300W|      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1428813      C   python                                    11668MiB |
|    1   N/A  N/A   1428813      C   python                                     7892MiB |
|    2   N/A  N/A   1428813      C   python                                     7256MiB |
|    3   N/A  N/A   1428813      C   python                                     6916MiB |
+---------------------------------------------------------------------------------------+
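If my read of the scatter behaviour above is right, another option might be to keep all 8 GPUs and just drop the ragged final batch, so every batch divides evenly across the replicas. A hypothetical sketch (the TensorDataset is a placeholder for the app's real training dataset; I haven't checked this against the repo's DataLoader setup):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 80))  # placeholder for the real dataset

# drop_last discards the remainder batch, so no DataParallel replica is
# ever left without a slice of the input tensors.
loader = DataLoader(dataset, batch_size=128, shuffle=True, drop_last=True)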
I'd appreciate any advice on getting this working.