MeloTTS

Can't fine-tune a model on my dataset in Google Colab

Open yukiarimo opened this issue 10 months ago • 7 comments

🐛 Describe the bug

I ran the following commands:

git clone https://github.com/myshell-ai/MeloTTS.git
cd MeloTTS
pip install -e .
python -m unidic download
cd melo
python preprocess_text.py --metadata all.list
bash train.sh config.json 1
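
Before kicking off train.sh, it is worth confirming that preprocess_text.py actually produced non-empty split files, since malformed metadata lines get skipped silently. A quick sanity check (the train.list / val.list names are an assumption based on the usual Bert-VITS2-style defaults; adjust to whatever paths the script actually reported):

# Sanity-check the split files written by preprocess_text.py.
# NOTE: "train.list" / "val.list" are assumed default names; adjust
# to match the files the script wrote on your machine.
from pathlib import Path

for name in ("train.list", "val.list"):
    p = Path(name)
    if not p.exists():
        print(f"{name}: missing")
        continue
    lines = [ln for ln in p.read_text(encoding="utf-8").splitlines() if ln.strip()]
    # Raw metadata lines have 4 pipe-separated fields; cleaned lines add more.
    bad = sum(1 for ln in lines if ln.count("|") < 3)
    print(f"{name}: {len(lines)} lines, {bad} with fewer than 4 fields")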

Example lines from my all.list:

wavs/29.wav|EN-default|EN|Well, she looks exactly like the one I read about in the book, except she isn't violent at all. Hahaha.
wavs/15.wav|EN-default|EN|It's kind of a rare monster, it's incredibly ferocious!
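
(Each line is wav path|speaker|language|text.) To rule out data problems, here is a small check that every referenced wav exists, printing each clip's duration; a sketch that assumes plain PCM WAV files readable by the stdlib wave module:

# Verify each wav referenced in all.list exists, and print its duration.
# Assumes standard PCM WAV files that the stdlib wave module can read.
import wave
from pathlib import Path

with open("all.list", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        wav_path = line.split("|", 1)[0]
        if not Path(wav_path).exists():
            print(f"missing: {wav_path}")
            continue
        with wave.open(wav_path) as w:
            duration = w.getnframes() / w.getframerate()
        print(f"{wav_path}: {duration:.2f}s")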

Log after running train.sh:

...
0it [00:00, ?it/s]
("0it [00:00, ?it/s]" repeated 18 times)
2024-04-28 17:58:31.390776: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-28 17:58:31.390837: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-28 17:58:31.392462: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-28 17:58:32.734241: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-04-28 17:58:33.501 | INFO     | data_utils:_filter:64 - Init dataset...
100%|█████████████████| 96/96 [00:00<00:00, 21500.06it/s]
2024-04-28 17:58:33.507 | INFO     | data_utils:_filter:84 - min: 1870; max: 1871
2024-04-28 17:58:33.507 | INFO     | data_utils:_filter:85 - skipped: 9, total: 96
Bucket warning  
buckets: []
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
2024-04-28 17:58:33.508 | INFO     | data_utils:_filter:64 - Init dataset...
100%|███████████████████| 4/4 [00:00<00:00, 20763.88it/s]
2024-04-28 17:58:33.509 | INFO     | data_utils:_filter:84 - min: 1870; max: 1871
2024-04-28 17:58:33.509 | INFO     | data_utils:_filter:85 - skipped: 0, total: 4
Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
(torch.Size([10, 192]), torch.Size([8, 192]))
(torch.Size([256, 256]), torch.Size([1, 256]))
list index out of range
0it [00:00, ?it/s]/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
0it [00:00, ?it/s]
/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
0it [00:00, ?it/s]
("0it [00:00, ?it/s]" repeated 21 times)
...
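
My best guess at what the log is saying: buckets: [] means the bucket sampler (MeloTTS trains with a Bert-VITS2-style DistributedBucketSampler) assigned no samples to any bucket, so every epoch iterates over an empty loader (the 0it lines), and the list index out of range presumably comes from indexing into those empty buckets. A minimal sketch of that bucketing logic, with illustrative boundary values rather than MeloTTS's actual config:

# Sketch of Bert-VITS2-style bucketing: each sample goes into the bucket
# whose (low, high] length range contains it; lengths outside every range
# are silently dropped. Boundaries here are illustrative, not MeloTTS's config.
import bisect

boundaries = [32, 300, 400, 500, 600, 700, 800, 900, 1000]
lengths = [1870, 1871]  # spec lengths like the ones _filter reported

buckets = [[] for _ in range(len(boundaries) - 1)]
for idx, length in enumerate(lengths):
    i = bisect.bisect_right(boundaries, length) - 1
    if 0 <= i < len(buckets):
        buckets[i].append(idx)

print(buckets)  # all buckets empty -> the sampler yields nothing -> 0it

If that reading is right, every clip's computed length falls above the largest boundary. The _filter output (min: 1870, max: 1871 across 96 different clips) also looks suspicious on its own: near-identical lengths for every clip would point at the length computation rather than the audio. The DataLoader worker warning should be unrelated; lowering num_workers (16 in the log) to the suggested 8, wherever it is set, just silences it.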

How do I fix this? Am I doing something wrong?

Versions

Collecting environment information...

Model name: Intel(R) Xeon(R) CPU @ 2.00GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 3
BogoMIPS: 4000.35
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 128 KiB (4 instances)
L1i cache: 128 KiB (4 instances)
L2 cache: 4 MiB (4 instances)
L3 cache: 38.5 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable; SMT Host state unknown
Vulnerability Meltdown: Vulnerable
Vulnerability Mmio stale data: Vulnerable
Vulnerability Retbleed: Vulnerable
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Vulnerable

Versions of relevant libraries:
[pip3] numpy==1.25.2
[pip3] torch==1.13.1
[pip3] torchaudio==0.13.1
[pip3] torchdata==0.7.1
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.17.1
[pip3] torchvision==0.17.1+cu121
[pip3] triton==2.2.0
[conda] Could not collect

yukiarimo avatar Apr 28 '24 18:04 yukiarimo

Same problem, but on Ubuntu.

28065467 avatar May 19 '24 17:05 28065467

Same problem inside a Docker image.

farconada avatar May 30 '24 21:05 farconada

It was a few days ago, but I believe I had the same issue on Windows 11 with Python 3.10. I was able to run inference with the pretrained weights, but after spending hours tweaking the training script I was only able to do 1 iteration in torchrun, and basically nothing happened. Would love to be able to fine-tune...

s-tweed avatar Jun 04 '24 04:06 s-tweed

Same issue on Ubuntu, Python 3.9.

olgakuak avatar Jun 25 '24 14:06 olgakuak

@s-tweed Any updates?

yukiarimo avatar Jul 01 '24 14:07 yukiarimo

Also hitting a similar issue using the Docker image.

rgxb2807 avatar Sep 06 '24 16:09 rgxb2807