Kadir Nar

229 comments by Kadir Nar

> [@kadirnar](https://github.com/kadirnar) is this related to multi-gpu support? I don't want multi-GPU support to only work in FP8 or INT4. We can already perform full-finetuning of many models in fp8...

> As others mentioned you can get multi gpu working with accelerate. I posted how I got it working with 5090s here: > > https://github.com/thad0ctor/unsloth-5090-multiple Does it support H100 or...

@thad0ctor Thank you very much for your work. Multi-GPU support is awesome. Have you tried training large models? And is there a difference in speed? Is it really doing parallel...
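For context, a minimal `accelerate` config for plain data-parallel training over 8 GPUs might look roughly like this (the values here are assumptions for illustration, not taken from the linked repo):

```yaml
# default_config.yaml, consumed by `accelerate launch` -- hypothetical values
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
machine_rank: 0
num_processes: 8        # one process per GPU
mixed_precision: bf16
```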

> ![Image](https://github.com/user-attachments/assets/a4c23af2-336e-4a46-bfba-267699b66812) You should change the loss values in the config file. https://github.com/Respaired/Tsukasa-Speech/issues/6#issuecomment-2758477322

Did you do this? https://github.com/yl4579/StyleTTS2/pull/253 Yesterday I ran it on 8xA100 GPUs with batch_size=16 and got an out-of-memory error. The max_len was high, though. Still, I think something's wrong.

> [@kadirnar](https://github.com/kadirnar) No, I didn't load a pretrained model. You mean your total batch_size=16 (for 8 gpus) or each gpu has batch_size=16 (total batch size = 16*8)? I set the...
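The two readings of `batch_size=16` being debated here differ by a factor of the GPU count. A tiny sketch of the arithmetic (function name is mine, not from any of the repos discussed):

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum: int = 1) -> int:
    # Total samples contributing to one optimizer step across all
    # data-parallel workers (hypothetical helper for illustration).
    return per_gpu_batch * num_gpus * grad_accum

# batch_size=16 read as a total, split across 8 GPUs:
print(effective_batch_size(2, 8))   # 16
# batch_size=16 read as per-GPU, on 8 GPUs:
print(effective_batch_size(16, 8))  # 128
```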

@zaidato I got the same error as you when I set the batch-size to 8 😆

> You got an error at epoch 50. I think it's because you set TMA_epoch: 50 # TMA starting epoch (1st stage). You need to decrease batch size to fix...

@zaidato I managed to train using this repo. There was only a bug with context_length. I fixed that by updating the mel_dataset. If there are successful results after training, I...
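The kind of context_length guard meant here is along these lines (a hypothetical sketch with assumed names; StyleTTS2's actual mel_dataset code differs):

```python
import numpy as np

def clip_to_max_len(mel: np.ndarray, max_len: int) -> np.ndarray:
    """Clip a mel spectrogram of shape (n_mels, frames) to at most max_len frames.

    Hypothetical sketch of a guard that keeps sample lengths within the
    configured context length by taking a random crop of long utterances.
    """
    if mel.shape[1] > max_len:
        start = np.random.randint(0, mel.shape[1] - max_len + 1)
        mel = mel[:, start:start + max_len]
    return mel
```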

> [@kadirnar](https://github.com/kadirnar) What does context length mean? In your repo, you set batch_size: 64 and max_len: 560. How can you increase these values without getting out of memory? The transcript...
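One standard way to raise the effective batch size without running out of memory, which this thread keeps circling, is gradient accumulation: run several small micro-batches before each optimizer step. A generic PyTorch-style sketch (names and loss are assumptions, not StyleTTS2's training loop):

```python
import torch

def train_step(model, optimizer, batches, accum_steps):
    """Accumulate gradients over `accum_steps` micro-batches so the
    effective batch size grows without increasing peak activation memory.
    Generic sketch for illustration; not the StyleTTS2 API."""
    optimizer.zero_grad()
    for i, (x, y) in enumerate(batches):
        loss = torch.nn.functional.mse_loss(model(x), y)
        # Scale so the accumulated gradient matches one large batch.
        (loss / accum_steps).backward()
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```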