Piotr Żelasko
I'm on it - launched Canary-1B training with this, will keep you posted.
Unfortunately the Canary-1B training on 32 GPUs with bf16 AMP is diverging after ~10 hours with this SDPA implementation. I launched a second run with a different seed to confirm...
Uh, sorry for the lack of responsiveness lately. I used bf16-mixed and PyTorch 2.4.0 with CUDA 12.5. I agree we should merge it disabled by default. Let's take a look...
I'll look into that.
That used to be a constraint but at some point we dropped it. I may have missed that validation still checks for this. If you could make a PR to...
Roughly speaking, compute scales ~linearly with batch size but ~quadratically with sequence length (due to self-attention).
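A minimal back-of-the-envelope sketch of that scaling, assuming standard single-head self-attention (the function name and the 2× factor for QKᵀ plus AV are illustrative, not from any actual profiler):

```python
def attention_flops(batch: int, seq_len: int, dim: int) -> int:
    """Rough FLOP count for the attention score and value matmuls:
    QK^T costs batch * seq_len^2 * dim, and AV costs the same again."""
    return 2 * batch * seq_len * seq_len * dim

base = attention_flops(1, 1000, 512)
print(attention_flops(2, 1000, 512) / base)  # doubling batch -> 2x cost
print(attention_flops(1, 2000, 512) / base)  # doubling seq len -> 4x cost
```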
I never trained on utterances longer than ~40s so I can only share my suspicions. I’d expect with longer examples it may be more difficult for the model to find...
I'd probably go with your strategy of fixed 30s chunks in both training and inference - consistent and simple.
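For illustration, the fixed-chunk strategy could be sketched like this (a hypothetical helper, not lhotse API; sample rate and chunk length are assumptions):

```python
def chunk_offsets(num_samples: int, sample_rate: int = 16000,
                  chunk_seconds: float = 30.0) -> list[tuple[int, int]]:
    """Yield (start, end) sample offsets for fixed-length chunks.
    The last chunk may be shorter than chunk_seconds."""
    chunk = int(chunk_seconds * sample_rate)
    return [(start, min(start + chunk, num_samples))
            for start in range(0, num_samples, chunk)]

# A 70s utterance at 16 kHz splits into 30s + 30s + 10s chunks:
print(chunk_offsets(70 * 16000))
```

The appeal is that the model only ever sees one context length, so train/test conditions match and there is no quadratic blow-up on long inputs.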
The only thing I can think of is that smart_open is now not used for local paths. I restored this functionality in PR https://github.com/lhotse-speech/lhotse/pull/1360 - can you try with that?
One other person has reported this error, but I don't know yet where it's coming from. I have a suspicion though. It's a long shot, but can you...