Steve Korshakov

Results: 169 comments by Steve Korshakov

Actually, I have almost finished training the NAR model, and it works really well for in-domain samples. You can also download pre-converted datasets using my dataset tool.

My librilight-preprocessed dataset is my naive attempt to transcribe it, but it is a failed one: there are too many errors, and networks trained on it turned out to have too much...

They should be exactly the same; all my work is reproducible! So it is up to you.

I have finished the training and published the results. The network follows the speaker much better than Voicebox, but it is still not as good as it should be for out-of-domain speakers.

This is a zero-shot voice cloning network. There is nothing to train here; you just need a clean 3-5 second sample with its text.

I have opted for BigVSAN. I was really impressed by its quality: I wasn't able to spot any difference between synthesized and real audio on my datasets. I have published...

I am training on quite a small dataset: libritts-r + vctk. They contain only high-quality voices, but I want to try some pre-training on a much bigger one to...

Changing to 16x16 head dimensions reduces the gap to 10x, but it is still very slow.

@tridao Thank you for catching that. After the fix it is still 4x slower than flash_attn_func:
xformers (mask) 0.00034342713200021533
xformers (no mask) 0.0013367030000081285
torch (mask) 0.0034441131959902123
torch (no mask) 0.0013596494959783741...

@zhangjun Thanks! But I am running this code in a notebook, and repeating the cell execution yields similar results.