
Zipformer recipe for ReazonSpeech

Open Triplecq opened this issue 1 year ago • 6 comments

ReazonSpeech is an open-source dataset that contains a diverse set of natural Japanese speech, collected from terrestrial television streams. It contains more than 35,000 hours of audio.

The dataset is available on Hugging Face. For more details, please visit:

  • Dataset: https://huggingface.co/datasets/reazon-research/reazonspeech
  • Paper: https://research.reazon.jp/_static/reazonspeech_nlp2023.pdf

Triplecq avatar May 02 '24 00:05 Triplecq

There are quite a few changes not in the directory you are adding. You might want to remove those, as they are potential barriers to merging it. If there's anything outside that directory you believe we should change, it can be a separate PR.

danpovey avatar May 02 '24 05:05 danpovey

> There are quite a few changes not in the directory you are adding. You might want to remove those, as they are potential barriers to merging it. If there's anything outside that directory you believe we should change, it can be a separate PR.

Thanks for your quick feedback during the holiday! I will remove unrelated changes and get back to you soon.

Triplecq avatar May 02 '24 08:05 Triplecq

I've already removed those unrelated changes. It's ready for review now. Please let me know if you have any questions or comments. Thank you!

Triplecq avatar May 02 '24 10:05 Triplecq

I noticed that you have a `lhotse prepare reazonspeech` command in the data prep. Do you intend to submit a PR to Lhotse as well?

pzelasko avatar May 02 '24 13:05 pzelasko

> I noticed that you have a `lhotse prepare reazonspeech` command in the data prep. Do you intend to submit a PR to Lhotse as well?

Thanks for the note. Sure, we're cleaning up the scripts and will submit a PR to Lhotse soon. :)

Triplecq avatar May 02 '24 16:05 Triplecq

I just submitted a PR to Lhotse as well: https://github.com/lhotse-speech/lhotse/pull/1330. Both PRs are ready for review. Thank you!

Triplecq avatar May 02 '24 16:05 Triplecq

Hi, may I ask a few questions?

What are the main differences in quality and coverage between the small, medium, large, and all sets? Which configuration (large, all, or small+medium+large) yields the best performance?

Thanks for your assistance.

yfyeung avatar Jun 07 '24 07:06 yfyeung

Hi @yfyeung

Thank you for your interest and questions.

As far as I know, the various partitions differ only in their size and hours, as listed in the table on the Hugging Face page. (@fujimotos san, could you please confirm this or correct me if I am wrong? Thank you!)

Here is a comparison of different partitions:

| Model Name | Model Size | In-Distribution CER | JSUT CER | CommonVoice CER | TEDx CER |
|---|---|---|---|---|---|
| zipformer-L (medium) | 155.92 M | 10.31 | 16.52 | 12.8 | 28.8 |
| zipformer-L (large) | 157.24 M | 6.19 | 10.35 | 9.36 | 24.23 |
| zipformer-L (all) | 159.34 M | 4.2 (epoch 39 avg 7) | 6.62 (epoch 39 avg 2) | 7.76 (epoch 39 avg 2) | 17.81 (epoch 39 avg 10) |
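The "(epoch 39 avg 7)" notation refers to checkpoint averaging at decoding time: the model used for evaluation is an average of the last N checkpoints up to the given epoch. As a simplified illustration only (plain state-dict averaging; icefall's `--avg` / `--use-averaged-model` options implement a more involved scheme), the idea looks like this:

```python
def average_checkpoints(state_dicts):
    """Average a list of model state dicts parameter-by-parameter.

    Each state dict maps parameter names to plain floats here;
    with real PyTorch checkpoints these would be tensors.
    """
    n = len(state_dicts)
    return {
        name: sum(sd[name] for sd in state_dicts) / n
        for name in state_dicts[0]
    }

# Toy example: "epoch 39 avg 2" would average the checkpoints
# saved at epochs 38 and 39.
ckpt_38 = {"encoder.weight": 1.0, "decoder.weight": 3.0}
ckpt_39 = {"encoder.weight": 2.0, "decoder.weight": 5.0}
averaged = average_checkpoints([ckpt_38, ckpt_39])
print(averaged)  # {'encoder.weight': 1.5, 'decoder.weight': 4.0}
```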

P.S. With this larger zipformer configuration, we suggest using more than 300 hours of data. We have not tried combining small + medium + large, but I assume performance is basically determined by the number of hours of data you have.

I hope this helps. Feel free to let me know if you have any other questions. Good luck and have fun with this recipe. :)
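For reference, the CER numbers in the table above are character error rates: the character-level edit (Levenshtein) distance between reference and hypothesis, divided by the reference length. A minimal sketch of the metric (a hypothetical helper for illustration, not the scoring script icefall uses):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance between the
    reference and hypothesis, divided by the reference length."""
    m, n = len(ref), len(hyp)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            substitute = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            cur[j] = min(prev[j] + 1,      # deletion
                         cur[j - 1] + 1,   # insertion
                         substitute)
        prev = cur
    return prev[n] / max(m, 1)

print(cer("こんにちは", "こんにちわ"))  # 0.2: one substitution over five characters
```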

Triplecq avatar Jun 07 '24 23:06 Triplecq

> What are the main differences in quality and coverage between the small, medium, large, and all sets?

The only difference is their dataset sizes. Check out this table:

| Name | Size | Hours |
|---|---|---|
| tiny | 600MB | 8.5 |
| small | 6GB | 100 |
| medium | 65GB | 1000 |
| large | 330GB | 5000 |
| all | 2.3TB | 35000 |

> Which configuration (large, all, or small+medium+large) yields the best performance?

Use `all` for the best performance. The other splits (`tiny`/`small`/`medium`/`large`) are subsets of the `all` set.

Note: In case there is some confusion, the relationship of those sets is:

tiny ⊆ small ⊆ medium ⊆ large ⊆ all
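Because the splits are nested, picking one programmatically reduces to choosing the smallest split that covers the amount of data you want. A small sketch (split hours copied from the table above; the helper name is hypothetical):

```python
# Hours of audio per split, from the table above. The splits are
# nested: tiny ⊆ small ⊆ medium ⊆ large ⊆ all.
SPLITS = [
    ("tiny", 8.5),
    ("small", 100),
    ("medium", 1000),
    ("large", 5000),
    ("all", 35000),
]

def smallest_split(min_hours: float) -> str:
    """Return the smallest split containing at least `min_hours` of audio."""
    for name, hours in SPLITS:
        if hours >= min_hours:
            return name
    return "all"

print(smallest_split(300))  # 'medium': the first split with >= 300 hours
```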

fujimotos avatar Jun 08 '24 04:06 fujimotos

Please upload links to pre-trained models in a separate PR.

csukuangfj avatar Jun 13 '24 06:06 csukuangfj

@Triplecq @fujimotos Is the model ready to share?

> Please upload links to pre-trained models in a separate PR.

yujinqiu avatar Jun 13 '24 09:06 yujinqiu

@yujinqiu Thanks for your patience! We just completed another validation test on JSUT-book before the release. I will submit a separate PR and update you once we release the model.

Triplecq avatar Jun 14 '24 14:06 Triplecq

This may be the world's number one Japanese speech recognition model. If you could create a medium-sized streaming version of the model, it would be number one in the universe!

yuyun2000 avatar Jun 18 '24 09:06 yuyun2000

Hi @Triplecq, I was wondering if the model is available for sharing on HF? Thanks!

sangeet2020 avatar Oct 07 '24 10:10 sangeet2020

He has shared the weights; you can find the Japanese model in the documentation.

yuyun2000 avatar Oct 08 '24 01:10 yuyun2000

@Triplecq

Could you also open-source the PyTorch checkpoints for https://huggingface.co/reazon-research/reazonspeech-k2-v2/tree/main?

Currently, there are only ONNX model files, without any .pt files (e.g., pretrained.pt).

csukuangfj avatar Mar 25 '25 08:03 csukuangfj