Upcoming releases and support

Open snakers4 opened this issue 6 years ago • 23 comments

We are planning some cool new releases soon (with a twist you are not expecting)!

You can also now support our initiative directly via Open Collective.

snakers4 avatar Sep 03 '19 15:09 snakers4

Hi, @snakers4. I wrote a small Python program that converts data from the unpacked datasets and prepares train, dev, and test files in the DeepSpeech format. The program also creates a transcript file for building a KenLM language model. Would this be of interest to you? May I push the script to your repository?

32r81b avatar Jan 14 '20 13:01 32r81b

Please create a pull request

snakers4 avatar Jan 14 '20 14:01 snakers4

Current status update

Given the current figures:

  • ~US$30 per month community support (3 backers via open collective)
  • ~US$300 direct hosting fees for 16 days of February, i.e. US$500-600 per month
  • Only 3 users downloading the torrent vs. at least 100 direct downloads this month; in total, ~300-400 direct downloads vs. ~10-15 torrent downloads

And the fact that some people downloading the dataset are clearly abusing our licence (i.e. obviously commercial companies claiming to use our dataset for "research" purposes) - I have decided to temporarily suspend the direct downloads.

If you are an Open Collective backer and need a direct link, please ping me and I will send you a private link.

Further ideas:

  • We will migrate the whole dataset to opus. Most likely it will be shared ONLY via torrent, at least until we get around US$200-300 of support per month via Open Collective
  • Still undecided whether to include new domains / languages, given the lack of community support

P.S.

Speaking personally: if even 10% of the roughly 400 people who downloaded the dataset supported us with 10 dollars per month, the dataset would be available to everyone via a direct link. But the statistics above, combined with the attitude of some companies, lead me to certain conclusions.
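
As a rough check of the figures in this update, the arithmetic can be sketched as below. The 29-day month (February 2020) and the variable names are assumptions made for illustration; the dollar amounts and download counts are the ones quoted above.

```python
# Figures quoted in the status update; the 29-day month is an assumption
# (February 2020 was a leap-year February).
hosting_fee_usd = 300   # direct hosting fees billed so far
days_billed = 16        # days of February covered by that bill
days_in_month = 29

# Linear extrapolation of the bill to the full month.
projected_monthly_fee = hosting_fee_usd / days_billed * days_in_month

# Hypothetical from the P.S.: 10% of ~400 downloaders giving US$10 per month.
downloaders = 400
share_backing = 0.10
pledge_usd = 10
hypothetical_support = downloaders * share_backing * pledge_usd

print(f"projected hosting: ~US${projected_monthly_fee:.0f} per month")
print(f"hypothetical support: ~US${hypothetical_support:.0f} per month")
```

The projection lands inside the US$500-600 range given above, and the hypothetical support level would exceed the US$200-300 threshold mentioned in the further ideas.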

snakers4 avatar Feb 17 '20 08:02 snakers4

Hi, @32r81b. I wrote a small Python program that converts data from the unpacked datasets and prepares train, dev, and test files in the DeepSpeech format. The program also creates a transcript file for building a KenLM language model. Would this be of interest to you? May I push the script to your repository?

Can you please share the script? (I wrote my own, but it runs slowly and doesn't read Russian characters.) I would be very grateful.

Advencher avatar Mar 22 '20 20:03 Advencher

Sorry, the file is still in development. Below is a link to a rough draft. If you want to process several large files, it is better to parallelize. The script reads a CSV file with the exclusion list (public_exclude_file_v5.csv) and the dataset file (public_youtube700.csv). It then drops the records from public_youtube700.csv that appear in public_exclude_file_v5.csv, reads the remaining files from disk, determines the audio duration, converts them to the required format, and puts them in a separate folder. Very short and very long audio files are also filtered out. At the end, a text file is saved in the DeepSpeech training format.

https://github.com/32r81b/open_stt/blob/master/utils/0.1%20open_stt_prepeare%200.py
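
For readers who only need the overall shape of the steps described above, here is a minimal sketch using the Python standard library. It is not the linked script: the function names, the `(wav_path, transcript)` row layout, and the 1-20 second duration limits are assumptions made for illustration, and the audio conversion step is left out. The output header (`wav_filename`, `wav_filesize`, `transcript`) is the CSV layout DeepSpeech expects for training files.

```python
import csv
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, computed from its header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def select_utterances(manifest_rows, excluded_paths, min_sec=1.0, max_sec=20.0):
    """Drop excluded files, then drop clips that are too short or too long.

    manifest_rows: iterable of (wav_path, transcript) pairs (assumed layout).
    excluded_paths: set of wav paths taken from the exclusion list.
    """
    kept = []
    for wav_path, transcript in manifest_rows:
        if wav_path in excluded_paths:
            continue
        duration = wav_duration_seconds(wav_path)
        if min_sec <= duration <= max_sec:
            kept.append((wav_path, transcript))
    return kept


def write_deepspeech_csv(rows, out_csv):
    """Write a CSV with the wav_filename, wav_filesize, transcript columns."""
    # utf-8 keeps Cyrillic transcripts intact regardless of the OS default.
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in rows:
            writer.writerow([wav_path, Path(wav_path).stat().st_size, transcript])
```

Opening the output file with an explicit `encoding="utf-8"` is one common way to avoid losing Russian characters when the platform default encoding is something else. For several large manifests, `select_utterances` could be run per manifest in separate processes.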

32r81b avatar Mar 29 '20 08:03 32r81b

Academic Torrents is down. I wrote to their admin to find out what happened.

snakers4 avatar Apr 14 '20 12:04 snakers4

ru_open_stt_wav_v10.zip

The torrent file. Not sure how to manually add peers yet.

snakers4 avatar Apr 14 '20 13:04 snakers4

Working on hosting the torrent elsewhere

snakers4 avatar Apr 14 '20 13:04 snakers4

https://rutracker.org/forum/viewtopic.php?t=5880804

Not approved yet. Not sure whether my client (qBittorrent) will properly support uploading via several trackers (I am a bit rusty on how torrents work under the hood).

snakers4 avatar Apr 14 '20 13:04 snakers4

Ppl, please seed

snakers4 avatar Apr 14 '20 13:04 snakers4

My upload speed is 20-30 MiB/s, so it definitely works

snakers4 avatar Apr 14 '20 13:04 snakers4

Also, here is a magnet link until the rutracker page gets approved

magnet:?xt=urn:btih:A7929F1D8108A2A6BA2785F67D722423F088E6BA&tr=http%3A%2F%2Fbt3.t-ru.org%2Fann%3Fmagnet&dn=Russian%20Open%20Speech%20To%20Text%20(STT%2FASR)%20Dataset%20%5B100%2C%2016000000%5D
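
If you want to check what the magnet link above contains before adding it to a client, a small sketch like the following splits it into its standard fields: `xt` (exact topic, carrying the info-hash), `tr` (tracker URL), and `dn` (display name). The variable names are illustrative.

```python
from urllib.parse import parse_qs

magnet = (
    "magnet:?xt=urn:btih:A7929F1D8108A2A6BA2785F67D722423F088E6BA"
    "&tr=http%3A%2F%2Fbt3.t-ru.org%2Fann%3Fmagnet"
    "&dn=Russian%20Open%20Speech%20To%20Text%20(STT%2FASR)%20Dataset"
    "%20%5B100%2C%2016000000%5D"
)

# Everything after the first "?" is an ordinary URL query string.
query = magnet.partition("?")[2]
params = parse_qs(query)  # percent-decoding is applied to the values

info_hash = params["xt"][0].split(":")[-1]  # hex-encoded info-hash
trackers = params.get("tr", [])             # zero or more tracker URLs
display_name = params["dn"][0]

print(info_hash)
print(trackers)
print(display_name)
```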

snakers4 avatar Apr 14 '20 13:04 snakers4

academic torrents is back up so no worries

snakers4 avatar Apr 14 '20 14:04 snakers4

@32r81b where can I get the file public_exclude_file_v5.csv? It is not in the torrent.

Advencher avatar Apr 19 '20 14:04 Advencher

@32r81b thanks for the answer

Advencher avatar Apr 19 '20 14:04 Advencher

@32r81b where can I get the file public_exclude_file_v5.csv? It is not in the torrent.

It is in the tickets

snakers4 avatar May 04 '20 07:05 snakers4

New release https://github.com/snakers4/open_stt/releases/tag/v1.01

snakers4 avatar May 04 '20 07:05 snakers4

New release https://github.com/snakers4/open_stt/releases/tag/v1.02

snakers4 avatar May 05 '20 05:05 snakers4

A few announcements

  • wav torrent to be deprecated shortly, please switch to opus
  • opus reader helpers and build instructions available
  • there was a surge in using some legacy links - all of them will be permanently disabled shortly

snakers4 avatar May 09 '20 05:05 snakers4

A few announcements

  • Academic torrents moved to a new infrastructure
  • Please seed if you have downloaded the torrent
  • Microsoft is not sharing download statistics (or any statistics whatsoever) for their hosting - please leave any form of feedback on their direct links

snakers4 avatar Aug 27 '20 05:08 snakers4

Managed to fix seeding issues with new server OS version

https://github.com/snakers4/open_stt/issues/34

snakers4 avatar Sep 23 '20 08:09 snakers4

Update 2021-06-04

Added Zenodo direct link mirrors as well.

snakers4 avatar Jun 04 '21 16:06 snakers4

Azure links were reported to be very slow

snakers4 avatar Jun 04 '21 16:06 snakers4