Upcoming releases and support
We are planning cool new releases soon (with a twist you are not expecting)!
You can also now support our initiative directly via Open Collective.

Hi, @snakers4. I wrote a small Python program that converts data from the unpacked datasets and prepares train, dev, and test files in the DeepSpeech format. The program also creates a transcript file for building a KenLM language model. Would this be of interest to you? May I push the script to your repository?
Please create a pull request
Current status update
Given the current figures:
- ~US$30 per month community support (3 backers via open collective)
- ~US$300 direct hosting fees for 16 days of February, i.e. US$500-600 per month
- Only 3 users downloading the torrent vs. at least 100 direct downloads this month; in total, ~300-400 direct downloads vs. ~10-15 torrent downloads
Given also that some people downloading the dataset are clearly abusing our licence (i.e. obviously commercial companies claiming to use our dataset for "research" purposes), I have decided to temporarily suspend the direct downloads.
If you are an Open Collective backer and need a direct link, please ping me and I will send you a private link.
Further ideas:
- We will migrate the whole dataset to opus; most likely the whole dataset will be shared ONLY via torrent, at least until we get around US$200-300 support per month via open-collective
- Still undecided whether to include new domains / languages given lack of community support
P.S.
Speaking personally: if even 10% of the roughly 400 people who downloaded the dataset supported us with 10 dollars a month, the dataset would be available to everyone via a direct link. But the statistics above, combined with the attitude of some companies, lead me to certain conclusions.
@32r81b, can you please share the script? (I wrote my own, but it runs slowly and doesn't read Russian characters.) I would be very grateful.
Sorry, the file is still in development. Below is a link to a draft version. If you want to process several large files, it is better to parallelize. The script reads a CSV file with the exclusion list (public_exclude_file_v5.csv) and the dataset's manifest file (public_youtube700.csv). It then drops the entries in public_youtube700.csv that appear in public_exclude_file_v5.csv, reads the remaining files from disk, determines the audio duration, converts the audio to the required format, and puts it in a separate folder. Very short and very long audio clips are also filtered out. At the end, a text file is saved in the DeepSpeech training format.
https://github.com/32r81b/open_stt/blob/master/utils/0.1%20open_stt_prepeare%200.py
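For readers who do not want to open the draft, the filtering steps described above can be sketched roughly as follows. This is not the linked script: the function names, the in-memory row layout (path, transcript, duration), and the 1-20 second duration limits are assumptions made for illustration, and the audio conversion step is left out. The output header (`wav_filename`, `wav_filesize`, `transcript`) is the column layout DeepSpeech expects in its training CSVs.

```python
import csv
import io
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def select_rows(manifest_rows, excluded_paths, min_sec=1.0, max_sec=20.0):
    """Drop excluded files and clips that are too short or too long.

    manifest_rows: iterable of (wav_path, transcript, duration_seconds).
    The duration limits are illustrative defaults, not values from the script.
    """
    kept = []
    for wav_path, transcript, duration in manifest_rows:
        if wav_path in excluded_paths:
            continue
        if not min_sec <= duration <= max_sec:
            continue
        kept.append((wav_path, transcript, duration))
    return kept


def write_deepspeech_csv(rows, file_sizes, out):
    """Write rows in the DeepSpeech training CSV layout."""
    writer = csv.writer(out)
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for wav_path, transcript, _duration in rows:
        writer.writerow([wav_path, file_sizes[wav_path], transcript])


# Tiny in-memory example with made-up rows. Real use would read the CSVs with
# open(path, encoding="utf-8", newline="") so that Cyrillic text survives.
manifest = [
    ("a.wav", "привет мир", 3.2),
    ("b.wav", "слишком коротко", 0.4),   # shorter than min_sec, dropped
    ("c.wav", "исключённый файл", 5.0),  # listed in the exclusion set, dropped
]
excluded = {"c.wav"}
sizes = {"a.wav": 102444, "b.wav": 12844, "c.wav": 160044}

kept = select_rows(manifest, excluded)
buffer = io.StringIO()
write_deepspeech_csv(kept, sizes, buffer)
deepspeech_csv = buffer.getvalue()
```

Opening every file with an explicit `encoding="utf-8"` is the usual fix when a script "doesn't read Russian characters", and `wav_duration_seconds` can supply the duration when the manifest does not already contain one.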
Academic Torrents is down. I wrote to their admin to see what is happening.
Working on hosting the torrent elsewhere
https://rutracker.org/forum/viewtopic.php?t=5880804
Not approved yet. I am not sure whether my client (qBittorrent) will properly support uploading via several trackers (I am a bit rusty on how torrents work under the hood).
People, please seed

My upload speed is 20-30 MiB/s, so it definitely works
Also, here is a magnet link until the rutracker page gets approved:
magnet:?xt=urn:btih:A7929F1D8108A2A6BA2785F67D722423F088E6BA&tr=http%3A%2F%2Fbt3.t-ru.org%2Fann%3Fmagnet&dn=Russian%20Open%20Speech%20To%20Text%20(STT%2FASR)%20Dataset%20%5B100%2C%2016000000%5D
Academic Torrents is back up, so no worries.
@32r81b where can I get the file public_exclude_file_v5.csv? It is not in the torrent.
@32r81b thanks for the answer
It is in the tickets.
New release https://github.com/snakers4/open_stt/releases/tag/v1.01
New release https://github.com/snakers4/open_stt/releases/tag/v1.02
A few announcements
- The wav torrent will be deprecated shortly; please switch to opus
- opus reader helpers and build instructions available
- there was a surge in using some legacy links - all of them will be permanently disabled shortly
A few announcements
- Academic torrents moved to a new infrastructure
- Please seed if you have downloaded the torrent
- Microsoft does not share download statistics (or any statistics whatsoever) for their hosting, so please leave any form of feedback on their direct links
Managed to fix the seeding issues with the new server OS version
https://github.com/snakers4/open_stt/issues/34
Update 2021-06-04
Added Zenodo direct link mirrors as well.
Azure links were reported to be very slow