[BUG][FR] Complete text-corpora not accessible anymore
**Describe the bug**
I first thought of this post as a feature request, but it is clearly a bug introduced by the workflow change. A Common Voice dataset consists of a text corpus and a voice corpus. The dataset releases contain only the recorded portion of the text corpora (embedded in .tsv files, which should be de-duplicated), but the whole text corpora used to be accessible through this repo (under server/data/[lc]/*.txt). There were two ways to add a new text corpus:
- Bulk submissions that end as text data, under server/data/[lc]/*.txt
- Sentences added/verified/exported through the Sentence Collector, into server/data/[lc]/sentence_collector.txt
Now, with the new interface, anything added through the Write/Review tabs and/or through a TSV-format PR ends up directly in the database, which is not accessible to the public.
Unless this is corrected, the dataset is crippled.
**Expected behavior**
The old behavior should continue (or be mimicked).
- Bulk submissions through TSV-style data should end up as text data under server/data/[lc]/*.txt (e.g., if I scan a public-domain book named "Xxx Yyy" by writer Zzz and post it as zzz_xxx_yyy.tsv, it should end up as zzz_xxx_yyy.txt, so other people can see what was added in bulk).
- Sentences added/verified through the Write/Review process should also be added immediately to server/data/[lc]/<put_a_generic_name_here>.txt.
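The first bullet could be mimicked today with a small conversion script. A minimal sketch in Python - the header row and the `sentence` column name are assumptions on my part, since the actual bulk-submission TSV schema may differ:

```python
import csv
from pathlib import Path

def tsv_to_corpus_txt(tsv_path: str, out_dir: str) -> Path:
    """Extract the sentence column from a bulk-submission TSV and
    write it as a de-duplicated .txt file with the same stem,
    e.g. zzz_xxx_yyy.tsv -> zzz_xxx_yyy.txt."""
    src = Path(tsv_path)
    out = Path(out_dir) / (src.stem + ".txt")
    seen = set()
    with src.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        with out.open("w", encoding="utf-8") as txt:
            for row in reader:
                sentence = row["sentence"].strip()
                if sentence and sentence not in seen:  # de-dup, keep order
                    seen.add(sentence)
                    txt.write(sentence + "\n")
    return out
```

Running it on zzz_xxx_yyy.tsv would produce a de-duplicated zzz_xxx_yyy.txt next to the other corpus files, preserving the provenance in the file name.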
**Why is this important**
- People record the sentences they are served from the text corpus
- Vocabulary and resultant phonemes are very important for voice models
- For people (be they scientists, engineers, or researchers) to analyze and compare datasets, versions, etc., the whole corpus is needed.
- For communities to add more vocabulary, they need to analyze the existing text-corpora
**Other possibilities**
Actually, the previous approach also had its shortcomings: it lacked the time axis, as the text corpora were not tied to releases. A better approach would be to provide a snapshot of the sentences as part of the released datasets (in the .tar.gz files) - this part is a feature request. It would enable us to analyze the text corpora between releases. Until v13.0, I did this by cloning the repo after each release and analyzing the files under server/data - which is no longer possible.
Hello, thanks for bringing this up.
The reason the sentences were available as text corpora is that the old Sentence Collector lived in a separate repo and database. Therefore, we had to resort to temporary technical fixes, such as storing the sentences as text files and importing them into Common Voice.
Now that the Sentence Collector and Common Voice share the same repo and database, it is more efficient to keep the sentences in the database rather than in text files.
We have an API on our roadmap (no firm delivery dates yet) that would allow users to fetch the sentences for a language; this would replace keeping the text corpora for the languages locally in text files.
When this API is ready we will let the community know so they can make use of it.
Thanks for your understanding.
Thank you for the quick answer @moz-rotimib, I really appreciate it. I hit this issue while updating the data in my analyzer webapp (text-corpus tab).
> We have plans for an API in our roadmap
I have concerns about this also:
- The Common Voice API is not public; AFAIK that means "do not hit it as users; we use it internally and can change it at any time, without versioning, etc."
- For some languages, the text corpus is very large (e.g., en has 1,586,229 sentences and 99,546,635 characters). Hitting such API endpoints for 100+ languages will not be feasible. The alternative (git) was very efficient in this respect.
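For what it's worth, figures like these were easy to compute from the old layout. A minimal sketch, assuming one sentence per line in the former server/data/[lc]/*.txt files:

```python
from pathlib import Path

def corpus_stats(corpus_dir: str) -> tuple[int, int]:
    """Count sentences (non-empty lines) and characters across all
    .txt files of one language directory, e.g. server/data/en/."""
    sentences = 0
    characters = 0
    for txt in Path(corpus_dir).glob("*.txt"):
        for line in txt.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line:
                sentences += 1
                characters += len(line)
    return sentences, characters
```

A `git pull` followed by a loop over the language directories gave these numbers for all 100+ languages in one pass, with no API traffic at all.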
Follow-up:
- I concur with the database decision with regard to performance, but wouldn't it be logical to keep the old workflow until the new one is fully incorporated (the same goes for the multi-sentence Write interface)?
- What will happen to cv-sentence-extractor exports, both now and in the long run? I've been working on it for 2-3 weeks now...
- Will the API include "source" info? That would make the data transfers huge, but it is very valuable - e.g., one could identify sentences from religious texts and exclude them from training, or vice versa, depending on the domain.
- What about including the text corpora with the datasets?
I was planning a comparative analysis of the text corpora between versions, as well as resource-based analysis (using the .txt files), which is a bust for now.
> What will happen to cv-sentence-extractor exports, both now and in the long run? I've been working on it for 2-3 weeks now...
I'd say this should not be an exception; it should always use whatever mechanism is used for bulk submissions. Right now that might mean PRs; in the future, as I understand it, the website itself.
This also breaks the workflow I had for commonvoice-utils, where I could use the text corpus dump to build data normalisers/validators for the currently available languages. Now I don't have any way to access that data. Having the data in a repo, even if it were only updated every week or so, would be extremely useful.
> Having the data in a repo, even if it were only updated every week or so, would be extremely useful.
I'm all in for this as well. I don't think anything is preventing it.
> I'd say this should not be an exception
With regard to my previous post: granularity in the data is better than merging it and losing information. For example, as I've mentioned several times, the language quality of Wikipedia for Turkish (and, I suppose, for many languages) is not superb. If I spot an offending sentence, I can locate its origin if the data is in sentence_extractor.txt (or if we have a source field).
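To make the granularity point concrete: if a source field existed, domain filtering would reduce to a single predicate. A hypothetical sketch - the (sentence, source) pair layout is invented for illustration, not taken from Common Voice:

```python
def filter_by_source(records, excluded_sources):
    """Keep (sentence, source) pairs whose source is not excluded.
    E.g. drop sentences extracted from a problematic source, or
    invert the test to keep only a target domain."""
    excluded = set(excluded_sources)
    return [(sent, src) for sent, src in records if src not in excluded]
```

The same predicate also answers the monitoring question: filtering on `src == "sentence_extractor"` (or any one source) would isolate exactly the sentences that need review.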
As a stop-gap, I think a scheduled job that dumps the sentence data into a separate repo* or an S3 bucket (or whatever datastore) with a publicly accessible link would work. It would allow CV to deprecate and ultimately remove the sentence text files (which would greatly reduce image/repo size, improve start times, etc.) while giving the community access to the data until a proper API/frontend can be implemented.
Realistically, if anyone wants to try their hand at it, the existing queue/scheduling logic in takeout.ts can serve as inspiration, as can my current PR #4088.
*Dumping to a GitHub repo would be pretty messy.
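For illustration, the dump step of such a job could look roughly like the sketch below. Everything here is hypothetical - the real job would read (locale, sentence) pairs from the Common Voice database and then sync the output directory to the bucket as a separate step:

```python
from collections import defaultdict
from pathlib import Path

def dump_sentences(rows, out_dir: str) -> list[Path]:
    """Write one <locale>.txt file per language from database rows.
    `rows` stands in for a query result: an iterable of
    (locale, sentence) pairs. Uploading `out_dir` to S3 (or pushing
    it to a repo) would be a separate, scheduled step."""
    by_locale = defaultdict(list)
    for locale, sentence in rows:
        by_locale[locale].append(sentence)
    base = Path(out_dir)
    base.mkdir(parents=True, exist_ok=True)
    written = []
    for locale, sentences in sorted(by_locale.items()):
        path = base / f"{locale}.txt"
        path.write_text("\n".join(sentences) + "\n", encoding="utf-8")
        written.append(path)
    return written
```

Run weekly, this would roughly reproduce the old server/data layout without keeping the files in the application image.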
This also makes it very hard for the local community to monitor the new sentences, so we can't open PRs to remove the problematic ones. Many more bad sentences, e.g. sentences with mixed languages, are now live on the site.
It is now hard for us to help contributors improve, to promote the project, or to provide further support, and we face an even harder challenge when demonstrating Common Voice to people, other organizations, or linguists.
With the Common Voice datasets v17.0 release, the whole text corpus is included in the .tar.gz release files, split into validated and unvalidated sentences. The team also added a unique hash for each sentence wherever it appears, so using it as an index will speed up analysis code considerably. This is more than I expected - excellent! Closing this...
Thank you team!
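As a footnote for anyone doing similar cross-release comparisons: with per-sentence hashes, diffing two releases becomes a set operation on the hash column. A sketch, assuming each release ships a TSV with `sentence_id` (the hash) and `sentence` columns - check the actual column names in the v17.0 files before relying on this:

```python
import csv
from pathlib import Path

def load_index(tsv_path: str) -> dict[str, str]:
    """Map sentence hash -> sentence text for one release's sentence TSV."""
    with Path(tsv_path).open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return {row["sentence_id"]: row["sentence"] for row in reader}

def diff_releases(old_tsv: str, new_tsv: str):
    """Compare two releases by sentence hash.
    Returns (added, removed) as hash -> sentence dicts."""
    old, new = load_index(old_tsv), load_index(new_tsv)
    added = {h: s for h, s in new.items() if h not in old}
    removed = {h: s for h, s in old.items() if h not in new}
    return added, removed
```

Comparing on hashes avoids re-normalising the sentence text on every run, which is exactly the speed-up mentioned above.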