common-voice Add possibility for users to choose preferred orthography

Up front: I do not expect this feature request to be implemented any time soon (if ever); I'm just filing it so that I don't forget about it.

Is your feature request related to a problem? Please describe.

Some languages have multiple competing orthographies used in the same country. The sound is the same, but the letters are different. In many cases the conversion (at least in one direction) is mechanical.

For example:

Serbian Cyrillic and Serbian Latin
Kazakh Cyrillic and Kazakh Latin
Basaa General and Basaa Missionary
Punjabi Shahmukhi and Punjabi Gurmukhi

In some cases this is because of a stated move from one orthography to another (Kazakh); in other cases both orthographies may be maintained (Serbian).

Describe the solution you'd like

For certain languages it should be possible to provide sentences in more than one orthography and for users to specify a preferred orthography.

In order for this to be valid:

All sentences must be in both orthographies
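A minimal sketch of how this requirement and a per-user preference could fit together. Everything in it is an assumption for illustration: the `Sentence` class, the `missing_orthographies` and `text_for` helpers, and the use of ISO 15924 script codes (`Cyrl`, `Latn`) as keys are hypothetical and not taken from the Common Voice codebase.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    """One prompt, stored once per orthography (hypothetical model)."""
    sentence_id: str
    texts: dict = field(default_factory=dict)  # script code -> text


def missing_orthographies(sentences, required):
    """Return {sentence_id: missing script codes} for incomplete sentences."""
    gaps = {}
    for s in sentences:
        missing = set(required) - set(s.texts)
        if missing:
            gaps[s.sentence_id] = missing
    return gaps


def text_for(sentence, preferred):
    """Pick the text shown to a user who chose `preferred` in their settings."""
    return sentence.texts[preferred]


sentences = [
    Sentence("sr-1", {"Cyrl": "Шта хоћеш?", "Latn": "Šta hoćeš?"}),
    Sentence("sr-2", {"Cyrl": "Знање је моћ."}),  # Latin version missing
]

# Only sr-2 fails the "both orthographies" rule.
gaps = missing_orthographies(sentences, {"Cyrl", "Latn"})
```

With a check like this, a language would only be offered with an orthography switch once `gaps` is empty.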

Describe alternatives you've considered

We definitely do not want to split these into separate locales, and have separate communities for e.g. Serbian Latin and Serbian Cyrillic. The audio will be identical and we would just end up fragmenting the community of contributors and validators.

Additional context

In some countries, some languages are not taught in schools. They may be taught at private institutions or organisations, or they may be self-taught or taught in the home. There may be two or more competing orthographies, and some speakers may be more or less able to read in each of them.

When there is an official orthography, this should be used, but we should also consider the needs of people who might not have received an education in that particular orthography.

This is purely a user interface consideration. On the ASR side of things the orthographies could be converted automatically. But if we want to display them to people we should be able to display them in the orthography of their choosing.

Oct 23 '20 13:10 ftyers

I don't think this is going to be implemented by the devs, as it is a specific task for a specific need. However, feel free to submit a PR which implements this!

Nov 05 '20 14:11 dag7dev

Thank you for your feedback. I don't see the need as particularly specific -- orthographic diversity is well attested in the world's languages -- but in any case, as I mention in the first line of this report, this is a reminder for myself.

Nov 05 '20 15:11 ftyers

I think this would be a very helpful feature. Serbian Cyrillic and Serbian Latin would probably be the easiest for an MVP, if anyone is interested in doing a pull request, as it is one letter per sound, so a one-to-one mapping could be created very easily.

Is there a way to replace characters on the screen on the fly with a small, lightweight vanilla .js file or the like?

May 01 '21 14:05 darigovresearch

Serbian is lossless in the Cyrillic → Latin direction but not in the other direction, because of words like injekcija vs. znanje. I think the way I would do it would be offline. If you are interested in working on this, I'd love to collaborate; I have a good idea of how it should be done. You can find me on the Common Voice channel on Matrix :)
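To make the asymmetry concrete, here is a minimal sketch in Python. The Cyrillic → Latin table is the standard Serbian letter correspondence; the function names and the greedy digraph strategy in `lat_to_cyr_naive` are assumptions chosen for illustration, not an existing Common Voice tool. In injekcija the Latin "nj" stands for two letters (нј), while in znanje it stands for one (њ), so a context-free Latin → Cyrillic pass cannot get both right.

```python
# Serbian Cyrillic -> Latin: each Cyrillic letter maps to exactly one Latin
# letter or digraph, so this direction needs no context.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}


def cyr_to_lat(text):
    """Transliterate Cyrillic to Latin; other characters pass through."""
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)                # punctuation, digits, spaces
        elif ch.isupper():
            out.append(lat.capitalize())  # "Љ" -> "Lj" (all-caps words not handled)
        else:
            out.append(lat)
    return "".join(out)


def lat_to_cyr_naive(text):
    """Greedy Latin -> Cyrillic: always reads lj/nj/dž as a single letter."""
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2].lower()
        if pair in ("lj", "nj", "dž"):
            cyr, step = LAT_TO_CYR[pair], 2
        else:
            cyr, step = LAT_TO_CYR.get(text[i].lower(), text[i]), 1
        out.append(cyr.upper() if text[i].isupper() else cyr)
        i += step
    return "".join(out)


# Round trip works for znanje but not for injekcija.
ok = lat_to_cyr_naive(cyr_to_lat("знање")) == "знање"
broken = lat_to_cyr_naive(cyr_to_lat("инјекција"))  # comes back with њ, not нј
```

Words like injekcija would therefore need an exception list or a manual check when going from Latin to Cyrillic.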

May 01 '21 15:05 ftyers

I believe the source of the Serbian translation of the site is in Cyrillic, so the Latin → Cyrillic issue may not be a problem for an MVP. The minimum of 5000 public domain sentences for Serbian has not yet been reached, but it appears they too will all be in Cyrillic.

JavaScript isn't our specialty, but we're happy to beta test and contribute where possible, as this may be helpful for other people. We were expecting a browser-side, on-the-fly computation to re-render the characters on the page. What did you mean by offline?

May 01 '21 19:05 darigovresearch

I mean that I would expect the orthography conversion to be done offline and stored. On-the-fly conversion will not be possible for most languages, so an offline conversion plus a potential post-editing step will be required. If you want help collecting public domain sentences, I'm happy to help out.

May 01 '21 20:05 ftyers

The best source for filling the 5000 sentence requirement seemed to be here, but we are unsure whether Project Gutenberg books have been used in the past and whether the license is compatible.

May 01 '21 21:05 darigovresearch

Here is a list of 6747 sentences in Serbian that are public domain from the SETimes corpus. I have converted them from Serbian Latin to Serbian Cyrillic and the file is in both orthographies (these can be added now).

And here is a list of the top-5000 utterances from the OpenSubtitles corpus for Serbian (all with a frequency of over 500 -- at this frequency the phrases are basically things people say every day and thus are not copyrighted), again transliterated. The issue here is that there are a lot of typos in the original Latinica, for example:

576	Sta hoces?	Ста хоцес?

Sta hoces should probably be Šta hoćeš (these should be checked).
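One way such lines could be surfaced for checking is sketched below. It is an assumption-laden illustration, not something used in this thread: the three-word `LEXICON` is a hypothetical stand-in for a real Serbian word list, and `suspect_words` is a made-up helper name. The idea is to strip diacritics from known words and flag any input word that matches a stripped form.

```python
import re

# Map Serbian Latin letters with diacritics to their bare counterparts.
STRIP = str.maketrans("čćšžđČĆŠŽĐ", "ccszdCCSZD")

# Tiny illustrative lexicon; a real check would need a proper word list.
LEXICON = {"šta", "hoćeš", "znaš"}

# Bare spelling -> correctly spelled word, e.g. "hoces" -> "hoćeš".
BARE_TO_PROPER = {word.translate(STRIP): word for word in LEXICON}


def suspect_words(sentence):
    """Return {word: suggestion} for words that look diacritic-less."""
    suggestions = {}
    for word in re.findall(r"\w+", sentence.lower()):
        proper = BARE_TO_PROPER.get(word)
        if proper is not None and proper != word:
            suggestions[word] = proper
    return suggestions


flagged = suspect_words("Sta hoces?")
clean = suspect_words("Šta hoćeš?")
```

The suggestions still need a human check, because some bare spellings are valid words in their own right (sto and što, for example).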

May 01 '21 21:05 ftyers

The best source for filling the 5000 sentence requirement seemed to be here, but we are unsure whether Project Gutenberg books have been used in the past and whether the license is compatible.

Project Gutenberg is fine, but the texts there are quite old.

May 01 '21 21:05 ftyers

All very interesting corpora, but we're not sure they can be used in this case, as CC0 (public domain) is mentioned as a requirement.

The SETimes corpus from Wikipedia appears to be under a CC-BY-SA license, so not CC0. It's not clear whether OpenSubtitles can count as CC0 if the subtitles come from works which are copyrighted.

May 03 '21 14:05 darigovresearch

SETimes is public domain. I scraped the corpus in 2010. There is a paper describing it. The Croatian research lab put that licence on it, which is up to them, but the original text is public domain. I don't believe that the phrase "What do you want?" is copyrightable. http://web.archive.org/web/20100206072508/http://setimes.com/cocoon/setimes/xhtml/en_GB/document/setimes/footer/disclaimer/disclaimer

May 03 '21 15:05 ftyers

Thanks for the clarification @ftyers. Are you able to programmatically submit the SETimes Serbian Cyrillic sentences to the sentence collector, if you haven't already, since they are CC0? That should be more than enough to reach the initial 5,000 to get started with. We can help with reviewing the sentences to try to get them into the next release.

May 03 '21 19:05 darigovresearch

Sure, but I'd prefer to have community input before I spam them with automatically converted Cyrillic text :) Who is going to review them?

May 03 '21 19:05 ftyers

Yeah, community input would be great if there is anyone who could advise.

From our understanding, the way to upload them is just to copy and paste them into a text field on the site above, where one line is one sentence. No need to convert anything; just remove everything before the Cyrillic characters on each line. Feel free to try a small subset, like 50 lines of the https://tepozcatl.omnilingo.cc/sr/setimes.cand.Latn-Cyrl.txt file. As said, we can review them in the first instance if there are no objections.
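The "remove everything before the Cyrillic characters" step could be scripted along these lines. This is a sketch under assumptions: the function names are hypothetical, and it assumes the Cyrillic sentence is the last field on each line (as in the tab-separated OpenSubtitles sample earlier in the thread) and that nothing before it contains Cyrillic characters.

```python
import re
from typing import Optional

# Any character in the basic Cyrillic Unicode block (U+0400 to U+04FF).
FIRST_CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def cyrillic_part(line: str) -> Optional[str]:
    """Return the line from its first Cyrillic character on, or None."""
    match = FIRST_CYRILLIC.search(line)
    if match is None:
        return None
    return line[match.start():].strip()


def extract_sentences(lines, limit=50):
    """Collect up to `limit` Cyrillic sentences, one per input line."""
    sentences = []
    for line in lines:
        if len(sentences) >= limit:
            break
        sentence = cyrillic_part(line)
        if sentence:
            sentences.append(sentence)
    return sentences


sample = ["576\tSta hoces?\tСта хоцес?\n", "no Cyrillic here\n"]
result = extract_sentences(sample)
```

The returned list can be joined with newlines and pasted into the text field, one sentence per line.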

May 03 '21 19:05 darigovresearch

I've uploaded the 20 shortest ones.

May 03 '21 19:05 ftyers

The 20 seemed to be uploaded correctly and we could review them very easily in the app. Nice work, and thanks for having a go. The only suggestion we would make is to change the source to the following, so that the origin of the public domain status is a little clearer, but that is up to you, or perhaps someone else can provide further clarity.

SETimes, https://tepozcatl.omnilingo.cc/sr/setimes.cand.Latn-Cyrl.txt (http://web.archive.org/web/20100206072508/http://setimes.com/cocoon/setimes/xhtml/en_GB/document/setimes/footer/disclaimer/disclaimer)

May 03 '21 20:05 darigovresearch

Will you review them if I upload the rest? (Also this is kind of hijacking this issue.) We should probably take it to Discourse.

May 03 '21 20:05 ftyers

I started reviewing those sentences for Serbian. I will probably be done in a day or two, because most of the ones I've seen are less than five words long. We are fewer than 200 short of the 5000 goal.

May 21 '21 08:05 Fooftilly

Thanks to @Fooftilly & @ftyers we now have a new live language for which a proof of concept can be made: Serbian. There are also Kazakh and Punjabi, but we think Serbian may be the easiest to start off with.

We think it may be best and easiest to generate the orthographic change dynamically, on the fly, rather than needing to periodically update a manually generated offline version of a language in the other orthography. Potentially this could be a vanilla JS file that contains the character mappings and updates whatever is relevant and loaded on the page. There will also need to be somewhere in the settings for users to change their preference.

Jun 20 '21 18:06 darigovresearch

Hey everyone, my name is Gina, and I am the language community coordinator for CV. I'm pleased to share that we're actively working on integrating this feature, and we'd love to collaborate with you guys. Would any of you be interested in working with us on this?

We are looking forward to this exciting feature.

Mar 26 '24 08:03 ginamoape