Open-Assistant Add Swabian Dialect

Added a PR as requested in #2877

This is the translation of the interface into the German dialect "Schwäbisch" (Swabian).

According to Wiktionary and ISO 639-3, the language code is swg.

Here is the Language/Dialect Description page on Wikipedia: Swabian German

I would love to see people contributing to this dialect and will be adding prompts and assistant replies as soon as it is pulled 😊

Apr 27 '23 14:04 Logophoman

😅

Apr 28 '23 18:04 AbdBarho

It might be worth taking a broader approach and choosing the complete Alemannic dialect group (ISO: als). This would include Badenian and also Swiss dialects, which would increase the number of possible contributors a lot (10 million speakers instead of just 800,000).

Plus, there is also an Alemannic Wikipedia (https://als.m.wikipedia.org/), so there is at least some training data available for the base model, while the number of purely Swabian texts to train a language model on is probably very small.

I understand that this would create issues with different dialects that are not exactly the same, but as long as the dialect isn't too thick this should work in written conversations.

BTW, you can make ChatGPT speak pretty good Badenian using this prompt, so it is definitely possible to make a language model speak German dialects: https://github.com/stefangrotz/prompts/blob/main/alemanic-assistant.md

May 01 '23 21:05 stefangrotz

@stefangrotz according to this Wikipedia article, the language code als was used when there was not yet an established distinction between the Alemannic dialects.

I think a mixed approach would probably get more labelers and contributors involved overall. For training the model, though, I feel it won't make much of a difference if one solution tries to fold all dialects into one. It would rather cause massive conflicts between the labelers (since some Swabian and Badenian words and expressions are simply different), and messages would get up- and downvoted between Swabian and Badenian native speakers all the time, reducing overall quality. Since the model is multilingual anyway, it will get e.g. a German prompt and then know that it's supposed to generate in German. The same will be the case for these dialects, if there is enough training data.

The key is the multilinguality of our system -> It can still learn from closely related languages:

If you, for instance, prompt in Norwegian (I think there were ~100 messages trained when I checked), it will sometimes answer in English, Norwegian, Danish or Swedish -> These languages are closely related and there are not many messages yet, so the system cannot distinguish them properly. However, for closely related languages with more trees it gets better and better. What this shows is that there is no point in a one-size-fits-all solution, since the model will generalize from everything that is related anyway.

My Norwegian friend could understand the answers it generated even though things were mixed up with Danish and Swedish, but once more data exists the languages will become more distinct...

-> I think it is better to have one tree per dialect (maybe we generally need to think of a way of handling sub-languages, e.g. all British/German dialects), because then you know this is Swabian, Walser German, Swiss German etc. It also makes it easier to separate distinct dialects out of our training dataset (for instance, if someone builds a Badenian chatbot and wants to remove Swiss and Swabian messages that sound similar but are slightly different and would reduce the Badenian quality).

So I really think that if you can make a separation between the dialects, it should be done. And yes, even though 800,000 native speakers isn't that many, I think it's a good experiment as well, and I know a bunch of people who would get involved in labeling.
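The point about separating dialects out of the dataset could be sketched roughly as follows. This is only an illustration: the `ExportedMessage` shape and the `selectDialects` helper are hypothetical and not taken from the Open Assistant export code. Note that with the codes discussed in this thread, Badenian and Swiss German would share gsw, so a purely code-based filter could only separate them from Swabian (swg) and the others.

```typescript
// Hypothetical record shape; the real export format may differ.
interface ExportedMessage {
  text: string;
  lang: string; // language code such as "swg" or "gsw"
}

// Keep only messages whose language code is in the allowed list.
function selectDialects(
  messages: ExportedMessage[],
  allowed: string[]
): ExportedMessage[] {
  return messages.filter((m) => allowed.indexOf(m.lang) !== -1);
}

// Placeholder data, not real dataset content.
const sample: ExportedMessage[] = [
  { text: "example A", lang: "swg" },
  { text: "example B", lang: "gsw" },
  { text: "example C", lang: "wae" },
];

// Keeps only gsw-tagged messages and drops Swabian (swg) and Walser (wae).
const gswOnly = selectDialects(sample, ["gsw"]);
```

With one tree per dialect, this kind of selection is a simple filter; with a single merged als tree, the dialect information would have to be recovered some other way.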

@stefangrotz I personally suggest that, if you speak any Alemannic dialects, you make a PR adding these dialects with their respective language codes.

I'm not a linguist, but I think everything that is distinct enough to have one of these language codes is probably worth adding (if there are contributors who are willing to help).

According to the Alemannic German article, I think in the long run we would probably want to make PRs for all 4 distinct Alemannic languages that used to be under the als code:

Alemannic (Badenian)/Swiss German/Alsatian - gsw
Swabian (this PR) - swg
Walser German - wae
Colonia Tovar German - gct

Maybe we'll need to discuss more how we should handle dialects generally. Maybe the core team, e.g. @AbdBarho, could join the discussion and potentially review this PR.

May 02 '23 07:05 Logophoman

Okay, I see your points. If you believe that you can mobilize enough people to create a Swabian dataset, then a separate language version would be fine for me.

Adding many small dialects could lead to a very long language list though, but this isn't necessarily a bad thing and Open Assistant is probably the only place where a dialect dialogue dataset can be built up right now. The only real downside is more work during the data export.

May 02 '23 08:05 stefangrotz

Yeah! I think in the long run we might want to add something like a conditional dialects panel that shows up when dialects exist, especially if there are a ton of dialects that can be attributed to a language. Taking German as an example, I think there are a ton of other dialects that could be added (Plattdeutsch, Sächsisch etc.). So one could choose German and then, in a separate dialects option, Swiss German. I think the work during data export is an issue, but I would consider having these dialects worthwhile. I think it could also help to preserve these dialects better against decay and provide lots of value to e.g. old people who are stuck with their dialect 🤔

Also if you think for instance about the many Indian and Chinese dialects that are spoken by millions of people, having a good way of dealing with that could be genius for all the native speakers!
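The conditional dialects panel idea could be sketched roughly like this. Everything here is an assumption for illustration: the `dialectsByLanguage` table, the helper names, and the use of "de" as the base-language key are hypothetical and not part of the existing website code.

```typescript
// Hypothetical mapping from a base language to the dialect codes filed under it.
const dialectsByLanguage: Record<string, string[]> = {
  de: ["swg", "gsw", "wae", "gct"],
};

// Returns the dialect codes for a base language, or an empty list if none exist.
function getDialectOptions(baseLang: string): string[] {
  return Object.prototype.hasOwnProperty.call(dialectsByLanguage, baseLang)
    ? dialectsByLanguage[baseLang]
    : [];
}

// The panel is only shown when the chosen language has at least one dialect.
function shouldShowDialectPanel(baseLang: string): boolean {
  return getDialectOptions(baseLang).length > 0;
}
```

Choosing German would then reveal a second selector with Swabian, Swiss German and so on, while languages without registered dialects would look exactly as they do today.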

May 02 '23 08:05 Logophoman

I believe a simple labeling system for both language variants and topics for specialists will be necessary at some point.

I also worked for the Common Voice project, and this is an ongoing issue there as well. Languages like Portuguese, where the Brazilian variant is very different, should be split up in some way, but doing it is hard. I think a labeling system inside a standardized language is the easiest solution.

For dialects, a separate corpus might still be a better solution, but there is a fine line between a variant and a dialect. For this PR, just adding Swabian as a new language looks like the only possible solution for now.

May 02 '23 09:05 stefangrotz

@stefangrotz that's good to know, awesome work! I think handling dialects won't ever be straightforward, especially since it is hard to determine how closely related a language or dialect really is to its original source, and some are more closely related than others 🤔 - Fortunately, I think Swabian is quite distinct and can be added as a separate language for now (most "normal" Germans barely understand it, if at all), but in the long run I think it's a good idea to discuss how we want to integrate dialects, related languages and variants into Open Assistant 👍

@yk could you potentially review this discussion and merge this PR?

May 03 '23 06:05 Logophoman

@Logophoman Chrome currently does not return the correct display name for "swg" (it returns just "swg"). A mapping for "swg" probably needs to be added here: https://github.com/LAION-AI/Open-Assistant/blob/f25f74aa772707fa4e04260846a119b5936f04c1/website/src/lib/languages.ts#L6
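A fallback mapping of the kind described above might look roughly like this. It is a sketch under assumptions, not the actual contents of languages.ts: `displayNameOverrides`, `intlLookup` and `getLanguageDisplayName` are hypothetical names, and the only browser API assumed is the standard `Intl.DisplayNames`.

```typescript
// Hypothetical override table; the real languages.ts may be structured differently.
const displayNameOverrides: Record<string, string> = {
  swg: "Schwäbisch",
};

type NativeLookup = (code: string) => string | undefined;

// Wraps Intl.DisplayNames when the runtime provides it.
// The cast avoids depending on a specific TypeScript lib setting.
const intlLookup: NativeLookup = (code) => {
  const DisplayNames = (Intl as any).DisplayNames;
  if (!DisplayNames) return undefined;
  try {
    return new DisplayNames([code], { type: "language" }).of(code);
  } catch {
    return undefined;
  }
};

// Prefer a manual override, then the runtime's name, then the raw code.
function getLanguageDisplayName(
  code: string,
  lookup: NativeLookup = intlLookup
): string {
  if (Object.prototype.hasOwnProperty.call(displayNameOverrides, code)) {
    return displayNameOverrides[code];
  }
  const native = lookup(code);
  // A runtime that echoes the code back has no name for it.
  return native && native !== code ? native : code;
}
```

The override is checked first so that the result does not depend on which names a given browser version happens to ship.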

May 11 '23 12:05 andreaskoepf

@andreaskoepf Added the mapping to Open-Assistant/website/src/lib/languages.ts and also fixed some typos I made in my initial commit.

May 11 '23 13:05 Logophoman

@Logophoman thanks, could you quickly resolve the conflicts?

May 11 '23 13:05 andreaskoepf
