
WIP: Update German Language Files

Open Syphdias opened this issue 1 year ago • 28 comments

I noticed some of the German words being capitalized when they should not be – unless at the start of a sentence. I believe this happened because the sentences analyzed were not normalized to fix this issue.

I believe that in monkeytype without punctuation, words should be capitalized (or not) as if they were in the middle of a sentence. Only if punctuation is enabled should words be capitalized after a period.

To fix this I took a look at the original source mentioned in the comment section of the JSON files. I wrote a script to download the frequency map, correct capitalization, and remove abbreviations and non-German words.

I do not believe this list of corrections to be exhaustive and it might need adjustments in the future. Please feel free to open a PR against the script I used: https://github.com/Syphdias/monkeytype-generate-german
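(Editor's note: for illustration, the capitalization-normalization step described above could look roughly like the sketch below. This is a minimal, hypothetical version; `merge_case_variants` is an illustrative helper and not necessarily what the linked script actually does. The idea: collapse case variants of each word and keep whichever spelling is more frequent in the corpus, so German nouns stay capitalized and everything else ends up lowercase.)

```python
from collections import defaultdict

def merge_case_variants(freq):
    """Collapse capitalization variants of each word.

    For every group of case variants ("haus" / "Haus"), keep whichever
    spelling occurs more often in the corpus: German nouns will usually
    win in capitalized form, other words in lowercase. The frequencies
    of all variants are summed onto the winning spelling.
    """
    groups = defaultdict(dict)
    for word, count in freq.items():
        groups[word.lower()][word] = count

    merged = {}
    for variants in groups.values():
        best = max(variants, key=variants.get)  # most frequent spelling
        merged[best] = sum(variants.values())
    return merged
```

For example, `merge_case_variants({"haus": 3, "Haus": 90, "und": 100, "Und": 5})` keeps `"Haus"` (noun, mostly capitalized) and `"und"` (mostly lowercase, capitalized only at sentence starts).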

There are a few things I need help with:

  1. What is preferred as a base for analysis? The source provides a frequency list for "news" and for "wikipedia". In this PR I use "news" because I felt the current words were taken from that as well. Which one would be better?
  2. The guideline says to not include "swear words". With a quick search in german_250k.json I found at least a few obvious ones like "Arschloch" (arsehole) and "scheiße" (shit/crap). How serious is this requirement? Should I try to find a few obvious ones and filter them out? There is no guarantee that I can get all of them. What do you think?

Syphdias avatar Jul 07 '22 22:07 Syphdias

The PR check action failed. Please review the logs and make the necessary changes. https://github.com/monkeytypegame/monkeytype/actions/runs/2632719588

github-actions[bot] avatar Jul 08 '22 22:07 github-actions[bot]

  1. What is preferred as a base for analysis? The source provides a frequency list for "news" and for "wikipedia". In this PR I use "news" because I felt the current words were taken from that as well. Which one would be better?

Whichever feels more "natural" to type. Whichever includes the most common words that are used.

The guideline says to not include "swear words". With a quick search in german_250k.json I found at least a few obvious ones like "Arschloch" (arsehole) and "scheiße" (shit/crap). How serious is this requirement? Should I try to find a few obvious ones and filter them out? There is no guarantee that I can get all of them. What do you think?

Keep the swear words to an absolute minimum. PG13 please

Miodec avatar Jul 11 '22 11:07 Miodec

Keep the swear words to an absolute minimum. PG13 please

I cannot guarantee that. I'll try to get the most obvious ones. But to be clear, I found all of them in the current codebase as well. I'll get to it on the weekend probably.

Syphdias avatar Jul 13 '22 22:07 Syphdias

So, I have a few more questions.

  • Should the following be included?
    • people names (I'd say no)
    • company names (I'd say no)
    • names of countries/places (yes, because they can differ between languages)
  • I noticed the lists not being sorted alphabetically – at least the non-250k ones. Is this on purpose because of frequency or should I sort all language files?

Syphdias avatar Jul 16 '22 14:07 Syphdias

So, I have a few more questions.

  • Should the following be included?

    • people names (I'd say no)
    • company names (I'd say no)
    • names of countries/places (yes, because they can differ between languages)
  • I noticed the lists not being sorted alphabetically – at least the non-250k ones. Is this on purpose because of frequency or should I sort all language files?

I agree on the names. Sorting doesn't matter.

Miodec avatar Jul 18 '22 10:07 Miodec

This PR is stale. Please trigger a re-run of the PR check action.

github-actions[bot] avatar Jul 25 '22 20:07 github-actions[bot]

First off: sorry for letting this go stale. Looking through hundreds of thousands of words isn't exactly fun.

I believe, @BlackSagittarius, you initially contributed the 250k German words list. I was wondering how you filtered out bogus inputs in the list. For example, I found some Dutch words for some reason.

Syphdias avatar Jul 31 '22 09:07 Syphdias

To clarify: after removing a lot of bogus entries, only about 206k words are left that appear more than once in the source texts. This means everything after that point is mentioned only once, and filling the list up to 250k means including once-only words running alphabetically from capital A to capital E. I hardly scratched the As, since they are either weird German words ("Abdomentransversaldurchmesser" – "Gesundheit"), footballers' names starting with A, or places I have never heard of.

Syphdias avatar Jul 31 '22 10:07 Syphdias

@Syphdias Sorry for not reacting to the problem earlier. I was the one who contributed the list, and I only checked it very briefly and absolutely not by hand. That's why there are a few, or maybe a lot of, wrong words in there. What I did was make a list while typing: when I see a wrong word, I put it in my list, and at a later point I make a big PR with everything I've got. I thought that having a 250k list with around 0.25% wrong words would be better than having none, so I stuck with this idea.

BlackSagittarius avatar Jul 31 '22 10:07 BlackSagittarius

I also was the one who contributed pretty much everything for German, so you can just ask me if you have anything else, like the capitalized words in the 1k and 10k lists. Though that seems to be sorted out by now...

BlackSagittarius avatar Jul 31 '22 10:07 BlackSagittarius

Do you still happen to have the list with bogus words?

Syphdias avatar Jul 31 '22 10:07 Syphdias

Maybe. Though I can't look atm...

BlackSagittarius avatar Jul 31 '22 10:07 BlackSagittarius

In the end, I would probably be able to find the list again on the internet.

BlackSagittarius avatar Jul 31 '22 10:07 BlackSagittarius

No worries. I followed the link in the comment but couldn't find the exact source you used. Do you remember what it was? News or Wikipedia? And from what year?

In the end I would probably be able to find the list again in the internet.

You mean you had the wrong word list somewhere online?

Syphdias avatar Jul 31 '22 10:07 Syphdias

Nope, I did not have it somewhere online; I just downloaded it from the internet. Also, the list from the link should be the mixed one, because it has more from every kind of text.

BlackSagittarius avatar Jul 31 '22 10:07 BlackSagittarius

Mixed-typical then, from here: https://wortschatz.uni-leipzig.de/de/download/German ?

Syphdias avatar Jul 31 '22 10:07 Syphdias

Yeah the 300k mixed typical.

BlackSagittarius avatar Jul 31 '22 10:07 BlackSagittarius

What's the status of this PR?

Miodec avatar Jul 31 '22 10:07 Miodec

What's the status of this PR?

Figuring out how to get rid of non-German words, and maybe using a better source than the one I currently based it on. Give me a bit to figure it out. If I really need to analyze more by hand, it could drag on a bit. I'd say you can ignore it until I ping you, if that's okay with you :)

Syphdias avatar Jul 31 '22 10:07 Syphdias

Sounds good

Miodec avatar Jul 31 '22 10:07 Miodec

Yeah the 300k mixed typical.

@BlackSagittarius, hm, I still cannot generate the same set of words. I tried the 300k sample size and the 1M size. There are always words that only show up in the current file, and there is also always a set of words that only show up in my word list. It's also a shame that the data set is over 10 years old now (2011).

Syphdias avatar Jul 31 '22 11:07 Syphdias

I am like 90% certain it was this list. I kinda changed it a bit before submitting it, but if there is too much of a difference, I don't know what it is.

BlackSagittarius avatar Jul 31 '22 11:07 BlackSagittarius

Using the mixed-typical data, I get about 134k words that show up more than once; after that I can only use the words that came up once, starting with capital A through capital O.

I am like 90% certain it was this list. I kinda changed a bit before submitting it already, but if there is too much of a difference, I don't know what it is.

Comparing what I would generate from "typical mixed 300K" (left) with the current german_250k.json (right) using vimdiff, I get stuff like this: [screenshot of vimdiff comparison]

I could try combining multiple datasets, like mixed-typical (2011), news (2021), and wikipedia (2021). I imagine I would need to find more names and filter them out...
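(Editor's note: combining several frequency lists and keeping only words seen more than once, as discussed above, could be sketched like this. `combine` is a hypothetical helper; the corpus names are illustrative and the actual script at the linked repo may work differently.)

```python
from collections import Counter

def combine(corpora, limit=250_000):
    """Sum word frequencies across several corpora (e.g. news 2021,
    wikipedia 2021, mixed-typical 2011) and keep only the words that
    appear more than once overall, most frequent first, capped at
    `limit` entries."""
    total = Counter()
    for freq in corpora:
        total.update(freq)  # Counter.update adds counts, per word
    return [word for word, count in total.most_common(limit) if count > 1]
```

Merging corpora this way lifts rare-but-real words above the "seen once" cutoff when they appear in more than one dataset, which is exactly the filler-word problem described earlier in the thread.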

Syphdias avatar Jul 31 '22 12:07 Syphdias

If you are able to do this, I absolutely encourage you. I am not able to code the slightest bit, so I pretty much had to do everything by hand. I hope you can get better results than I got.

BlackSagittarius avatar Jul 31 '22 12:07 BlackSagittarius

I rebased and pushed the current state of progress. I did not get around to manually "scanning" all words. I did manage to combine multiple lists and tried it with: wikipedia_2021_1M, mixed-typical_2011_1M, news_2021_1M, wikipedia_2021_300K, mixed-typical_2011_300K, and news_2021_300K.

Since I generate the lists by script, I already automatically filter out a few things:

  • Remove everything that is not a German letter (exclamation points, commas, foreign characters)
  • Remove one-letter words, like "m" (wtf?)
  • Remove abbreviations (words with only capital letters)
  • Remove words with at least one capital letter in the middle

Edit: I know where the single letters come from: meters, seconds, etc.
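(Editor's note: the four filter rules listed above can be expressed as a single predicate. This is an illustrative sketch; `keep` is a hypothetical helper and the real script at the linked repo may implement the rules differently.)

```python
import re

# Accept either an all-lowercase word or a capitalized word with a
# lowercase remainder, using only German letters (incl. umlauts and ß).
# This single pattern rejects foreign characters, punctuation,
# all-caps abbreviations, and mid-word capitals in one pass.
GERMAN_WORD = re.compile(r"[a-zäöüß]+|[A-ZÄÖÜ][a-zäöüß]+")

def keep(word):
    """Return True if the word passes the filter rules above."""
    if len(word) < 2:  # drop one-letter words like "m" (meters), "s" (seconds)
        return False
    return GERMAN_WORD.fullmatch(word) is not None
```

For example, `keep("Haus")` and `keep("scheiße")` pass, while `"m"` (one letter), `"USA"` (all caps), `"McDonald"` (capital in the middle), and `"café"` (non-German letter) are all rejected.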

Syphdias avatar Jul 31 '22 15:07 Syphdias

Marking as draft as it's not ready to be merged.

Miodec avatar Aug 03 '22 20:08 Miodec

This PR is stale. Please trigger a re-run of the PR check action.

github-actions[bot] avatar Aug 10 '22 20:08 github-actions[bot]

The PR check action failed. Please review the logs and make the necessary changes. https://github.com/monkeytypegame/monkeytype/actions/runs/2846952127

github-actions[bot] avatar Aug 12 '22 13:08 github-actions[bot]