WIP: Update German Language Files
I noticed some of the German words being capitalised when they should not be – unless at the start of a sentence. I believe this happened because the sentences analysed were not normalized to fix this issue.
I believe that in monkeytype without punctuation, words should be capitalised (or not) as if they were in the middle of a sentence. Only if punctuation is enabled should words get capitalised after a period.
To fix this I took a look at the original source mentioned in the comment section of the JSON files. I wrote a script to download the frequency map, correct the capitalization, and remove abbreviations and non-German words.
I do not believe this list of corrections to be exhaustive and it might need adjustments in the future. Please feel free to open a PR against the script I used: https://github.com/Syphdias/monkeytype-generate-german
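As a rough illustration of the capitalization-correction step, a minimal sketch could look like this (hypothetical helper, not the actual code from the linked repo; it assumes the frequency list is already loaded as a word-to-count mapping):

```python
from collections import defaultdict

def normalize_capitalization(freq):
    """Merge capitalization variants of each word and keep the more
    frequent spelling: German nouns stay capitalized, while non-nouns
    that were only capitalized at sentence starts collapse back to
    lowercase, because their lowercase form dominates the counts."""
    groups = defaultdict(dict)
    for word, count in freq.items():
        groups[word.lower()][word] = count
    result = {}
    for variants in groups.values():
        best = max(variants, key=variants.get)       # most frequent spelling
        result[best] = sum(variants.values())        # merged total count
    return result
```

For example, if "und" was counted 500 times lowercase and 5 times as sentence-initial "Und", only "und" survives with the combined count.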
I have a few things I need help with:
- What is preferred as a base for analysis? The source provides a frequency list for "news" and for "wikipedia". In this PR I use "news" because I felt the current words were taken from that as well. Which one would be better?
- The guideline says not to include "swear words". With a quick search in german_250k.json I found at least a few obvious ones like "Arschloch" (arsehole) and "scheiße" (shit/crap). How serious is this requirement? Should I try to find a few obvious ones and filter them out? There is no guarantee that I can get all of them. What do you think?
The PR check action failed. Please review the logs and make the necessary changes. https://github.com/monkeytypegame/monkeytype/actions/runs/2632719588
- What is preferred as a base for analysis? The source provides a frequency list for "news" and for "wikipedia". In this PR I use "news" because I felt the current words were taken from that as well. Which one would be better?
Whichever feels more "natural" to type. Whichever includes the most common words that are used.
The guideline says not to include "swear words". With a quick search in german_250k.json I found at least a few obvious ones like "Arschloch" (arsehole) and "scheiße" (shit/crap). How serious is this requirement? Should I try to find a few obvious ones and filter them out? There is no guarantee that I can get all of them. What do you think?
Keep the swear words to an absolute minimum. PG13 please
Keep the swear words to an absolute minimum. PG13 please
I cannot guarantee that. I'll try to get the most obvious ones. But to be clear, I found all of them in the current codebase as well. I'll probably get to it on the weekend.
So, I have a few more questions.
- Should the following be included?
- people names (I'd say no)
- company names (I'd say no)
- names of countries/places (yes, because they can be different in different languages)
- I noticed the lists not being sorted alphabetically – at least the non-250k ones. Is this on purpose because of frequency or should I sort all language files?
So, I have a few more questions.
Should the following be included?
- people names (I'd say no)
- company names (I'd say no)
- names of countries/places (yes, because they can be different in different languages)
I noticed the lists not being sorted alphabetically – at least the non-250k ones. Is this on purpose because of frequency or should I sort all language files?
I agree on the names. Sorting doesn't matter.
This PR is stale. Please trigger a re-run of the PR check action.
First off: sorry for letting this go stale. Looking through hundreds of thousands of words isn't exactly fun.
I believe, @BlackSagittarius, you initially contributed the 250k German word list. I was wondering how you filtered out bogus inputs in the list. For example, I found some Dutch words for some reason.
To clarify: after removing a lot of bogus entries, only about 206k words are left that appear more than once in the source texts. Everything that follows appears only once, so filling up to 250k means taking words from roughly capital A to capital E. I have hardly scratched the As, since they are either weird German words ("Abdomentransversaldurchmesser" – "Gesundheit"), footballers' names starting with A, or places I have never heard of.
@Syphdias Sorry for not reacting to the problem earlier. I was the one who contributed the list, and I only checked it very briefly and absolutely not by hand. That's why there are a few or maybe a lot of wrong words in there. So what I did was make a list while typing: whenever I saw a wrong word I put it in my list, so that at a later point I could make one big PR with everything I had collected. I thought that having a 250k list with around 0.25% wrong words would be better than having none, so I stuck with this idea.
I was also the one who contributed pretty much everything for German, so you can just ask me if you have anything else, like the capitalized words in the 1k and 10k lists. Though that seems to be sorted by now...
Do you still happen to have the list with bogus words?
Maybe. Though I can't look atm...
In the end I would probably be able to find the list again on the internet.
No worries. I followed the link in the comment but couldn't find the exact source you used. Do you remember what it was? News or Wikipedia? And from what year?
In the end I would probably be able to find the list again on the internet.
You mean you had the wrong word list somewhere online?
Nope, I did not have it online somewhere; I just downloaded it from the internet. Also, the list from the link should be the mixed one, because it contains more from every kind of text.
Mixed-typical then, from here: https://wortschatz.uni-leipzig.de/de/download/German ?
Yeah the 300k mixed typical.
What's the status of this PR?
What's the status of this PR?
Figuring out how to get rid of non-German words, and maybe using a better source than the one I currently based it on. Give me a bit to figure it out. If I really need to analyse more by hand, it could drag out a bit. I'd say you can ignore it until I ping you, if that's okay with you :)
Sounds good
Yeah the 300k mixed typical.
@BlackSagittarius, hm, I still cannot generate the same set of words. I tried the 300k sample size and the 1M size. There are always words that only show up in the current file, and always a set of words that only show up in my word list. It's also a shame that the data set is now over 10 years old (2011).
I am like 90% certain it was this list. I kinda changed it a bit before submitting already, but if there is too much of a difference, I don't know what it is.
Using the mixed-typical data, I get about 134k words that show up more than once; after that I can only use the words that came up once, starting with capital A through capital O.
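The selection step described here can be sketched like this (a hypothetical helper, not the actual script; `freq` is a word-to-count mapping and the once-seen top-up happens in alphabetical order, which is why the tail gets stuck in the capital letters):

```python
def select_words(freq, target=250_000):
    """Keep all words seen more than once; if that falls short of the
    target size, top up with once-seen words in alphabetical order."""
    frequent = [w for w, c in freq.items() if c > 1]
    if len(frequent) >= target:
        # Enough frequent words: keep the most common ones.
        return sorted(frequent, key=freq.get, reverse=True)[:target]
    # Otherwise fill the remainder from the alphabetically sorted
    # once-seen words (capital letters sort before lowercase).
    once = sorted(w for w, c in freq.items() if c == 1)
    return frequent + once[: target - len(frequent)]
```

With only 134k frequent words against a 250k target, the remaining ~116k slots all come from that alphabetical once-seen tail.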
I am like 90% certain it was this list. I kinda changed it a bit before submitting already, but if there is too much of a difference, I don't know what it is.
Comparing what I would generate from "typical mixed 300K" (left) with the current german_250k.json (right) using vimdiff, I get stuff like this:
I could try combining multiple datasets like mixed typical (2011), news (2021), wikipedia (2021). I would need to find more names and filter them out I imagine...
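For reference, this kind of comparison can also be scripted instead of eyeballed in vimdiff. A small sketch using set operations (toy lists here; the real inputs would be the generated word list and the `words` array from german_250k.json, assuming that is the JSON key monkeytype uses):

```python
def diff_word_lists(generated, current):
    """Return the words unique to each list, sorted, so the two
    'only in' columns can be reviewed side by side."""
    gen, cur = set(generated), set(current)
    return sorted(gen - cur), sorted(cur - gen)

only_generated, only_current = diff_word_lists(
    ["apfel", "baum", "chor"],
    ["baum", "chor", "dorf"],
)
```

Here `only_generated` holds the words missing from the current file and `only_current` the words my script failed to reproduce.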
If you are able to do this, I absolutely encourage you. I am not able to code the slightest bit, so I pretty much had to do everything by hand. I hope you can get better results than I got.
I rebased and pushed the current state of progress. I did not get around to manually "scanning" all the words. I did manage to combine multiple lists and tried it with: wikipedia_2021_1M, mixed-typical_2011_1M, news_2021_1M, wikipedia_2021_300K, mixed-typical_2011_300K and news_2021_300K.
Since I generate the lists by script I already automatically filter for a few things:
- Remove everything that is not a German letter (exclamation points, commas, foreign characters)
- Remove one-letter words, like "m" (wtf?)
- Remove abbreviations (words with only capital letters)
- Remove words with at least one capital letter in the middle
Edit: I know where the single letters come from: meters, seconds, etc.
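As a rough illustration, the filters above could look like this (a sketch, not the exact code from the linked repo):

```python
import re

# German letters, including umlauts and sharp s (assumed alphabet).
GERMAN_WORD = re.compile(r"[a-zäöüßA-ZÄÖÜ]+")

def keep(word):
    """Apply the four filters from the list above."""
    if not GERMAN_WORD.fullmatch(word):
        return False  # punctuation, digits, foreign characters
    if len(word) < 2:
        return False  # unit abbreviations like "m" (meters) or "s" (seconds)
    if word.isupper():
        return False  # all-caps abbreviations
    if any(c.isupper() for c in word[1:]):
        return False  # capital letter in the middle of the word
    return True
```

For example `keep("Haus")` passes while `keep("m")`, `keep("DDR")` and `keep("wort!")` are all filtered out.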
Marking as draft as it's not ready to be merged.
This PR is stale. Please trigger a re-run of the PR check action.
The PR check action failed. Please review the logs and make the necessary changes. https://github.com/monkeytypegame/monkeytype/actions/runs/2846952127