jwordsplitter
Lexicon expansion
Testing jwordsplitter on a dataset of German technical vocabulary turned up a number of words that were missing from the languagetool_dict.txt and germanPrefixes.txt lists. These words have been added and the tests have been adjusted accordingly. Further testing may produce more suggestions for words to add.
Thanks. Have you checked the comment at languagetool-dict.txt? It's an export from LanguageTool, so any changes would be lost with the next update. Or have you run a new export?
Thanks for the hint! No, I haven't run a new export 😕
If I may ask: since, unlike for the other languages, there's no german.dict in org/languagetool/resource/de/, are you using the de_DE.dict in org/languagetool/resource/de/hunspell as the German dictionary? I'd like to check the whole LanguageTool lexicon for German so that I don't add any duplicates to added.txt.
No, we've used german.dict and added.txt, which are not used for spelling but contain part-of-speech information.
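The duplicate check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the lexicon is assumed to be available as plain-text lines whose first tab-separated field is the token, lines starting with `#` are assumed to be comments, and the function names `known_tokens` and `new_words` as well as the sample entries are hypothetical.

```python
def known_tokens(lexicon_lines, sep="\t"):
    """Collect the first field (the token) of every non-empty, non-comment line."""
    tokens = set()
    for line in lexicon_lines:
        line = line.strip()
        # Assumption: blank lines and lines starting with '#' carry no entry.
        if not line or line.startswith("#"):
            continue
        tokens.add(line.split(sep)[0])
    return tokens


def new_words(candidates, lexicon_lines, sep="\t"):
    """Return candidates not yet in the lexicon, keeping order and dropping repeats."""
    known = known_tokens(lexicon_lines, sep)
    seen = set()
    result = []
    for word in candidates:
        word = word.strip()
        # Matching is case-sensitive, since German nouns are capitalized.
        if word and word not in known and word not in seen:
            seen.add(word)
            result.append(word)
    return result


# Hypothetical sample data; "TAG" is a placeholder, not a real LanguageTool tag.
lexicon = ["# comment", "Felge\tFelge\tTAG", "lila\tlila\tTAG"]
print(new_words(["Alufelge", "lila", "Acryl", "Alufelge"], lexicon))
```

Only the words that survive this filter would then need a lemma and a PoS tag before being added.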
Ok great, thanks for the quick response! 😊
Do the tags and variants for German words that are unknown to the tagger, e.g. "Alufelge", have to be added manually? I assume that words without tags shouldn't be added to added.txt, but that the format should be token - lemma - PoS-tag 🤔
added.txt is part of LanguageTool, so I guess you're talking about that. As a compound, "Alufelge" is decompounded by jwordsplitter. If that doesn't work, adding alu and/or felge to additions.txt in jwordsplitter should help. Or you can add the compound to added.txt in LanguageTool using the format you mentioned.
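To illustrate why adding the parts to a word list can make a compound splittable, here is a toy dictionary-based splitter. It is not jwordsplitter's actual algorithm or API; the function `split_compound`, the `min_len` parameter, and the sample word sets are hypothetical and only show the general idea.

```python
def split_compound(word, parts, min_len=3):
    """Split word into known parts, trying the longest head first.

    Returns a list of parts, or None if the word cannot be fully covered.
    """
    lower = word.lower()
    if lower in parts:
        return [lower]
    # Try every split position that leaves at least min_len characters on each side.
    for end in range(len(lower) - min_len, min_len - 1, -1):
        head = lower[:end]
        if head in parts:
            rest = split_compound(lower[end:], parts, min_len)
            if rest is not None:
                return [head] + rest
    return None


# Without "alu" in the word list the compound cannot be split.
print(split_compound("Alufelge", {"felge"}))
# With both parts known, the split succeeds.
print(split_compound("Alufelge", {"alu", "felge"}))
```

A real splitter also has to deal with interfixes, exceptions, and ambiguous splits, which this sketch ignores.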
Yes, sorry. As you suggested, I moved to LanguageTool to add all missing words there. The only problem is that for each missing word the corresponding tag has to be obtained from the tagger, and there are quite a few words the tagger doesn't know. Is there any other automatic way to get the PoS info when the tagger doesn't provide it, or do I have to manually annotate the 700 words from my list?
If the tagger doesn't know the words, then there's not much you can do other than add them manually. Could you post some examples of unknown words?
Sure, some examples would be: Acryl, Bändchen, befüllen, Freundlichkeit, Inspektion, lila, Mountainbike, PH-Wert, schließ (as a prefix or short verb form), techno, vertikal, x-fach. The file below contains all words: wordsTaggerUnknown.txt
Thanks, I've forwarded this to Julian of korrekturen.de, who helps us maintain the dictionary. Maybe he will add those words. But in any case it will take some time until they end up in LT.
Cool 😊 I'll let you know if I find more words when checking the splitter on other data sets