jwordsplitter

Lexicon expansion

Open GiPfi opened this issue 7 years ago • 11 comments

After testing jwordsplitter on a dataset of German technical vocabulary, a number of words have been extracted which so far had been missing in the languagetool_dict.txt and germanPrefixes.txt lists. These words have been included and the tests have been adjusted accordingly. Further testing may result in more suggestions for words to be added.

GiPfi avatar Oct 18 '17 13:10 GiPfi

Thanks. Have you checked the comment in languagetool-dict.txt? It's an export from LanguageTool, so any changes would be lost with the next update. Or have you run a new export?

danielnaber avatar Oct 18 '17 15:10 danielnaber

Thanks for the hint! No, I haven't run a new export 😕

If I may ask... since unlike for the other languages, there's no german.dict in org/languagetool/resource/de/, are you using the de_DE.dict in org/languagetool/resource/de/hunspell as the German dictionary? I'd like to check the whole LanguageTool-lexicon for German to not add any duplicates to added.txt

GiPfi avatar Oct 19 '17 12:10 GiPfi
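
The duplicate check described above could be sketched roughly as follows. This is only an illustration, not part of jwordsplitter or LanguageTool: the helper names `load_known_tokens` and `new_candidates` are hypothetical, and the sketch assumes that each lexicon line is tab-separated with the token in the first column and that comment lines start with `#`. The tag in the usage example is illustrative only.

```python
def load_known_tokens(lines):
    """Collect the first column of every non-empty, non-comment line.

    Assumption: columns are tab-separated and comments start with '#'.
    """
    known = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        known.add(line.split("\t")[0])
    return known


def new_candidates(candidates, known):
    """Return candidates that are not yet known, dropping repeats, in input order."""
    seen = set()
    result = []
    for word in candidates:
        if word in known or word in seen:
            continue
        seen.add(word)
        result.append(word)
    return result


if __name__ == "__main__":
    # Hypothetical lexicon content; real files would be read with open(...).
    existing = ["# exported entries", "Acryl\tAcryl\tSUB:NOM:SIN:NEU"]
    known = load_known_tokens(existing)
    print(new_candidates(["Acryl", "Alufelge", "Alufelge", "lila"], known))
```

The comparison is case-sensitive on purpose, since German capitalization distinguishes nouns from other parts of speech.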

No, we've used german.dict and added.txt, which are not used for spelling but contain part-of-speech information.

danielnaber avatar Oct 19 '17 13:10 danielnaber

Ok great, thanks for the quick response! 😊

GiPfi avatar Oct 19 '17 14:10 GiPfi

Do the tags and variants for German words that are unknown to the tagger, e.g. "Alufelge", have to be added manually? I assume that words without tags shouldn't be added to added.txt, and that the format should be token - lemma - PoS-tag 🤔

GiPfi avatar Oct 24 '17 12:10 GiPfi

added.txt is part of LanguageTool, so I guess you're talking about that. As a compound, it's decompounded by jwordsplitter. If that doesn't work, adding alu and/or felge to additions.txt in jwordsplitter should help. Or you can add the compound to added.txt in LanguageTool using the format you mentioned.

danielnaber avatar Oct 24 '17 12:10 danielnaber
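
As a rough illustration of the token - lemma - PoS-tag format mentioned above, the sketch below builds one entry line and refuses entries with a missing field, in line with the point that untagged words should not be added. The function name `format_entry` is hypothetical, the tab separator is an assumption about the file layout, and the tag `SUB:NOM:SIN:FEM` is only an example value, not a verified tag for this word.

```python
def format_entry(token, lemma, pos_tag, sep="\t"):
    """Build one 'token<sep>lemma<sep>PoS-tag' line.

    Assumption: fields are joined by a tab. Every field must be non-empty
    and must not itself contain the separator.
    """
    fields = (("token", token), ("lemma", lemma), ("pos_tag", pos_tag))
    for name, value in fields:
        if not value or not value.strip():
            raise ValueError(f"{name} must not be empty")
        if sep in value:
            raise ValueError(f"{name} must not contain the separator")
    return sep.join((token, lemma, pos_tag))


if __name__ == "__main__":
    # Example tag only; the real tag has to come from the tagger or manual annotation.
    print(format_entry("Alufelge", "Alufelge", "SUB:NOM:SIN:FEM"))
```

Inflected variants of a word would each need their own line with the same lemma and a different tag.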

Yes, sorry, as you suggested I moved to LanguageTool to add all missing words there. However, for each missing word the corresponding tag has to be obtained from the tagger, and there are quite a few words the tagger doesn't know. I was wondering if there is any other automatic way to get the PoS info when the tagger doesn't provide it, or if I would have to manually annotate the 700 words from my list?

GiPfi avatar Oct 24 '17 12:10 GiPfi

If the tagger doesn't know the words, then there's not much you can do other than add them manually. Could you post some examples of unknown words?

danielnaber avatar Oct 24 '17 13:10 danielnaber

Sure, some examples would be: Acryl, Bändchen, befüllen, Freundlichkeit, Inspektion, lila, Mountainbike, PH-Wert, schließ (as a prefix or short verb form), techno, vertikal, x-fach. The file below contains all words: wordsTaggerUnknown.txt

GiPfi avatar Oct 24 '17 13:10 GiPfi

Thanks, I've forwarded this to Julian of korrekturen.de, who helps us maintain the dictionary. Maybe he will add those words. But in any case it will take some time until they end up in LT.

danielnaber avatar Oct 24 '17 13:10 danielnaber

Cool 😊 I'll let you know if I find more words when checking the splitter on other data sets

GiPfi avatar Oct 26 '17 06:10 GiPfi