jwordsplitter
small Java library for splitting German compound words
Some rules of Dutch currently make JWordSplitter less suitable for that language. The most difficult part is filtering the detected compounds: _autoonderdeel_ is not acceptable, even though _auto_ and _onderdeel_...
`GermanWordSplitter` does not split words such as _FPÖ-Chefverhandler_. I think that, at least in German, any word with mid-word hyphens could be decomposed into the parts separated by the hyphens.
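The hyphen idea above can be sketched as a small pre-processing step. This is only an illustration in plain Java: the class `HyphenPreSplitter` and its method `splitOnHyphens` are hypothetical names, not part of jwordsplitter, and the sketch does not call the library at all.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jwordsplitter: splits a word at its hyphens.
final class HyphenPreSplitter {

    // Returns the non-empty parts between hyphens, in their original order.
    static List<String> splitOnHyphens(String word) {
        List<String> parts = new ArrayList<>();
        for (String part : word.split("-")) {
            // Skip empty parts produced by leading or doubled hyphens.
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitOnHyphens("FPÖ-Chefverhandler"));
    }
}
```

For _FPÖ-Chefverhandler_ this yields the two parts _FPÖ_ and _Chefverhandler_; each part could then be handed to the compound splitter separately.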
After testing jwordsplitter on a dataset of German technical vocabulary, a number of words were extracted that had so far been missing from the `languagetool_dict.txt` and `germanPrefixes.txt` lists. These...
Some compounds will not be decomposed, because the algorithm searches for the longest match and the Morphy/LanguageTool-based dictionary itself contains compounds. Examples:

```
Einkommensempfängerin
Wehrmachtsamt
Schwingflügel
```

Solution: remove compounds from `test-de-large.txt`,...
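The effect of compound entries on a longest-match search can be illustrated with a toy splitter. This is a simplified sketch, not jwordsplitter's actual algorithm: the class `LongestMatchDemo`, its `split` method, and the tiny word sets are assumptions made for illustration, and interfixes such as the linking _s_ are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy greedy longest-prefix splitter (hypothetical; not the library's code).
final class LongestMatchDemo {

    static List<String> split(String word, Set<String> dict) {
        List<String> parts = new ArrayList<>();
        String rest = word.toLowerCase(Locale.ROOT);
        while (!rest.isEmpty()) {
            // Find the longest prefix of the remaining text that is in the dictionary.
            int end = rest.length();
            while (end > 0 && !dict.contains(rest.substring(0, end))) {
                end--;
            }
            if (end == 0) {
                return List.of(word); // no known prefix: leave the word unsplit
            }
            parts.add(rest.substring(0, end));
            rest = rest.substring(end);
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> withCompound = Set.of("schwing", "flügel", "schwingflügel");
        Set<String> withoutCompound = Set.of("schwing", "flügel");
        System.out.println(split("Schwingflügel", withCompound));    // one part
        System.out.println(split("Schwingflügel", withoutCompound)); // two parts
    }
}
```

With the compound in the word set, the longest match is the whole word and nothing is split; once the compound is removed, the same search returns the two components.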