dkpro-jwktl
jwktl hangs on Russian orthography
Using the English version of Wiktionary.
Latest Wiktionary dump downloaded and parsed.
Using Gradle: compile 'com.github.dkpro:dkpro-jwktl:56499bdaab' to obtain the latest snapshot.
I'm analyzing text from various sources, and some Russian (I presume) text is in my test data. The call wkt.getEntriesForWord("Статтю", true); hangs as if it were in an infinite loop.
I was expecting an empty entries list, not an application hang.
Example term: Статтю
Not really an infinite loop, but definitely unexpected behavior. As a quick fix, you can remove the boolean parameter, i.e., use wkt.getEntriesForWord("Статтю"); instead. Normalization of titles is not supported for non-Latin alphabets, so the same issue also occurs for other non-Latin entries, e.g., Russian ones. I'll see if I can solve the actual issue in one of the later versions. Please report back if removing the normalization parameter helps for you.
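If normalization should stay on for Latin-script terms, one option is to decide per word whether to pass the boolean. The following is only a minimal sketch using the JDK's Character.UnicodeScript; the class and method names (NormalizationGuard, isLatinOnly) are made up for illustration and are not part of JWKTL.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of JWKTL: decides whether title
// normalization is safe to request for a given word.
final class NormalizationGuard {

    private NormalizationGuard() {
    }

    /**
     * Returns true if every letter of the word belongs to the Latin script.
     * Digits, punctuation and whitespace are ignored; an empty string
     * therefore counts as Latin-only.
     */
    static boolean isLatinOnly(String word) {
        return word.codePoints()
                .filter(Character::isLetter)
                .allMatch(cp -> UnicodeScript.of(cp) == UnicodeScript.LATIN);
    }
}
```

With that in place, the lookup from this thread could be written as wkt.getEntriesForWord(word, NormalizationGuard.isLatinOnly(word)), so that Cyrillic terms such as "Статтю" are looked up without normalization.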
As normalization is really wanted, we have instead implemented a step where we run the terms through IBM's icu4j to generate ASCII transliterations. This applies to any term whose characters fall outside the normal English/Western European range and that was not pruned by an earlier language check. So far so good.
We've also worked out a methodology (via Gradle) to automatically build the DB when an updated Wiktionary dump is available and package it into a jar, together with a utility routine that extracts the DB to a temp folder when it is needed to create a Wiktionary instance.
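For anyone wanting to try the same approach, here is a rough sketch of what such a pre-processing step might look like. It is not our actual code: TermPreprocessor, inWesternRange and prepare are hypothetical names, the "Western European range" is assumed here to mean code points up to U+00FF (Latin-1), and the transliterator is passed in as a plain function so the sketch compiles without icu4j on the classpath.

```java
import java.util.function.UnaryOperator;

// Hypothetical pre-processing step: terms inside the assumed Latin-1 range
// are passed through untouched, everything else is transliterated first.
final class TermPreprocessor {

    private final UnaryOperator<String> transliterator;

    // With icu4j available, the function could be (assumption, not shown here):
    //   Transliterator.getInstance("Any-Latin; Latin-ASCII")::transliterate
    TermPreprocessor(UnaryOperator<String> transliterator) {
        this.transliterator = transliterator;
    }

    /** True if all code points are at or below U+00FF (assumed cut-off). */
    static boolean inWesternRange(String term) {
        return term.codePoints().allMatch(cp -> cp <= 0x00FF);
    }

    /** Returns the term unchanged, or its transliteration if out of range. */
    String prepare(String term) {
        return inWesternRange(term) ? term : transliterator.apply(term);
    }
}
```

The prepared term can then be handed to wkt.getEntriesForWord(term, true) as before, since it no longer contains non-Latin characters.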
On 8/19/19 11:30 AM, Christian M. Meyer wrote:
Not really an infinite loop, but definitely unexpected behavior. As a quick-fix, you can remove the boolean param, i.e., use wkt.getEntriesForWord("Статтю"); instead. Normalization of titles is not supported for non-Latin alphabets and causes this issue also for other, e.g., Russian entries. I'll see if I can solve the actual issue in one of the later versions. Please report back if removing the normalization param helps for you.