dkpro-jwktl

jwktl hangs on russian orthography

Open michael-newsrx opened this issue 5 years ago • 2 comments

Using English version.

Latest wiktionary downloaded and parsed.

Using gradle: compile 'com.github.dkpro:dkpro-jwktl:56499bdaab' to obtain latest snapshot.

I'm analyzing text from various sources, and some Russian (I presume) text is in my test data. The call wkt.getEntriesForWord("Статтю", true); hangs as if it were in an infinite loop.

I was expecting an empty list of entries, not an app hang.

Example term: Статтю
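
For anyone hitting the same hang, here is a minimal sketch of a guard that turns a lookup that does not return in time into an empty list. `GuardedLookup` and `lookupOrEmpty` are hypothetical names, not part of JWKTL, and the approach assumes it is acceptable to abandon the slow lookup on a background thread.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of JWKTL): bounds the time spent in one lookup. */
final class GuardedLookup {
    private GuardedLookup() {}

    /** Runs the lookup on a daemon thread; returns an empty list if it times out. */
    static <T> List<T> lookupOrEmpty(Callable<List<T>> lookup, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "guarded-lookup");
            t.setDaemon(true); // a stuck lookup must not keep the JVM alive
            return t;
        });
        Future<List<T>> future = executor.submit(lookup);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: this only interrupts the worker
            return Collections.emptyList();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return Collections.emptyList();
        } catch (ExecutionException e) {
            throw new IllegalStateException("lookup failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With a `wkt` instance as above, the call would look like `GuardedLookup.lookupOrEmpty(() -> wkt.getEntriesForWord(word, true), 2000)`. Note that cancellation only interrupts the worker thread; a lookup that ignores interruption keeps running in the background until it finishes.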

michael-newsrx, Aug 15 '19 21:08

It's not really an infinite loop, but it is definitely unexpected behavior. As a quick fix, you can remove the boolean param, i.e., use wkt.getEntriesForWord("Статтю"); instead. Normalization of titles is not supported for non-Latin alphabets, which causes this issue for other non-Latin entries as well, e.g., Russian ones. I'll see if I can solve the actual issue in one of the later versions. Please report back if removing the normalization param helps for you.
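
If you still want normalization for Latin-script words, the quick fix can be applied per word. The following is only a sketch: `NormalizationGuard` and `isLatinOnly` are hypothetical names, and the idea that a Latin-only check is enough to avoid the slow path is an assumption, not something verified against the JWKTL internals.

```java
/** Hypothetical helper (not part of JWKTL): decides whether to request normalization. */
final class NormalizationGuard {
    private NormalizationGuard() {}

    /** True when every letter in the word belongs to the Latin script. */
    static boolean isLatinOnly(String word) {
        return word.codePoints()
                .filter(Character::isLetter) // digits, hyphens, spaces etc. are ignored
                .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.LATIN);
    }
}

// Possible use with the two lookup overloads mentioned in this thread:
//   entries = NormalizationGuard.isLatinOnly(word)
//           ? wkt.getEntriesForWord(word, true)   // normalized lookup
//           : wkt.getEntriesForWord(word);        // exact lookup, no normalization
```

A word without any letters counts as Latin-only here, so it would still take the normalized path.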

chmeyer, Aug 19 '19 15:08

As normalization is really wanted, we have instead implemented a step that runs terms through IBM's icu4j to generate "ascii/ansi" transliterations. It applies to any term whose character set falls outside the normal English/Western European range and that was not already pruned by an earlier language check. So far so good.
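
Roughly, the routing looks like the sketch below. This is not our production code: `TermPreprocessor` is a hypothetical name, treating U+0000..U+00FF as the "Western European range" is an assumption, and the transliterator is passed in as a plain function so the snippet compiles without icu4j.

```java
import java.util.function.UnaryOperator;

/** Hypothetical sketch: transliterate only the terms that fall outside Latin-1. */
final class TermPreprocessor {
    private TermPreprocessor() {}

    /** True when every code point is in Basic Latin or Latin-1 Supplement (assumed range). */
    static boolean inWesternRange(String term) {
        return term.codePoints().allMatch(cp -> cp <= 0x00FF);
    }

    /** Returns the term unchanged if it is in range, otherwise its transliteration. */
    static String toLookupKey(String term, UnaryOperator<String> transliterator) {
        return inWesternRange(term) ? term : transliterator.apply(term);
    }
}

// With icu4j on the classpath, the function could be built from
// com.ibm.icu.text.Transliterator, for example (the transform ID is an assumption):
//   Transliterator t = Transliterator.getInstance("Any-Latin; Latin-ASCII");
//   String key = TermPreprocessor.toLookupKey(term, t::transliterate);
```

The resulting key can then be passed to the normalized lookup, since it only contains characters the normalization handles.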

We've also worked out a Gradle-based process that automatically rebuilds the DB when an updated Wiktionary dump is available and packs it into a jar, along with a utility routine that extracts the DB to a temp folder when a Wiktionary instance needs to be created.
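
The extraction routine can be sketched as follows. It assumes the parsed DB directory is bundled in the jar as a single zip resource; the class name `BundledDbExtractor` and the resource name in the usage comment are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical sketch: unpacks a zipped DB directory into a fresh temp folder. */
final class BundledDbExtractor {
    private BundledDbExtractor() {}

    static Path extractToTemp(InputStream zippedDb) throws IOException {
        Path target = Files.createTempDirectory("wiktionary-db-");
        try (ZipInputStream zip = new ZipInputStream(zippedDb)) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                Path out = target.resolve(entry.getName()).normalize();
                if (!out.startsWith(target)) {
                    // reject entries such as "../x" that would land outside the temp folder
                    throw new IOException("Entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    Files.copy(zip, out); // reads up to the end of the current entry
                }
            }
        }
        return target;
    }
}

// Possible use (the resource name is an assumption):
//   try (InputStream in = SomeClass.class.getResourceAsStream("/wiktionary-db.zip")) {
//       Path dbDir = BundledDbExtractor.extractToTemp(in);
//       IWiktionaryEdition wkt = JWKTL.openEdition(dbDir.toFile());
//   }
```

The temp folder is not cleaned up here; whether to delete it on exit or reuse it across runs depends on how large the DB is and how often the instance is created.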

michael-newsrx, Aug 19 '19 16:08