KOReader Japanese support fails to find the longest selection in half of the cases and selects only one character

On version 2022.07 and earlier, on Android.

In the following example, pasted from the source of the EPUB file (Miyamoto Musashi by Yoshikawa, available on Aozora):

<ruby><rb>年端</rb>〘<rt>としは</rt>〙</ruby>もゆかない小娘が、しかも夜、ただひとり月の下で、無数の死骸の中にかくれ、いったい、何を働いているのか。

The Japanese support plugin fails to select the longest possible string of characters for the following words.

In these cases it selects only the first character:

小娘
無数
死骸
のか

In these cases it selects a subset but not the longest possible string:

いった instead of いったい
働いてい instead of 働いている
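
The expected behaviour described above amounts to a greedy longest match against dictionary headwords. A minimal sketch in Python, not the plugin's actual code (the plugin is written in Lua): the `HEADWORDS` set and the `longest_match` function are hypothetical, and inflected forms such as 働いている are left out because handling them requires deinflection, which this sketch ignores.

```python
# Toy headword set standing in for a real dictionary (hypothetical data).
HEADWORDS = {"小", "小娘", "無", "無数", "死", "死骸", "いったい"}


def longest_match(text: str, start: int, max_len: int = 8) -> str:
    """Return the longest headword beginning at text[start].

    Falls back to the single character at that position when no
    headword matches.
    """
    best = text[start:start + 1]
    # Try every candidate up to max_len characters and keep the longest hit.
    for end in range(start + 1, min(len(text), start + max_len) + 1):
        candidate = text[start:end]
        if candidate in HEADWORDS:
            best = candidate
    return best


sentence = (
    "もゆかない小娘が、しかも夜、ただひとり月の下で、"
    "無数の死骸の中にかくれ、いったい、何を働いているのか。"
)
print(longest_match(sentence, sentence.index("小")))
```

With this toy data, tapping on 小 should expand to 小娘 rather than stop at the first character, which is the outcome the report above expects.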

It's been like this since I first tried it. Am I the only one seeing this behaviour?

@cyphar

Aug 11 '22 13:08 rtega

I tried enabling only one dictionary and disabling all the others, to no avail: it exhibits the exact same behaviour.

Aug 11 '22 14:08 rtega

That's very strange, I have never seen this behaviour in all of my testing nor personal use of this plugin. Since 青空 is public domain, can you give me the exact EPUB that you had an issue with (since my understanding is that different 青空文庫 websites generate the EPUBs differently)?

Also which platform do you see this bug on? Do you see it in the emulator, or on Android, or on some other platform?

Aug 11 '22 21:08 cyphar

Sure. The dictionary I'm using is apparently too big to upload, so I can't send it to you.

I'm seeing this on Kobo and Android. Not in the emulator. I think I tried it with EPUBs I generated myself from Aozora and even with their HTML files.

Maybe I'm doing something incredibly wrong, but I can't see why it would work on half of the words and not on the other half.

problem.tar.gz

Aug 11 '22 22:08 rtega

Can I help debug this somehow? I looked at the logcat output while using the dictionary but I don't see anything suspicious.

Sep 12 '22 13:09 rtega

I updated to 2022.08 and still have the same issue. I converted the text to a plain txt file and the issue persists.

Sep 15 '22 14:09 rtega

Sorry, I've been on vacation and while I thought I'd have time to debug this, I didn't. I'll get back home in a few days and I'll try to reproduce and debug this issue then.

Sep 16 '22 02:09 cyphar

Thanks!

Sep 16 '22 06:09 rtega

I can't reproduce this in the emulator nor on my Android phone (trying to select the same text you mentioned in chapter 4). Are you sure that:

The words you expect to be auto-selected have headwords in the dictionary? On Android you can't input kanji, but you can look them up by their reading and confirm that they are present.
That you are not dragging the selection, changing the selected words? (On my tablet at least, visual feedback can be delayed, leading to strange effects.)

Have you had this issue on more than one Android device? I can try to reproduce this on my Android eReader later.

Sep 18 '22 01:09 cyphar

Affirmative to both questions.

> The words you expect to be auto-selected have headwords in the dictionary? On Android you can't input kanji, but you can look them up by their reading and confirm that they are present.

I can select the text manually and get the expected results from the dictionary.

> That you are not dragging the selection, changing the selected words? (On my tablet at least, visual feedback can be delayed, leading to strange effects.)

I thought that as well but no, it is entirely consistent: it always fails on the exact same words in the sentence.

I have the same issue on my Kobo device and two Android devices (an old Samsung S5 and my current Samsung S10+ on Android 12).

Maybe there is some seemingly unrelated setting that's influencing the behavior?

Sep 18 '22 11:09 rtega

Do you get the same behaviour in various kinds of documents, and not only in the one document you're always testing with? Make sure all the parts of the "word" you try/expect to be selected are in the same HTML node (select the text around it and use View HTML). Text selection may stop at an HTML node boundary. This can also happen with western text, e.g. when words are capitalized with spans: <span>The <span style="font-size: 110%">G</span>overnment of blah blah</span> makes it hard to select "Government".
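
The node-boundary check can be approximated outside KOReader. The sketch below is only an illustration using Python's standard `html.parser`, not how KOReader itself walks the document; `TextNodeCollector`, `text_nodes` and `in_single_node` are hypothetical helper names. It lists the text nodes of a fragment and tests whether a word sits entirely inside one of them.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect each run of text between tags as a separate node."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        self.nodes.append(data)


def text_nodes(markup: str) -> list:
    """Return the text nodes of an HTML fragment, in document order."""
    parser = TextNodeCollector()
    parser.feed(markup)
    parser.close()
    return parser.nodes


def in_single_node(markup: str, word: str) -> bool:
    """True when the whole word is contained in one text node."""
    return any(word in node for node in text_nodes(markup))


western = '<span>The <span style="font-size: 110%">G</span>overnment of blah blah</span>'
ruby = "<ruby><rb>年端</rb>〘<rt>としは</rt>〙</ruby>もゆかない小娘が"

print(text_nodes(western))
print(in_single_node(western, "Government"))
print(in_single_node(ruby, "小娘"))
```

Here "Government" is split across two text nodes, while 小娘 in the sentence from this report lies within a single node, so node boundaries alone would not explain a failed expansion for that word.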

Sep 18 '22 11:09 poire-z

Since you've mentioned Kobo, if the file has been mangled into a Kobo ePub, there may be extra spans screwing with this ;).

Sep 18 '22 16:09 NiLuJe

I checked on the node boundaries as well. That's not the problem either. I upload my files via ssh so this shouldn't be a problem.

Sep 18 '22 16:09 rtega

What do you see in the logs? When doing word selection expansion, the Japanese plugin outputs a debug log entry like "japanese.koplugin: attempted X expansions up to Y", which should tell you when the expansions stopped.

One thing I noticed is that <ruby><rb>年端</rb>〘<rt>としは</rt>〙</ruby> is a slightly weird usage of ruby tags, and I managed to get KOReader to search 年〘〙端 by manually selecting the text. However, the autoselection still worked (and many of the words you had issues with did not have furigana).

Sep 19 '22 01:09 cyphar

I don't see any messages like that in adb logcat. Do I have to enable something to see these?

Sep 19 '22 09:09 rtega

Ah, ok, I have to enable debug logging.

Sep 19 '22 09:09 rtega

japplugin-snip.log

You should start to look from time "09-19 13:54:13.652" on line 3580.

This is the log with debug logging and verbose debug logging enabled. I get exactly the same results with different dictionaries. I can confirm by manually selecting the text that the headwords are in my dictionaries. My dictionaries have been generated with stardict-tools.

Sep 19 '22 12:09 rtega

Finally getting somewhere. It looks as though disabled dictionaries are influencing the process. I see references to dictionaries that should be disabled when I'm looking up a word. I removed all the dictionaries that I had installed and only enabled the one I'm using and pronto: it works the way it should.

Sep 19 '22 12:09 rtega

It's getting even weirder: after removing all dictionaries and putting them back one by one, I thought I had found two that seemed to be causing issues. After removing them and putting them back (exact same files in the exact same location), everything now seems to be working fine...

In any case, when things go wrong I get lines like this in the log file:

09-19 13:54:06.631 31548 31666 W KOReader: JSON data cannot be decoded rocks/share/lua/5.1/json/decode/util.lua:35: unexpected character @ character: 419 0:419 ["] line:

Sep 19 '22 12:09 rtega

Random thought: is the plugin properly honoring the "enabled" list (and its order) of dictionaries?

(It's a sneaky enough feature that I can see someone forgetting to deal with it).

Sep 19 '22 18:09 NiLuJe

Nope, forget about it, it happens when sdcv segfaults ;) (signal 11 is SIGSEGV).

Which makes this... a sneaky duplicate of #9515 ;).

So, same answer: try with today's nightly?
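
To illustrate why a crashing helper surfaces as a JSON decoding warning: a process killed partway through writing leaves truncated output behind. The sketch below assumes the convention of Python's `subprocess` module, where a return code of -N means the child was killed by signal N; KOReader itself is Lua and reports this differently, and `classify_lookup` is a hypothetical helper, not part of KOReader or sdcv.

```python
import json
import signal


def classify_lookup(returncode: int, stdout: str) -> str:
    """Describe how a dictionary helper process ended.

    Assumes subprocess-style return codes: a negative value -N means the
    process was killed by signal N (POSIX only).
    """
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    try:
        json.loads(stdout)
    except json.JSONDecodeError as err:
        return f"exit {returncode}, undecodable JSON: {err.msg}"
    return f"exit {returncode}, valid JSON"


# A crash mid-write leaves truncated output; the signal explains why.
print(classify_lookup(-int(signal.SIGSEGV), '[{"dict": "example", "wo'))
print(classify_lookup(0, "[]"))
```

Checking the exit status before decoding separates "the helper crashed" from "the helper produced malformed output", which otherwise look identical in the log.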

Sep 19 '22 18:09 NiLuJe

I tried the latest nightly, which seems to fix this issue.

Sep 22 '22 14:09 rtega