10ten-ja-reader icon indicating copy to clipboard operation
10ten-ja-reader copied to clipboard

Long expressions

Open nicolasmaia opened this issue 5 years ago • 8 comments

For some reason, Rikai can't recognize the expression 先生と言われるほどの馬鹿でなし.

It's been in JMdict since last year, though: https://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&e=2070202

nicolasmaia avatar Jul 07 '20 04:07 nicolasmaia

Actually, ignore my previous comment. We have a hard-coded limit for how long a string we look up. It's currently set at 13 characters.

birtles avatar Jul 07 '20 04:07 birtles

Can we extend it?

nicolasmaia avatar Jul 07 '20 05:07 nicolasmaia

Sure. We should probably do some analysis on the database to see what the longest entries are.

There will be some performance impact too since we'll do a number of more lookups than we need to, but hopefully it's acceptable? It would be nice if we could do some performance comparison though.

birtles avatar Jul 07 '20 05:07 birtles

1       3193
2       58979
3       61481
4       75303
5       49891
6       39011
7       30063
8       21971
9       13759
10      8512
11      5391
12      3409
13      2072
14      1332
15      889
16      554
17      371
18      236
19      168
20      95
21      62
22      51
23      35
24      17
25      16
26      11
27      7
28      3
29      3
30      2
31      2
32      1
33      3
34      2
35      0
36      0
37      1

This is from the bundled dict.idx. Didn't find anything longer than 37 full width characters. The numbers include both kana-only and mixed kana-kanji entries.

SaltfishAmi avatar Jul 08 '20 14:07 SaltfishAmi

Thanks! That's super useful.

I'm concerned that once I (finally) switch to storing the data in IndexedDB the extra lookup times for each substring are going produce a pretty significant performance impact.

Perhaps we could extend the limit to 16 for now? 19 at most?

birtles avatar Jul 09 '20 00:07 birtles

Thanks! That's super useful.

I'm concerned that once I (finally) switch to storing the data in IndexedDB the extra lookup times for each substring are going produce a pretty significant performance impact.

Perhaps we could extend the limit to 16 for now? 19 at most?

Or maybe this should be adjustable in the options page?

SaltfishAmi avatar Jul 13 '20 05:07 SaltfishAmi

Thanks! That's super useful. I'm concerned that once I (finally) switch to storing the data in IndexedDB the extra lookup times for each substring are going produce a pretty significant performance impact. Perhaps we could extend the limit to 16 for now? 19 at most?

Or maybe this should be adjustable in the options page?

This!

Lebon14 avatar Aug 04 '20 02:08 Lebon14

I've updated the hard-coded limit to 16 for now in: https://github.com/birtles/rikaichamp/commit/91846e16fc22c197f5c5188be595e8ef3304b5e8

I try really hard to avoid adding options so I'd prefer to wait until I can profile the performance difference properly and, if there's no noticeable impact, just increase the limit to 37.

birtles avatar Aug 11 '20 00:08 birtles