10ten-ja-reader Long expressions

For some reason, Rikai can't recognize the expression 先生と言われるほどの馬鹿でなし.

It's been in JMdict since last year, though: https://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&e=2070202

Jul 07 '20 04:07 nicolasmaia

Actually, ignore my previous comment. We have a hard-coded limit for how long a string we look up. It's currently set at 13 characters.
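
As a rough illustration of why the expression is missed: it is 15 characters long, so with a 13-character cap a longest-match lookup never tries the full string. The sketch below is a simplification, not the extension's actual code. `findLongestMatch`, `MAX_LOOKUP_LENGTH`, and the `Set`-based dictionary are hypothetical names, and only exact matching is shown.

```typescript
// Hypothetical sketch of a capped longest-match lookup.
// Not the extension's real implementation.
const MAX_LOOKUP_LENGTH = 13; // the limit mentioned in this comment

function findLongestMatch(
  text: string,
  dictionary: ReadonlySet<string>,
  maxLength: number = MAX_LOOKUP_LENGTH
): string | null {
  // Split by code point so a surrogate pair counts as one character,
  // then discard everything beyond the cap.
  const chars = Array.from(text).slice(0, maxLength);

  // Try the longest candidate first, dropping one character at a time.
  for (let len = chars.length; len > 0; len--) {
    const candidate = chars.slice(0, len).join('');
    if (dictionary.has(candidate)) {
      return candidate;
    }
  }
  return null;
}

// Toy dictionary standing in for the real data.
const toyDictionary = new Set(['先生', '先生と言われるほどの馬鹿でなし']);
const expression = '先生と言われるほどの馬鹿でなし';

console.log(findLongestMatch(expression, toyDictionary, 13)); // falls back to the shorter entry
console.log(findLongestMatch(expression, toyDictionary, 16)); // finds the full expression
```

In this sketch, every extra character allowed by the cap adds at most one more dictionary lookup per text position.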

Jul 07 '20 04:07 birtles

Can we extend it?

Jul 07 '20 05:07 nicolasmaia

Sure. We should probably do some analysis on the database to see what the longest entries are.

There will be some performance impact too, since we'll do more lookups than we need to, but hopefully it's acceptable? It would be nice if we could do some performance comparison, though.

Jul 07 '20 05:07 birtles

Length  Entries
1       3193
2       58979
3       61481
4       75303
5       49891
6       39011
7       30063
8       21971
9       13759
10      8512
11      5391
12      3409
13      2072
14      1332
15      889
16      554
17      371
18      236
19      168
20      95
21      62
22      51
23      35
24      17
25      16
26      11
27      7
28      3
29      3
30      2
31      2
32      1
33      3
34      2
35      0
36      0
37      1

This is from the bundled dict.idx. Didn't find anything longer than 37 full width characters. The numbers include both kana-only and mixed kana-kanji entries.
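
A tally like the one above could be produced along these lines. This is only a sketch: how headwords are extracted from dict.idx is not shown in this thread, so `lengthHistogram` (a hypothetical name) simply takes an iterable of already-extracted headwords and counts them by length in code points.

```typescript
// Hypothetical helper: counts headwords by their length in code points.
// Reading and parsing dict.idx is assumed to happen elsewhere.
function lengthHistogram(headwords: Iterable<string>): Map<number, number> {
  const histogram = new Map<number, number>();
  for (const word of headwords) {
    // Array.from splits by code point, so each kana or kanji counts once.
    const length = Array.from(word).length;
    histogram.set(length, (histogram.get(length) ?? 0) + 1);
  }
  return histogram;
}

// Usage with a few sample headwords.
const sample = lengthHistogram(['猫', '先生', '学校', 'たべる']);
for (const [length, count] of Array.from(sample).sort((a, b) => a[0] - b[0])) {
  console.log(`${length}\t${count}`);
}
```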

Jul 08 '20 14:07 SaltfishAmi

Thanks! That's super useful.

I'm concerned that once I (finally) switch to storing the data in IndexedDB the extra lookup times for each substring are going to produce a pretty significant performance impact.

Perhaps we could extend the limit to 16 for now? 19 at most?
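
To put the candidate limits in perspective, the counts from the table earlier in the thread can be summed to see how many entries each limit leaves out of reach. This is a sketch using only those posted numbers; `COUNTS_BY_LENGTH` and `entriesLongerThan` are hypothetical names.

```typescript
// Entry counts by headword length, copied from the table posted above.
// Index 0 holds the count for length 1, index 36 the count for length 37.
const COUNTS_BY_LENGTH: readonly number[] = [
  3193, 58979, 61481, 75303, 49891, 39011, 30063, 21971, 13759, 8512,
  5391, 3409, 2072, 1332, 889, 554, 371, 236, 168, 95,
  62, 51, 35, 17, 16, 11, 7, 3, 3, 2,
  2, 1, 3, 2, 0, 0, 1,
];

// Number of entries whose headword is longer than `limit` characters.
function entriesLongerThan(limit: number): number {
  return COUNTS_BY_LENGTH.slice(limit).reduce((sum, n) => sum + n, 0);
}

const totalEntries = entriesLongerThan(0);
for (const limit of [13, 16, 19, 37]) {
  const missed = entriesLongerThan(limit);
  const percent = ((missed / totalEntries) * 100).toFixed(2);
  console.log(`limit ${limit}: ${missed} entries (${percent}%) out of reach`);
}
```

By this count, 3,861 of the 376,896 entries (about 1%) are longer than 13 characters. A limit of 16 leaves 1,086 out of reach, and a limit of 19 leaves 311.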

Jul 09 '20 00:07 birtles

> Thanks! That's super useful.
>
> I'm concerned that once I (finally) switch to storing the data in IndexedDB the extra lookup times for each substring are going to produce a pretty significant performance impact.
>
> Perhaps we could extend the limit to 16 for now? 19 at most?

Or maybe this should be adjustable in the options page?

Jul 13 '20 05:07 SaltfishAmi

> > Thanks! That's super useful. I'm concerned that once I (finally) switch to storing the data in IndexedDB the extra lookup times for each substring are going to produce a pretty significant performance impact. Perhaps we could extend the limit to 16 for now? 19 at most?
>
> Or maybe this should be adjustable in the options page?

This!

Aug 04 '20 02:08 Lebon14

I've updated the hard-coded limit to 16 for now in: https://github.com/birtles/rikaichamp/commit/91846e16fc22c197f5c5188be595e8ef3304b5e8

I try really hard to avoid adding options so I'd prefer to wait until I can profile the performance difference properly and, if there's no noticeable impact, just increase the limit to 37.

Aug 11 '20 00:08 birtles
