jmdict-simplified

Extract specific language

Open rbleuse opened this issue 3 years ago • 1 comments

Hi,

The JSON is converted from JMdict_e, but JMdict is also available in French, German, Russian and Dutch.

Is it possible to extract data for a specific language code? Something like passing a language code to the download task: ./gradlew download fr. If no language code is provided, the default would be English.

rbleuse, Jul 26 '20 18:07

Hello @rbleuse

Yes, that's possible, but it is quite a lot of work because of the sheer size of the multilingual JMdict file. (Languages other than English don't have separate files of their own.) Memory limitations are the primary concern: right now the conversion runs with 6 GB of RAM, and a bigger file would require splitting the processing into pieces, as I did for JMnedict.

But that would be a great feature, so I'll definitely look into that.

scriptin, Jul 26 '20 19:07
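To illustrate the piece-by-piece processing mentioned above, here is a minimal Python sketch of streaming an XML file one entry at a time and keeping only glosses in one language. It is not the project's converter. It assumes the JMdict layout of `<entry>` elements containing `<ent_seq>` and `<gloss>`, with `xml:lang` on non-English glosses and a missing attribute meaning "eng"; `iter_entries` and `SAMPLE` are made-up names, and the sketch has only been exercised on the tiny inline sample, not on the real file with its DTD entities.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_entries(source, lang):
    """Yield (ent_seq, [gloss text, ...]) for entries with glosses in `lang`.

    `source` is a file path or a binary file object with JMdict-style XML.
    A <gloss> without xml:lang is treated as English ("eng") here.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "entry":
            continue
        glosses = [
            g.text
            for g in elem.iter("gloss")
            if g.get(XML_LANG, "eng") == lang and g.text
        ]
        if glosses:
            yield elem.findtext("ent_seq"), glosses
        # Drop the finished entry's children so they do not pile up in memory.
        elem.clear()


# Tiny hypothetical sample, not real JMdict data.
SAMPLE = """<JMdict>
<entry><ent_seq>1</ent_seq><sense><gloss>cat</gloss></sense>
<sense><gloss xml:lang="fre">chat</gloss></sense></entry>
<entry><ent_seq>2</ent_seq><sense><gloss>dog</gloss></sense></entry>
</JMdict>"""

if __name__ == "__main__":
    for seq, glosses in iter_entries(io.BytesIO(SAMPLE.encode("utf-8")), "fre"):
        print(seq, glosses)
```

Because each entry is handled as soon as its closing tag is read, the whole document never has to be held as a fully populated tree, which is the property that matters for the memory concern.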

Hello @rbleuse! Is this still relevant? If so, I have a question about your use case:

Do you need versions for each language separately, or something like French+English, German+English, etc.? The reason I'm asking is that most languages have a fairly small number of entries translated into them. See the table below. Thus, it may be useful to have English as a default, included in every other version. Or maybe it's fine to have small language-specific versions.

Let me know what you think. I can do it either way, but I don't want to clutter releases with versions which nobody will use.

Language    # of entries
all         198680
eng         198680
ger         123519
rus          67379
hun          41803
dut          40964
spa          34110
fre          15307
swe          14562
slv           8757

scriptin, Dec 25 '22 19:12

I personally use your JSON to generate a Realm database for mobile, so either way would be fine for my use case. However, since the data files are large, the goal is to let users download only the dictionary sets relevant to their needs. So I split my Realm files, and I would encourage the same where the JSON is used by users directly.

I imagine a user who selects, say, Dutch may want to also download German rather than English.

aehlke, Jan 03 '23 15:01

This makes a lot of sense, @aehlke! I am only worried about the granularity: if you have, as in your example, Dutch+German as a single file, this may be easier for some users than a scenario where they need to download and import two files separately (one for Dutch, another for German).

I will do some testing and add language-specific builds in the near future.

scriptin, Jan 03 '23 16:01

What I meant by that comment is that I can see valuable use cases for letting users select individual language packs, rather than always including English by default or always bundling combinations.

aehlke, Jan 03 '23 17:01

All languages present in JMdict will be included in the next scheduled release as separate JSON files.

scriptin, Jan 08 '23 19:01
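For readers who want to do a similar split on their own data, the following Python sketch groups entries into one list per gloss language, keeping only the senses and glosses in that language. It is an illustration, not how the release files are built. It assumes entries shaped like `{"id": ..., "sense": [{"gloss": [{"lang": ..., "text": ...}]}]}`; those field names, along with `split_by_language` and `WORDS`, are assumptions made for the example.

```python
import json
from collections import defaultdict


def split_by_language(words):
    """Group entries by gloss language.

    Returns {lang: [entry, ...]}. Each entry keeps only the senses that
    have at least one gloss in that language, and only those glosses.
    """
    result = defaultdict(list)
    for word in words:
        per_lang = defaultdict(list)  # lang -> filtered senses of this word
        for sense in word.get("sense", []):
            by_lang = defaultdict(list)  # lang -> glosses of this sense
            for gloss in sense.get("gloss", []):
                by_lang[gloss["lang"]].append(gloss)
            for lang, glosses in by_lang.items():
                per_lang[lang].append({**sense, "gloss": glosses})
        for lang, senses in per_lang.items():
            result[lang].append({**word, "sense": senses})
    return dict(result)


# Tiny hypothetical sample, not real dictionary data.
WORDS = [
    {"id": "1", "sense": [{"gloss": [{"lang": "eng", "text": "cat"},
                                      {"lang": "dut", "text": "kat"}]}]},
    {"id": "2", "sense": [{"gloss": [{"lang": "eng", "text": "dog"}]}]},
]

if __name__ == "__main__":
    for lang, entries in sorted(split_by_language(WORDS).items()):
        # Print the per-language entry count and the JSON that would be saved.
        print(lang, len(entries), json.dumps(entries, ensure_ascii=False))
```

The length of each per-language list is the same kind of figure as the entry counts in the table earlier in the thread, so a split like this also shows how small some language-specific files would be.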