pyglossary

Read json produced by wiktextract

Open holyspiritomb opened this issue 2 years ago • 1 comments

**Is your feature request related to a problem? Please describe.**

wiktextract is a tool for extracting data from XML dumps of the English-language Wiktionary. It can filter the data so that the output is a machine-readable, one-directional (lang)-to-English bilingual dictionary in JSONL format, where each headword's entry contains data for all of its senses, all of its inflections, and example usage. Inflections also have their own entries, which state which inflection of which headword they are.
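As a rough illustration of the record layout described above, the sketch below parses two hand-written JSON lines modeled on wiktextract output. The field names (`word`, `senses`, `glosses`, `forms`, `form_of`) are my understanding of wiktextract's schema and should be checked against its documentation. The sample records are invented for illustration and are not taken from the linked files, and `summarize` is a hypothetical helper.

```python
import json

# Two hand-written sample lines (hypothetical data, not from the real dump).
SAMPLE = "\n".join([
    json.dumps({
        "word": "amo", "lang": "Latin", "pos": "verb",
        "senses": [{"glosses": ["to love"]}],
        "forms": [{"form": "amas", "tags": ["second-person", "singular"]}],
    }),
    json.dumps({
        "word": "amas", "lang": "Latin", "pos": "verb",
        "senses": [{
            "glosses": ["second-person singular present active indicative of amo"],
            "form_of": [{"word": "amo"}],
        }],
    }),
])


def summarize(line):
    """Return (headword, glosses, lemmas this entry is an inflection of)."""
    entry = json.loads(line)
    glosses = []
    lemmas = []
    for sense in entry.get("senses", []):
        glosses.extend(sense.get("glosses", []))
        # Assumption: inflection entries point to their headword via "form_of".
        lemmas.extend(ref["word"] for ref in sense.get("form_of", []))
    return entry["word"], glosses, lemmas


summaries = [summarize(line) for line in SAMPLE.splitlines()]
for summary in summaries:
    print(summary)
```

The first record is a headword entry with its own list of forms; the second is an inflection entry that points back to the headword.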

Currently, pyglossary doesn't support reading these files.

**Describe the solution you'd like**

I'd like pyglossary to support reading these specific JSONL files for conversion to other formats and, if possible, to offer an option to list inflections in headwords' definitions in the output.
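To make the requested option concrete, here is a minimal sketch of how a definition could be rendered with or without the inflection list. This is not pyglossary code: `render_definition` and its `list_inflections` flag are hypothetical names, the entry is made-up sample data, and the `forms`/`senses`/`glosses` fields are assumed from wiktextract's schema.

```python
from html import escape


def render_definition(entry, list_inflections=False):
    """Build an HTML definition from one wiktextract-style entry (a dict)."""
    glosses = [
        gloss
        for sense in entry.get("senses", [])
        for gloss in sense.get("glosses", [])
    ]
    html = "<ol>" + "".join(f"<li>{escape(g)}</li>" for g in glosses) + "</ol>"
    if list_inflections:
        # Assumption: each item in "forms" carries the inflected word in "form".
        forms = [f["form"] for f in entry.get("forms", []) if "form" in f]
        if forms:
            html += "<p>Inflections: " + escape(", ".join(forms)) + "</p>"
    return html


# Hypothetical sample entry, for illustration only.
entry = {
    "word": "amo",
    "senses": [{"glosses": ["to love"]}],
    "forms": [{"form": "amas"}, {"form": "amat"}],
}

plain = render_definition(entry)
with_forms = render_definition(entry, list_inflections=True)
print(plain)
print(with_forms)
```

With the flag off, only the glosses are emitted; with it on, the forms are appended as a single extra paragraph.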

**Provide links to the official website and/or download page of the related software or format.**

- wiktextract project page
- index of machine-readable dictionaries created with wiktextract

**Provide sample file(s) for the format/feature you want to be supported**

Full-sized file I'd ultimately like to convert to stardict and mobi: Latin-to-English (~860MB)

Smaller versions of the above file for testing: Latin-to-English 1.1MB, Latin-to-English first ten entries

holyspiritomb, Mar 31 '22 18:03

I'd like to mention that calling this file "a JSON file" is a mistake. Every line is a valid JSON document, but the file as a whole is not valid JSON. We should call it "JSON Lines" or something like that.
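A small sketch of that distinction, using Python's standard `json` module and a made-up two-line sample: parsing the whole text as one JSON document fails, while parsing it line by line works.

```python
import json

# Made-up two-line sample in the JSON Lines style (one object per line).
text = '{"word": "amo"}\n{"word": "amas"}\n'

# Treating the whole file as a single JSON document fails,
# because a second object follows the first one.
try:
    json.loads(text)
    whole_file_is_json = True
except json.JSONDecodeError:
    whole_file_is_json = False

# Treating each non-empty line as its own JSON document works.
entries = [json.loads(line) for line in text.splitlines() if line.strip()]

print(whole_file_is_json)
print([entry["word"] for entry in entries])
```

Reading line by line also means a large file never has to be loaded into memory in one piece.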

ilius, Apr 03 '22 06:04

I pushed this to the master branch. Please give it a try.

ilius, Mar 17 '23 17:03

Please open new issues for further bugs / improvements.

ilius, Mar 19 '23 12:03