Add preliminary support for Chinese Wiktionary dump file

Open xxyzz opened this issue 1 year ago • 1 comments

I added a new command-line option, dump_file_language_code, to initialize subtitles. I didn't derive the code from the dump file name because some file names don't use the language code; for example, "zh_min_nanwiktionary" doesn't use its code "nan". Subtitle text data is stored in JSON files in the data folder, loaded according to the language code, and then saved in the WiktionaryConfig object.
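As a rough illustration of the loading step described above, here is a minimal sketch. The directory layout (`<data_dir>/<code>/subtitles.json`), the `load_subtitles` helper, and the `Config` class are all hypothetical stand-ins, not the actual wiktextract code or the real WiktionaryConfig.

```python
import json
from pathlib import Path
from typing import Dict


def load_subtitles(data_dir: Path, lang_code: str) -> Dict[str, str]:
    """Load subtitle strings for one Wiktionary edition.

    Assumed (hypothetical) layout: <data_dir>/<lang_code>/subtitles.json
    """
    path = data_dir / lang_code / "subtitles.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


class Config:
    """Hypothetical stand-in for WiktionaryConfig."""

    def __init__(self, data_dir: Path, dump_file_lang_code: str = "en") -> None:
        # The code is passed explicitly instead of being parsed from the
        # dump file name, since names like "zh_min_nanwiktionary" do not
        # contain the language code "nan".
        self.dump_file_lang_code = dump_file_lang_code
        self.subtitles = load_subtitles(data_dir, dump_file_lang_code)
```

Under this layout, supporting another edition would only require adding a JSON file for its language code.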

I changed the --language option to accept a language code instead of a language name, because different dump files use different language names for the same language code.

I also fixed some errors and removed some unused imports and variables.

Some code is trickier to change, like this one: https://github.com/tatuylonen/wiktextract/blob/c844146702094265d1d5deb25883234e1ac2a61e/wiktextract/inflectiondata.py#L5835-L5838

The function check_v is not called from inside any other function, so there is no caller that could pass the config object to it.

Some code also doesn't seem to work as intended. For example, this line only checks whether "noun" is in the PARTS_OF_SPEECH set:

https://github.com/tatuylonen/wiktextract/blob/c844146702094265d1d5deb25883234e1ac2a61e/wiktextract/inflection.py#L926-L935

I left both of these pieces of code unchanged.

Wiktionary dump files for other languages can also be parsed by creating similar JSON files. Resolves #92.

xxyzz avatar Sep 11 '22 23:09 xxyzz

This pull request requires https://github.com/tatuylonen/wikitextprocessor/pull/13

xxyzz avatar Sep 16 '22 04:09 xxyzz

I merged the pull request. Thank you for your great contributions!

One note: please don't use the tuple[str] type hint syntax quite yet. Many systems still in use run somewhat older Python versions that don't support it. This applies to me too: my desktop has 3.8 and offers no upgrade option without also upgrading the OS, which I'm unwilling to do right now.
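For reference, a minimal sketch of a hint that also runs on Python 3.8. Subscripting the built-in `tuple` in an evaluated annotation requires Python 3.9 or newer, while `typing.Tuple` works on older versions. The function `split_title` is a made-up example, not code from this repository.

```python
from typing import Tuple


def split_title(title: str) -> Tuple[str, ...]:
    """Split a page title on ':' into its parts.

    Tuple[str, ...] means "a tuple of any length containing str".
    Note that tuple[str] (or Tuple[str]) means a 1-tuple specifically.
    """
    return tuple(title.split(":"))
```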

tatuylonen avatar Oct 05 '22 18:10 tatuylonen

A few features in the "get_languages.py" file also require newer Python: match, str.removeprefix, and the newer type hint syntax.
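If compatibility with 3.8 matters, these features have straightforward substitutes. The sketch below is only an illustration: `remove_prefix` and `describe_namespace` are hypothetical helpers, and the namespace values are invented for the example rather than taken from get_languages.py.

```python
def remove_prefix(text: str, prefix: str) -> str:
    """Fallback for str.removeprefix, which was added in Python 3.9."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


def describe_namespace(name: str) -> str:
    """An if/elif chain in place of a match statement (Python 3.10+)."""
    if name == "Category":
        return "category page"
    elif name == "Template":
        return "template page"
    else:
        return "other page"
```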

Just a friendly reminder: both 3.8 and 3.9 have reached the end of full support and now receive only source-only security fixes (https://en.wikipedia.org/wiki/History_of_Python#Table_of_versions).

xxyzz avatar Oct 06 '22 00:10 xxyzz