wiktextract Support for zh-pron pronunciations

Open Manishearth opened this issue 2 years ago • 7 comments

I intend to work on this myself, though help would be appreciated (particularly with https://github.com/tatuylonen/wiktextract/issues/118).

Chinese entries use Template:zh-pron, which lists pronunciations for various varieties, with an output like this:

Different varieties use different romanization schemes, often multiple. I want to be able to somehow include this in the pronunciation output.

I'm actually not sure whether this should go under "sounds" or be handled as a separate row, given the complexity. The IPA information can be extracted by running the Lua modules, but I'm not sure it can be re-associated with the appropriate variety name.

For my purposes, having this be a separate table (ideally with the option to run the Lua code and get the additional expanded pronunciation) would be great.

Feb 25 '22 08:02 Manishearth

This was implemented in May by @yoskari and should be included in the current data. I haven't tested it myself.

Jun 19 '22 18:06 tatuylonen

I just downloaded the "All Chinese non-inflected, non-alternative word senses" JSON from here. Although it's almost double the size of a rip I have from 2 weeks ago, some words still seem to be missing pronunciation info.

For example the word 垃圾. In the JSON, I can see that there's valid pronunciation info for Hakka and Min Nan, but all other pronunciation info is missing. Ideally, at least Mandarin would be provided.
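For anyone who wants to run the same check, here is a minimal sketch in Python. It assumes the download is JSON Lines (one entry per line) and that each entry carries a "word" key and a "sounds" list whose items may have a "zh-pron" string and a "tags" list naming the variety; treat that layout as an assumption about the dump. The helper names and the sample values are made up for illustration and are not real pronunciations.

```python
import json
from typing import Iterable


def zh_pron_for_word(lines: Iterable[str], word: str) -> list[tuple[tuple[str, ...], str]]:
    """Collect (tags, zh-pron) pairs for every entry whose "word" matches.

    Hypothetical helper; the "sounds" / "zh-pron" / "tags" layout is an
    assumption about the dump format.
    """
    found = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if entry.get("word") != word:
            continue
        for sound in entry.get("sounds", []):
            pron = sound.get("zh-pron")
            if pron is not None:
                found.append((tuple(sound.get("tags", [])), pron))
    return found


def missing_varieties(prons, expected=("Mandarin",)):
    """Return the expected variety names that appear in none of the tag lists."""
    seen = {tag for tags, _ in prons for tag in tags}
    return [variety for variety in expected if variety not in seen]


# Made-up sample mimicking the symptom: only two varieties are present.
sample = [
    json.dumps({"word": "垃圾", "sounds": [
        {"zh-pron": "placeholder-hakka", "tags": ["Hakka"]},
        {"zh-pron": "placeholder-min-nan", "tags": ["Min-Nan"]},
    ]}),
    json.dumps({"word": "other", "sounds": [
        {"zh-pron": "placeholder-mandarin", "tags": ["Mandarin"]},
    ]}),
]

prons = zh_pron_for_word(sample, "垃圾")
gaps = missing_varieties(prons)
```

To run it against a real download, pass an open file instead of `sample`, e.g. `with open(path, encoding="utf-8") as f: zh_pron_for_word(f, "垃圾")`, where `path` is wherever you saved the JSON.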

I could be downloading the wrong JSON, so let me know.

Jun 28 '22 07:06 seth-js

Good catch, Oskari is taking a look at it, and I think the issue is pretty clearly with the "Pronunciation 1" and "Pronunciation 2" pseudo-etymology blocks.

For 垃圾, what's missing is the first block of pronunciations, at the top of the article under Pronunciation 1, while the second block, under Pronunciation 2, seems to come through fine.

Jun 28 '22 11:06 kristian-clausal

The issue seems to be that, when there are multiple pronunciation tables, any previous tables get overwritten by the last one, despite each being handled properly on its own. It is as of now unclear to me why this happens.
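To make the symptom concrete, here is a toy illustration in Python. It is not the wiktextract code and says nothing about the actual cause; the function names and the data are hypothetical. It only shows how assigning each table's results to the same key keeps just the last table, while extending a shared list keeps all of them.

```python
def collect_overwriting(tables):
    """Each table replaces whatever was stored before (the observed symptom)."""
    data = {}
    for table in tables:
        data["sounds"] = list(table)  # assignment discards earlier tables
    return data.get("sounds", [])


def collect_accumulating(tables):
    """Each table is appended to the results gathered so far."""
    data = {}
    for table in tables:
        data.setdefault("sounds", []).extend(table)
    return data.get("sounds", [])


# Hypothetical stand-ins for two pronunciation tables on one page.
tables = [["Mandarin", "Cantonese"], ["Hakka", "Min Nan"]]

overwritten = collect_overwriting(tables)
accumulated = collect_accumulating(tables)
```

With the overwriting version only the second table's varieties survive, which matches what the dump showed for 垃圾.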

Jun 28 '22 16:06 yoskari

@seth-js The bug should now be fixed.

Jun 29 '22 10:06 yoskari

I assume I'll have to wait until the next dump to verify, but thanks for fixing this.

Jun 29 '22 21:06 seth-js

I can now see the missing pronunciation entries in the latest dump.

Jul 10 '22 05:07 seth-js
