
Support for zh-pron pronunciations

Open Manishearth opened this issue 2 years ago • 7 comments

I intend to work on this myself, though help would be appreciated (particularly with https://github.com/tatuylonen/wiktextract/issues/118).

Chinese entries use Template:zh-pron, which lists pronunciations for various varieties, with an output like this:

[image: rendered Template:zh-pron pronunciation box]

Different varieties use different romanization schemes, often multiple. I want to be able to somehow include this in the pronunciation output.

I'm actually not sure if this should go under "sounds", or if it should be handled as a separate row given the complexity. The IPA information can be extracted by running the Lua modules, but I'm not sure if it can be re-associated with the appropriate variety name.

For my purposes, having this be a separate table (ideally with the option to run the Lua code and get the additional expanded pronunciation) would be great.

Manishearth avatar Feb 25 '22 08:02 Manishearth

This was implemented in May by @yoskari and should be included in the current data. I haven't tested it myself.

tatuylonen avatar Jun 19 '22 18:06 tatuylonen

I just downloaded the "All Chinese non-inflected, non-alternative word senses" JSON from here. Although it's almost double the size of a dump I grabbed 2 weeks ago, some words still seem to be missing pronunciation info.

For example, take the word 垃圾. In the JSON, I can see that there's valid pronunciation info for Hakka and Min Nan, but all other pronunciation info is missing. Ideally, at least Mandarin would be provided.
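A check like this can be scripted. The following is a minimal sketch, assuming the download is one JSON object per line, that each entry has a "word" key and a "sounds" list, and that zh-pron data appears in a sound item under a "zh-pron" key with the variety and romanization names in "tags". The function name and the file name in the usage note are hypothetical.

```python
import json
from collections import defaultdict


def pronunciations_by_variety(lines, word):
    """Collect zh-pron values for `word`, grouped by each sound's tags.

    `lines` is any iterable of JSON strings, one entry per line
    (assumed layout of the downloaded file).
    """
    found = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if entry.get("word") != word:
            continue
        for sound in entry.get("sounds", []):
            pron = sound.get("zh-pron")  # assumed key for zh-pron data
            if pron is None:
                continue  # e.g. an IPA-only or audio-only item
            # Tags are assumed to name the variety and romanization.
            found[tuple(sound.get("tags", []))].append(pron)
    return dict(found)
```

To use it, open the file with `encoding="utf-8"` and pass the file object, for example `pronunciations_by_variety(open("chinese-entries.json", encoding="utf-8"), "垃圾")`. A variety that is missing from the returned keys is missing from the data for that word.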

I could be downloading the wrong JSON, so let me know.

seth-js avatar Jun 28 '22 07:06 seth-js

Good catch, Oskari is taking a look at it, and I think the issue is pretty clearly with the "Pronunciation 1" and "Pronunciation 2" pseudo-etymology blocks.

For 垃圾, what's missing is the first block at the top of the article, in Pronunciation 1, while the second block of pronunciations, from Pronunciation 2, seems to be fine.

kristian-clausal avatar Jun 28 '22 11:06 kristian-clausal

The issue seems to be that, with multiple pronunciation tables, any previous tables get overwritten by the last one, despite each being handled properly on its own. As of now, it is unclear to me why this happens.
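To illustrate the symptom only: the sketch below is hypothetical and is not wiktextract's code, and it does not claim to show the actual cause. It contrasts a collector that assigns each table's results to the same key with one that accumulates them. Both function names and the list-of-lists input are assumptions made for the example.

```python
def collect_overwriting(tables):
    """Each table replaces whatever was stored before (the symptom)."""
    data = {}
    for table in tables:
        data["sounds"] = list(table)  # earlier tables are lost here
    return data


def collect_accumulating(tables):
    """Each table is appended to what earlier tables produced."""
    data = {}
    for table in tables:
        data.setdefault("sounds", []).extend(table)
    return data
```

With two tables, the first function keeps only the items of the second table, which matches what was observed for 垃圾, while the second function keeps the items of both.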

yoskari avatar Jun 28 '22 16:06 yoskari

@seth-js The bug should now be fixed.

yoskari avatar Jun 29 '22 10:06 yoskari

I assume I'll have to wait until the next dump to verify, but thanks for fixing this.

seth-js avatar Jun 29 '22 21:06 seth-js

I can now see the missing pronunciation entries in the latest dump.

seth-js avatar Jul 10 '22 05:07 seth-js