wikipron
Massively multilingual pronunciation mining
Do you think there's a reasonable way to make an enhancement that will extract audio file URLs for Wiktionary words? At least for Armenian, the audio files are linked in...
Currently we are using the ISO 639-2 "bibliographic" codes ("ger" for German). It seems to me that these are not terribly widely used and make compatibility with other multilingual resources poorer...
* A script for computing KPI numbers (languages, dialects, scripts, and number of prons) should be incorporated into the big scrape workflow. @kylebgorman's draft is [here](https://gist.github.com/kylebgorman/a32fd2c91c862cd508de9b14fbba80dd).
* @jacksonllee proposes that...
The German .tsv files only have around 35,000 transcriptions. However, there are certainly more than 600,000 IPA transcriptions in the German Wiktionary. I recently obtained ~670,000 IPA transcriptions with...
Wiktionary has entries for several languages and dialects with unofficial codes we can't scrape. Some examples of these include:
* [Central Franconian](https://en.wiktionary.org/wiki/Category:Central_Franconian_language): `gmw-cfr`
* [Old Galician/Portuguese](https://en.wiktionary.org/wiki/Category:Old_Portuguese_language): `roa-opt`
* [Westrobothnian](https://en.wiktionary.org/wiki/Category:Westrobothnian_language): `gmq-bot`...
Persian, nonstandardly, [uses ~ to separate variants](https://github.com/kylebgorman/wikipron/blob/master/data/tsv/per_phonemic.tsv#L273). Fix these upstream, and then rescrape.
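A rough way to list the affected rows before fixing them upstream. This is only a minimal sketch, assuming the TSV has the word in the first column and the pronunciation in the second, and that the separator is the ASCII tilde (U+007E), which is distinct from the combining tilde (U+0303) used for nasalization. The function name and the sample rows are hypothetical, not taken from the actual data.

```python
import csv
import io


def rows_with_tilde_variants(tsv_text, separator="~"):
    """Yield (line_number, word, pron) for rows whose pron contains the separator.

    Assumes two tab-separated columns: word, then pronunciation.
    """
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for line_number, row in enumerate(reader, start=1):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        word, pron = row[0], row[1]
        if separator in pron:
            yield line_number, word, pron


# Hypothetical sample rows, for illustration only.
sample = "foo\tf u\nbar\tb a r ~ b æ r\n"
for hit in rows_with_tilde_variants(sample):
    print(hit)
```

For a real file, the text could be read with `open(path, encoding="utf-8").read()` and passed to the same function.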
There are Armenian entries which show optional material in parentheses. For example, any initial sibilant-stop [cluster](https://en.wiktionary.org/wiki/%D5%BD%D5%BF%D5%A1%D5%B6%D5%A1%D5%AC) gets an obligatory schwa in Western Armenian, but an optional one in Eastern Armenian:...
Several languages use ZERO WIDTH SPACE and ZERO WIDTH NON-JOINER, which, as the names suggest, are invisible formatting characters rather than visible symbols. Let's look into why and see whether that's a bug upstream...
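To see how widespread this is, something like the following could be run over each scraped file. It is a minimal sketch that only counts U+200B and U+200C in a string; the function name is hypothetical, and reading the files is left out.

```python
import unicodedata
from collections import Counter

# The two code points mentioned above.
ZERO_WIDTH = {"\u200b", "\u200c"}


def count_zero_width(text):
    """Return a dict mapping Unicode character name to occurrence count."""
    counts = Counter(ch for ch in text if ch in ZERO_WIDTH)
    return {unicodedata.name(ch): n for ch, n in counts.items()}


print(count_zero_width("a\u200bb\u200c\u200cc"))
```

Running it per language would show which files contain these characters and how often, which should help decide whether the source is an upstream entry or the scrape itself.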
As [reported here](https://github.com/sigmorphon/2020/issues/9), there are some inconsistencies with /l/ and the dental stops. As [discussed here](https://en.wiktionary.org/wiki/Wiktionary:Information_desk/2020/April#Performing_bulk_edits), there is a pronunciation module and pron template for Bulgarian on Wiktionary; we may...
Supporting [Min Nan](https://en.wiktionary.org/wiki/Category:Min_Nan_terms_with_IPA_pronunciation) requires writing a custom extractor.