wikipron

Massively multilingual pronunciation mining

Results 26 wikipron issues

Do you think there's a reasonable way to make an enhancement that will extract audio file URLs for Wiktionary words? At least for Armenian, the audio files are linked in...

enhancement

Currently we are using the ISO 639-2 "bibliographic" codes ("ger" for German). It seems to me that these are not terribly widely used and make compatibility with other multilingual resources poorer...

enhancement
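
To make the code question concrete, here is a minimal sketch of translating bibliographic codes to another scheme. The excerpt above is truncated and does not say which scheme is proposed, so the ISO 639-3 target, the table name `BIBLIOGRAPHIC_TO_ISO639_3`, and the helper `to_iso639_3` are all assumptions for illustration, not part of wikipron.

```python
# Hypothetical lookup table (not from wikipron): a few ISO 639-2
# "bibliographic" codes and their ISO 639-3 counterparts.
BIBLIOGRAPHIC_TO_ISO639_3 = {
    "ger": "deu",  # German
    "fre": "fra",  # French
    "per": "fas",  # Persian
    "arm": "hye",  # Armenian
    "gre": "ell",  # Modern Greek
}


def to_iso639_3(code: str) -> str:
    """Return the mapped code, or the input unchanged if it has no entry.

    Many languages (e.g. "eng") use the same three letters in both
    schemes, so falling back to the input is the assumed default here.
    """
    return BIBLIOGRAPHIC_TO_ISO639_3.get(code, code)


print(to_iso639_3("ger"))
print(to_iso639_3("eng"))
```

Only codes that differ between the two schemes need an entry in such a table; everything else passes through unchanged.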

* A script for computing KPI numbers (languages, dialects, scripts, and number of prons) should be incorporated into the big scrape workflow. @kylebgorman's draft is [here](https://gist.github.com/kylebgorman/a32fd2c91c862cd508de9b14fbba80dd).
* @jacksonllee proposes that...

documentation
enhancement
good first issue

The German .tsv files only have around 35,000 transcriptions. However, there are certainly more than 600,000 IPA transcriptions in the German Wiktionary. I recently obtained ~670,000 IPA transcriptions with...

bug
language support

Wiktionary has entries for several languages and dialects with unofficial codes we can't scrape. Some examples of these include

* [Central Franconian](https://en.wiktionary.org/wiki/Category:Central_Franconian_language): `gmw-cfr`
* [Old Galician/Portuguese](https://en.wiktionary.org/wiki/Category:Old_Portuguese_language): `roa-opt`
* [Westrobothnian](https://en.wiktionary.org/wiki/Category:Westrobothnian_language): `gmq-bot`...

enhancement
language support

Persian, nonstandardly, [uses ~ to separate variants](https://github.com/kylebgorman/wikipron/blob/master/data/tsv/per_phonemic.tsv#L273). Fix these upstream, and then rescrape.

good first issue
language support
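
Until the entries are fixed upstream, one could in principle split such fields locally. The following is only a sketch under that assumption: the helper `split_variants` and the sample string are made up for illustration and are not taken from wikipron or from the Persian data.

```python
def split_variants(pron: str) -> list[str]:
    """Split a pronunciation field on the ASCII tilde (U+007E).

    The combining tilde used for nasalization (U+0303) is a different
    code point, so it is not affected by this split.
    """
    return [variant.strip() for variant in pron.split("~") if variant.strip()]


# Made-up example field with two variants separated by "~".
print(split_variants("a b ~ c d"))
```

A field without a tilde comes back as a single-element list, so the helper can be applied to every row uniformly.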

There are Armenian entries which show optional material in parentheses. For example, any initial sibilant-stop [cluster](https://en.wiktionary.org/wiki/%D5%BD%D5%BF%D5%A1%D5%B6%D5%A1%D5%AC) gets an obligatory schwa in Western Armenian, but an optional one in Eastern Armenian:...

enhancement
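
One possible way to handle parenthesized optional material is to expand each transcription into every variant with and without it. The sketch below assumes non-nested parentheses; the function `expand_optional` and the sample strings are hypothetical and are not wikipron's actual behavior.

```python
import re
from itertools import product

# Matches one non-nested parenthesized group and captures its contents.
OPTIONAL = re.compile(r"\(([^()]*)\)")


def expand_optional(pron: str) -> list[str]:
    """Return all variants of pron with each optional group kept or dropped."""
    # With one capture group, re.split alternates between literal text
    # (even indices) and optional material (odd indices).
    parts = OPTIONAL.split(pron)
    optional_count = len(parts) // 2
    variants = []
    for keep in product((True, False), repeat=optional_count):
        pieces = [
            part
            for i, part in enumerate(parts)
            if i % 2 == 0 or keep[i // 2]
        ]
        variants.append("".join(pieces))
    return variants


# Made-up transcription with one optional segment.
print(expand_optional("(ə)sta"))
```

A transcription with n optional groups yields 2^n variants, so a real extractor might prefer to keep only the fullest or the shortest form instead.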

Several languages use ZERO WIDTH SPACE and ZERO WIDTH NON-JOINER, which, as the names suggest, aren't real characters. Let's look into why and see whether that's a bug upstream...

enhancement
good first issue
language support
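
As a starting point for that investigation, a small check like the following could locate the two code points in scraped strings. It is a sketch only: the helper `find_zero_width` and the sample string are assumptions, not part of wikipron.

```python
# The two code points named in the issue, keyed by character.
ZERO_WIDTH = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
}


def find_zero_width(text: str) -> list[tuple[int, str]]:
    """Return (index, Unicode name) for each zero-width character in text."""
    return [(i, ZERO_WIDTH[ch]) for i, ch in enumerate(text) if ch in ZERO_WIDTH]


# Made-up sample containing one of each character.
sample = "ab\u200bcd\u200cef"
for index, name in find_zero_width(sample):
    print(index, name)
```

Running this over each column of a scraped file would show whether the characters come from the headword, the transcription, or both.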

As [reported here](https://github.com/sigmorphon/2020/issues/9), there are some inconsistencies with /l/ and the dental stops. As [discussed here](https://en.wiktionary.org/wiki/Wiktionary:Information_desk/2020/April#Performing_bulk_edits), there is a pronunciation module and pron template for Bulgarian on Wiktionary; we may...

language support

Supporting [Min Nan](https://en.wiktionary.org/wiki/Category:Min_Nan_terms_with_IPA_pronunciation) requires writing a custom extractor.

language support