wikipron
Massively multilingual pronunciation mining
I noticed this problem for [Armenian](https://en.wiktionary.org/wiki/%D5%A5%D6%80%D5%AF%D6%80%D5%B8%D6%80%D5%A4) and a colleague told me it's also found in [Portuguese](https://en.wiktionary.org/wiki/afetar). For some languages, the pronunciation entry can use a nested list, such that *...
In Wiktionary, German includes Swiss German and Germany German as its dialects. These "dialects" are each labeled with `(Standard German of Germany)` and `(Standard German of Switzerland)` or `Swiss German`....
- [x] Updated `Unreleased` in `CHANGELOG.md` to reflect the changes in code or data. `ger` is split into Swiss German and Germany German. The duplicate `Nep` entry is removed from `languages.json`.
In the Slovenian data, some of the vowels with tone marking (e.g. /é/) are transcribed using the precomposed characters (so here, [U+00E9](https://www.compart.com/en/unicode/U+00E9) instead of the sequence [U+0065](https://www.compart.com/en/unicode/U+0065) [U+0301](https://www.compart.com/en/unicode/U+0301)). The module...
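To make the Slovenian issue above concrete, here is a minimal sketch of the difference between the precomposed character and the base-plus-combining sequence, using only Python's standard `unicodedata` module. The helper names `to_nfd` and `to_nfc` are illustrative and are not part of WikiPron; which normalization form the project should adopt is left open here.

```python
import unicodedata

# Two encodings of the same visible character "é".
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def to_nfd(text: str) -> str:
    """Hypothetical helper: canonical decomposition (base + combining marks)."""
    return unicodedata.normalize("NFD", text)


def to_nfc(text: str) -> str:
    """Hypothetical helper: canonical composition (precomposed where possible)."""
    return unicodedata.normalize("NFC", text)


# The raw strings compare unequal even though they render identically.
strings_differ = precomposed != decomposed

# After normalizing both to the same form, they compare equal.
equal_after_nfd = to_nfd(precomposed) == to_nfd(decomposed)
```

Comparing or segmenting transcriptions without first normalizing to one form can treat the two encodings as different symbols, which is the kind of inconsistency described above.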
Though Lithuanian is generally said to have a relatively shallow orthography, there are some apparent inconsistencies in how _ie_ is transcribed, as well as issues in the use of dental...
As of at least #509 the custom selector for Latin has been broken. Latin has a [custom selector](src/wikipron/extract/lat.py) because the headwords lack macrons. Now the Romans of course didn't use...
The last big scrape was completed in March 2022. This is a tracking bug for a fall 2023 big scrape, which I am assigning to myself. Modulo issues discussed in...
We have no effective documentation for the [covering grammars](https://github.com/CUNY-CL/wikipron/tree/master/data/covering_grammar) data library.

* We should probably add a short description to the [data README](https://github.com/CUNY-CL/wikipron/blob/master/data/README.md).
* We should give the exact instructions...
The command line lets the user choose whether to apply casefolding, so that an entry like `English` is either kept as `English` or changed to `english`. But for the scraped data on the repo,...
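As an illustration of what optional casefolding does to a headword, here is a minimal sketch built on Python's built-in `str.casefold`. The function `fold_headword` and its `casefold` parameter are hypothetical names chosen for this example. They are not WikiPron's actual API, and the sketch does not claim to reproduce how the command-line option is implemented.

```python
def fold_headword(word: str, casefold: bool) -> str:
    """Hypothetical helper: return the headword casefolded, or unchanged."""
    return word.casefold() if casefold else word


# With casefolding off, the entry is left as scraped.
kept = fold_headword("English", casefold=False)

# With casefolding on, the entry is folded.
folded = fold_headword("English", casefold=True)

# Casefolding is more aggressive than lowercasing for some characters.
eszett_folded = fold_headword("Straße", casefold=True)
```

Note that `str.casefold` is not the same as `str.lower`: for example, it maps `ß` to `ss`, so whether folding is applied can change headwords beyond their capitalization.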