pronouncingpy
extra non-phones in phones for several words
A handful of words have extra non-phone content in their pronunciations, e.g. [(k, v) for k, v in pr.pronunciations if '#' in v] evaluates to...
[("d'artagnan", 'D AH0 R T AE1 NG Y AH0 N # foreign french'),
('danglar', 'D AH0 NG L AA1 R # foreign french'),
('danglars', 'D AH0 NG L AA1 R Z # foreign french'),
('gdp', 'G IY1 D IY1 P IY1 # abbrev'),
('hiv', 'EY1 CH AY1 V IY1 # abbrev'),
('porthos', 'P AO0 R T AO1 S # foreign french'),
('spieth', 'S P IY1 TH # name'),
('spieth', 'S P AY1 AH0 TH # old')]
This should obviously not be the case! There may be other instances like this—I haven't had time to check. I imagine it's a problem with the upstream module providing the pronunciations.
I see @davidlday has fixed cmudict in https://github.com/prosegrinder/python-cmudict/pull/14, released in the latest cmudict 0.4.3. Now, calling cmudict.entries() returns the data without comments. 👍
However, the problem remains in Pronouncing because it's loading the data file directly and doesn't post-process out the comments.
Here's a quick hack to fix Pronouncing:
diff --git a/pronouncing/__init__.py b/pronouncing/__init__.py
index b5f8d0e..20ceddb 100755
--- a/pronouncing/__init__.py
+++ b/pronouncing/__init__.py
@@ -47,15 +47,14 @@ def init_cmu(filehandle=None):
     """
     global pronunciations, lookup, rhyme_lookup
     if pronunciations is None:
-        if filehandle is None:
-            filehandle = cmudict.dict_stream()
-        pronunciations = parse_cmu(filehandle)
-        filehandle.close()
+        pronunciations = cmudict.entries()
         lookup = collections.defaultdict(list)
         for word, phones in pronunciations:
+            phones = " ".join(phones)
             lookup[word].append(phones)
         rhyme_lookup = collections.defaultdict(list)
         for word, phones in pronunciations:
+            phones = " ".join(phones)
             rp = rhyming_part(phones)
             if rp is not None:
                 rhyme_lookup[rp].append(word)
A couple of issues with this quick hack.
- It changes the API of init_cmu. Or rather, it ignores the filehandle parameter completely. Perhaps that's fine if the file duties are delegated to cmudict. (See also parse_cmu, which takes a file handle.)
- The different return types mean extra processing, possibly a performance hit:
  - cmudict.entries() returns a (str, list) tuple (e.g. 'bout ['B', 'AW1', 'T'])
  - parse_cmu() returns a (str, str) tuple (e.g. 'bout B AW1 T)
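To make the shape difference concrete, here is a small sketch that converts the list-of-phones form into the space-joined string form. The sample data is hard-coded for illustration, and the helper name to_parse_cmu_style is hypothetical, not part of either library.

```python
# Hypothetical sample in the (str, list) shape described for cmudict.entries().
entries_style = [
    ("'bout", ["B", "AW1", "T"]),
    ("gdp", ["G", "IY1", "D", "IY1", "P", "IY1"]),
]


def to_parse_cmu_style(entries):
    """Convert (word, [phone, ...]) tuples into (word, "phone ...") tuples."""
    return [(word, " ".join(phones)) for word, phones in entries]


# Convert once, up front, so later loops can reuse the joined strings.
converted = to_parse_cmu_style(entries_style)
print(converted)
```

Converting once would also avoid joining each entry twice, which the quick hack does because it calls " ".join(phones) in both loops.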
Hmm, I think I'd prefer a solution that retains the ability to load custom cmudict-formatted data directly—I have used this feature a handful of times in my own projects.
Please see PR #53 to strip comments and retain the API.
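For readers who want the general idea without opening the PR, the following is a minimal sketch of a file-handle-based parser that drops trailing comments. It is not the code from PR #53. The function name parse_cmu_stripping_comments is hypothetical, and the sketch assumes that comments begin with whitespace followed by '#', that variant pronunciations are marked like word(2), and that full-line comments start with ';'.

```python
import io
import re


def parse_cmu_stripping_comments(filehandle):
    """Parse cmudict-formatted lines into (word, phones) tuples,
    dropping any trailing '# ...' comment. Illustrative sketch only."""
    entries = []
    for line in filehandle:
        if isinstance(line, bytes):
            line = line.decode("latin1")
        # Assumption: full-line comments start with ';'.
        if line.startswith(";"):
            continue
        # Only treat '#' as a comment marker when whitespace precedes it,
        # so a word that itself starts with '#' would be left alone.
        line = re.split(r"\s+#", line, maxsplit=1)[0].strip()
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # blank line or a word with no phones
        word, phones = parts
        # Assumption: variants are marked like 'spieth(2)'; drop the marker.
        word = re.sub(r"\(\d+\)$", "", word.lower())
        entries.append((word, " ".join(phones.split())))
    return entries


sample = io.StringIO(
    "gdp G IY1 D IY1 P IY1 # abbrev\n"
    "spieth S P IY1 TH # name\n"
    "spieth(2) S P AY1 AH0 TH # old\n"
    "'bout B AW1 T\n"
)
result = parse_cmu_stripping_comments(sample)
print(result)
```

Because the function still takes a file handle, custom cmudict-formatted data can be passed in the same way as before.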