ebook-reader-dict
Pronunciation output: "colon space" before, "\" and other issues.
I always wonder why we have the "colon space" artifact before the pronunciation.
Wouldn’t it be better to show which phonetic alphabet is shown instead? As in:
IPA: [trɑːnsˈkrɪpʃn̩] X-SAMPA: [trA:ns"krIpSn_=]
(X-SAMPA is often used in Text-to-Speech systems, dictionaries mostly use the IPA.)
I have no idea how many entries in the Wiktionaries are using SAMPA or X-SAMPA, probably only a few. Might still be helpful to show which, don’t you think? Or only take the IPA, but then remove the ": " artifact.
IPA has no backslash, as far as I know. But we still generate things like
: \ˈwɪkʃən(ə)ɹi, \ˈwɪkʃənɹɪ\
which I believe are leftover artifacts from somewhere having quotes escaped.
Traditionally, IPA pronunciation is also enclosed in square brackets (as shown above), but I don’t know the reason for it. Should we adopt that?
EDIT: Found it: https://en.wikipedia.org/wiki/International_Phonetic_Alphabet#Brackets_and_transcription_delimiters
EDIT 2: The word "Wiktionary" (EN) is given as
Pronunciation
(UK) IPA(key): /ˈwɪkʃən(ə)ɹi/, (Received Pronunciation) IPA(key): /ˈwɪkʃənɹɪ/
in the EN Wiktionary.
Currently, we show it as:
: \ˈwɪkʃən(ə)ɹi, \ˈwɪkʃənɹɪ\
We should be sure to take the whole definitions (including slashes, brackets, stress marks) into our output, so more like:
IPA: /ˈwɪkʃən(ə)ɹi/, /ˈwɪkʃənɹɪ/
or (without the "IPA: "):
/ˈwɪkʃən(ə)ɹi/, /ˈwɪkʃənɹɪ/
Call me a nitpicker—I just love quality! :-)
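The proposal above (keep each transcription exactly as the Wiktionary gives it, delimiters included, and join several of them with a comma) could be sketched as follows. This is only an illustration: `format_pronunciations` and its `label` parameter are hypothetical names, not functions from wikidict.

```python
def format_pronunciations(pronunciations: list[str], label: str = "") -> str:
    """Join transcriptions verbatim, keeping whatever delimiters they carry."""
    # Drop empty items, but never touch slashes, brackets or stress marks.
    cleaned = [p.strip() for p in pronunciations if p.strip()]
    if not cleaned:
        return ""
    joined = ", ".join(cleaned)
    # The label ("IPA", "X-SAMPA", ...) is optional; no stray ": " without it.
    return f"{label}: {joined}" if label else joined


if __name__ == "__main__":
    print(format_pronunciations(["/ˈwɪkʃən(ə)ɹi/", "/ˈwɪkʃənɹɪ/"], label="IPA"))
    print(format_pronunciations(["/ˈwɪkʃən(ə)ɹi/", "/ˈwɪkʃənɹɪ/"]))
```

With a label this yields the "IPA: …" variant, and without one the bare variant, so the leading ": " can only appear together with a label.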
I always wonder why we have the "colon space" artifact before the pronunciation.
Leading colon+space are a glitch, they should not be there.
Traditionally, IPA pronunciation is also enclosed in square brackets (as shown above), but I don’t know the reason for it. Should we adopt that?
Backslashes are used in the French Wiktionary, and we used it as a basis for all other dicts. Each locale has its own way of displaying IPAs, so let's go with the brackets for all of them :+1:
Let's also tackle multiple IPAs on the English Wiktionary, thanks for the report :)
Stop… Interesting to know the French use backslashes. And after reading the EN Wiki explanation mentioned above, I’d instead opt for taking what’s there (in the Wiktionary). Including whatever "boundary characters" they use.
Btw, reading it in French (which I don’t speak), it looks like they also use /…/ and […]? What do your printed dictionaries look like? Interestingly, the French Wiktionary indeed uses backslashes, see the entry for "test".
Rationale: Our dicts should be as professional and usable as possible. Agree? So, if different countries use different symbols, it might possibly be better to use these, sacrificing just a little uniformity.
Since we’re currently producing only reference dictionaries, not translation dictionaries, it might be wise to stick with what the users of each country are used to (and what’s correct for them). A foreigner has to learn what’s correct for the selected language, right? (As he would have to learn the language.) And local users will feel "at home".
Maybe we can get @chopinesque’s feedback on this, since (s)he is a pro user?
Agreed, localization has to do with adapting things to the user's locale. If we have things adapted to their locale and then we somehow "normalize" them to fit our standardized approach, they may not feel perfectly "at home".
For example, the French tend to use non-breaking spaces before a number of characters, including colon (:). This is something we would never do in English. So if we had an Anglocentric normalization approach, all these thin spaces would go.
That said, if we are presenting multilingual data then there may have to be a marriage between locale-specific idiosyncrasies and convenience. At the end of the day, the person(s) making all the effort have to decide whether any extra work required is worth the trouble, or whether they have time for that extra work.
I am +1 on using what is defined by the locale.
I can't reproduce the : in pronunciation. Can anyone explain how to see it?
I simply downloaded the EN StarDict and looked up the word "Wiktionary" (using GoldenDict on Linux). We have it there. Probably a leftover artifact from removing the "IPA" before the pronunciation, I think.
Mmm weird,
I don't see it with
python -m wikidict en --get-word "Wiktionary" --raw
Interesting. Your command looks good here, too.
But if you have a peek into data/en/dict-en-en.df, it looks like this:
@ Wiktionary
: \ˈwɪkʃən(ə)ɹi\, \ˈwɪkʃənɹɪ\
<html><p>Blend of <i>wiki</i> + <i>dictionary</i>.</p></br>
<ol><li>A collaborative project run by the Wikimedia Foundation to produce a free and complete dictionary in every language; the dictionaries, collectively, produced by that project.</li><li>A particular version of this dictionary project, written in a certain language, such as the English-language Wiktionary (often known simply as the English Wiktionary).</li></ol>
(for all entries having pronunciation)
It seems the Kobo dicthtml also does not have the "colon space". Taken from data/en/tmp/wi.raw.html (beautified):
<w>
<p><a name="Wiktionary" /><b>Wiktionary</b> \ˈwɪkʃən(ə)ɹi\, \ˈwɪkʃənɹɪ\<br /><br />
<p>Blend of <i>wiki</i> + <i>dictionary</i>.</p></br>
<ol>
<li>A collaborative project run by the Wikimedia Foundation to produce a free and complete dictionary in every language; the dictionaries, collectively, produced by that project.</li>
<li>A particular version of this dictionary project, written in a certain language, such as the English-language Wiktionary (often known simply as the English Wiktionary).</li>
</ol>
</p>
</w>
… which brings me to the next bug: </br>?! Probably a typo, meant to be <br/>?
It seems to be an artefact from when PyGlossary creates the StarDict :thinking:
I guess this has been the case since we introduced StarDict support, but as I never used it, it may have slipped under our radar.
So we’re a perfect match: I almost exclusively use StarDict! :grin:
The bad </br> is here: https://github.com/BoboTiG/ebook-reader-dict/blob/9a02781b5f0840520aad2c9def08ba87137bac1c/wikidict/convert.py#L75
Oh good catch! If that is the issue, mind opening a PR? :)
If this bad br went unnoticed for so long, maybe it should be removed?
We need to check; I guess it depends on the flexibility of the HTML parser.
If it is useless on Kobo too, then let's remove it, yeah.
Ok. I'll file another issue for the BR.
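To see whether other stray closing tags like </br> exist in the generated HTML, a small checker along these lines might help. It is a minimal sketch using only Python's standard `html.parser`; the class name `StrayVoidEndTagFinder` and the `VOID_ELEMENTS` set are assumptions made for illustration, not part of wikidict.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (illustrative subset).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta", "wbr"}


class StrayVoidEndTagFinder(HTMLParser):
    """Collect closing tags such as </br> that should not exist."""

    def __init__(self) -> None:
        super().__init__()
        self.stray: list[str] = []

    def handle_startendtag(self, tag, attrs):
        # Self-closing forms like <br /> are fine; do not report them.
        # (The default implementation would also call handle_endtag.)
        pass

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            self.stray.append(tag)


if __name__ == "__main__":
    finder = StrayVoidEndTagFinder()
    finder.feed("<p>Blend of <i>wiki</i> + <i>dictionary</i>.</p></br><ol><li>x</li></ol><br />")
    finder.close()
    print(finder.stray)
```

How a given reader (Kobo, GoldenDict, …) treats such a tag still has to be checked on the device; the sketch only locates the occurrences.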
For the colon, we output it correctly in the df file. The colon is necessary there! See https://pgaskin.net/dictutil/dictgen/ So it's a bug in pyglossary. It should remove the : when parsing the file.
I filed an issue https://github.com/ilius/pyglossary/issues/358
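For illustration, stripping the marker while reading a df entry could look roughly like the sketch below. It assumes only the layout visible in the excerpt above (an "@ " headword line, an optional ": " line right after it, then the HTML body). `parse_df_entry` is a hypothetical helper and not pyglossary's actual reader.

```python
def parse_df_entry(entry: str) -> tuple[str, str, str]:
    """Split one df entry into (headword, pronunciation, body)."""
    headword = ""
    pronunciation = ""
    body: list[str] = []
    for line in entry.splitlines():
        if not headword and line.startswith("@ "):
            headword = line[2:].strip()
        elif headword and not pronunciation and not body and line.startswith(": "):
            # The ": " prefix is a format marker, not part of the pronunciation.
            pronunciation = line[2:].strip()
        else:
            body.append(line)
    return headword, pronunciation, "\n".join(body)


if __name__ == "__main__":
    sample = "\n".join([
        "@ Wiktionary",
        ": \\ˈwɪkʃən(ə)ɹi\\, \\ˈwɪkʃənɹɪ\\",
        "<html><p>Blend of <i>wiki</i> + <i>dictionary</i>.</p>",
    ])
    print(parse_df_entry(sample))
```

The pronunciation comes back without the leading ": ", which is what the StarDict output would need.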
For the colon, we output it correctly in the df file. The colon is necessary there! See https://pgaskin.net/dictutil/dictgen/
Good catch! So that’s what puts the pronunciation next to the title… (looks odd to me).
So that’s what puts the pronunciation next to the title… (looks odd to me).
The default dictionary on Kobo does that.
I filed an issue https://github.com/ilius/pyglossary/issues/358
Please let me know if it's fixed.