ebook-reader-dict Pronunciation output: "colon space" before, "\" and other issues.

I always wonder why we have the "colon space" artifact before the pronunciation.

Wouldn’t it be better to show which phonetic alphabet is shown instead? As in:

IPA: [trɑːnsˈkrɪpʃn̩] X-SAMPA: [trA:ns"krIpSn_=]

(X-SAMPA is often used in Text-to-Speech systems, dictionaries mostly use the IPA.)

I have no idea how many entries in the Wiktionaries are using SAMPA or X-SAMPA, probably only a few. Might still be helpful to show which, don’t you think? Or only take the IPA, but then remove the ": " artifact.

IPA has no backslash, as far as I know. But we still generate things like

: \ˈwɪkʃən(ə)ɹi, \ˈwɪkʃənɹɪ\

which I believe are leftover artifacts from somewhere having quotes escaped.

Traditionally, IPA pronunciation is also enclosed in square brackets (as shown above), but I don’t know the reason for it. Should we adopt that?

EDIT: Found it: https://en.wikipedia.org/wiki/International_Phonetic_Alphabet#Brackets_and_transcription_delimiters

EDIT 2: The word "Wiktionary" (EN) is given as

Pronunciation

    (UK) IPA(key): /ˈwɪkʃən(ə)ɹi/, (Received Pronunciation) IPA(key): /ˈwɪkʃənɹɪ/

in the EN Wiktionary.

Currently, we show it as:

: \ˈwɪkʃən(ə)ɹi, \ˈwɪkʃənɹɪ\

We should be sure to take the whole definitions (including slashes, brackets, stress marks) into our output, so more like:

IPA: /ˈwɪkʃən(ə)ɹi/, /ˈwɪkʃənɹɪ/

or (without the "IPA: "):

/ˈwɪkʃən(ə)ɹi/, /ˈwɪkʃənɹɪ/

Call me a nitpicker—I just love quality! :-)

Jan 31 '22 09:01 Moonbase59

I always wonder why we have the "colon space" artifact before the pronunciation.

The leading colon+space is a glitch; it should not be there.

Traditionally, IPA pronunciation is also enclosed in square brackets (as shown above), but I don’t know the reason for it. Should we adopt that?

Backslashes are used in the French Wiktionary, and we used it as a basis for all other dicts. Each locale has its own way of displaying IPAs, so let's go with the brackets for all of them :+1:

Let's also tackle multiple IPAs on the English Wiktionary, thanks for the report :)

Jan 31 '22 10:01 BoboTiG

Stop… Interesting to know the French use backslashes. And after reading the EN Wiki explanation mentioned above, I’d instead opt for taking what’s there (in the Wiktionary). Including whatever "boundary characters" they use.

Btw, reading it in French (which I don’t speak), it looks like they also use /…/ and […]? What do your printed dictionaries look like? Interestingly, the French Wiktionary indeed uses backslashes, see the entry for "test".

Rationale: Our dicts should be as professional and usable as possible. Agree? So, if different countries use different symbols, it might possibly be better to use these, sacrificing just a little uniformity.

Since we’re currently producing only reference dictionaries, not translation dictionaries, it might be wise to stick with what the users of each country are used to (and what’s correct for them). A foreigner has to learn what’s correct for the selected language, right? (As he would have to learn the language.) And local users will feel "at home".

Maybe we can get @chopinesque’s feedback on this, since (s)he is a pro user?

Jan 31 '22 10:01 Moonbase59

Agreed, localization has to do with adapting things to the user's locale. If we have things adapted to their locale and then we somehow "normalize" them to fit our standardized approach, they may not feel perfectly "at home".

For example, the French tend to use non-breaking spaces before a number of characters, including colon (:). This is something we would never do in English. So if we had an Anglocentric normalization approach, all these thin spaces would go.

That said, if we are presenting multilingual data then there may have to be a marriage between locale-specific idiosyncrasies and convenience. At the end of the day, the person(s) making all the effort have to decide whether any extra work required is worth the trouble, or whether they have time for that extra work.

Jan 31 '22 10:01 chopinesque

I am +1 on using what is defined by the locale.
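A minimal sketch of what "use what the locale defines" could look like. The function name, the `DELIMITERS` table and the fallback are hypothetical and not taken from wikidict; only the French backslashes and the English slashes come from this thread.

```python
# Hypothetical per-locale delimiters. Only "en" and "fr" are based on
# what was observed in this thread; other locales would need checking.
DELIMITERS = {
    "en": ("/", "/"),
    "fr": ("\\", "\\"),
}
DEFAULT_DELIMITERS = ("/", "/")  # assumption, used for unknown locales


def format_pronunciations(pronunciations, locale):
    """Wrap each transcription in the locale's delimiters, comma-separated."""
    opening, closing = DELIMITERS.get(locale, DEFAULT_DELIMITERS)
    return ", ".join(f"{opening}{p}{closing}" for p in pronunciations)


print(format_pronunciations(["ˈwɪkʃən(ə)ɹi", "ˈwɪkʃənɹɪ"], "en"))
print(format_pronunciations(["tɛst"], "fr"))
```

Wrapping every transcription separately also avoids output where only the last item gets its closing delimiter.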

Jan 31 '22 11:01 BoboTiG

I can't reproduce the ":" in the pronunciation. Can anyone explain how to see it?

Jan 31 '22 19:01 lasconic

I simply downloaded the EN StarDict and looked up the word "Wiktionary" (using GoldenDict on Linux). We have it there. Probably a leftover artifact from removing the "IPA" before the pronunciation, I think.

Jan 31 '22 20:01 Moonbase59

Mmm weird,

I don't see it with

python -m wikidict en --get-word "Wiktionary" --raw

Feb 01 '22 10:02 lasconic

Interesting. Your command looks good here, too.

But if you have a peek into data/en/dict-en-en.df, it looks like this:

@ Wiktionary
: \ˈwɪkʃən(ə)ɹi\, \ˈwɪkʃənɹɪ\ 
<html><p>Blend of <i>wiki</i>&nbsp;+&nbsp;<i>dictionary</i>.</p></br>
<ol><li>A collaborative project run by the Wikimedia Foundation to produce a free and complete dictionary in every language; the dictionaries, collectively, produced by that project.</li><li>A particular version of this dictionary project, written in a certain language, such as the English-language Wiktionary (often known simply as the English Wiktionary).</li></ol>

(for all entries having pronunciation)

It seems the Kobo dicthtml does not have the "colon space" either. Taken from data/en/tmp/wi.raw.html (beautified):

<w>
  <p><a name="Wiktionary" /><b>Wiktionary</b> \ˈwɪkʃən(ə)ɹi\, \ˈwɪkʃənɹɪ\<br /><br />
  <p>Blend of <i>wiki</i>&nbsp;+&nbsp;<i>dictionary</i>.</p></br>
  <ol>
    <li>A collaborative project run by the Wikimedia Foundation to produce a free and complete dictionary in every language; the dictionaries, collectively, produced by that project.</li>
    <li>A particular version of this dictionary project, written in a certain language, such as the English-language Wiktionary (often known simply as the English Wiktionary).</li>
  </ol>
  </p>
</w>

… which brings me to the next bug: </br>?! Probably a typo, meant to be <br/>?
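As a rough sketch of one possible fix, the invalid closing form could be normalized with a small regex. The helper name `fix_br` is hypothetical, and this is not the code in convert.py; whether the tag should be replaced or dropped entirely is a separate question.

```python
import re

# "br" is a void element, so "</br>" is not a valid closing tag.
BAD_BR = re.compile(r"</br\s*>", re.IGNORECASE)


def fix_br(html):
    """Replace every invalid '</br>' with a self-closing '<br/>'."""
    return BAD_BR.sub("<br/>", html)


print(fix_br("<p>Blend of <i>wiki</i>.</p></br>"))
```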

Feb 01 '22 10:02 Moonbase59

It seems to be an artefact from when PyGlossary creates the StarDict :thinking:

I guess this has been the case since we introduced StarDict support, but as I never used it, it may have gone under our radar.

Feb 01 '22 10:02 BoboTiG

So we’re a perfect match: I almost exclusively use StarDict! :grin:

Feb 01 '22 10:02 Moonbase59

The bad </br> is here: https://github.com/BoboTiG/ebook-reader-dict/blob/9a02781b5f0840520aad2c9def08ba87137bac1c/wikidict/convert.py#L75

Feb 01 '22 10:02 Moonbase59

Oh good catch! If that is the issue, mind opening a PR? :)

Feb 01 '22 11:02 BoboTiG

If this bad br went unnoticed for so long, maybe it should be removed?

Feb 01 '22 11:02 lasconic

We need to check; I guess it depends on the flexibility of the HTML parser.

If it is useless on Kobo too, then let's remove it, yeah.

Feb 01 '22 12:02 BoboTiG

Ok. I'll file another issue for the BR.

For the colon, we output it correctly in the df file. The colon is necessary there! See https://pgaskin.net/dictutil/dictgen/ So it's a bug in PyGlossary: it should remove the ":" when parsing the file.
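To illustrate the expected behaviour, here is a minimal sketch of a reader that treats the ": " prefix as a marker rather than as content. It only assumes the layout of the df sample shown earlier in this thread ("@ " headword line, optional ": " header line, then the HTML body); `parse_df_entry` is a hypothetical name and this is not PyGlossary's actual reader.

```python
def parse_df_entry(text):
    """Split one dictfile-style entry into (headword, header, body).

    Assumed layout, based on the sample in this thread:
      "@ <headword>"   first line
      ": <header>"     optional, holds the pronunciation
      everything else  HTML definition
    """
    headword, header, body = "", "", []
    for line in text.splitlines():
        if line.startswith("@ ") and not headword:
            headword = line[2:].strip()
        elif line.startswith(": ") and not header and not body:
            # Drop the ": " marker so it does not leak into the output.
            header = line[2:].strip()
        else:
            body.append(line)
    return headword, header, "\n".join(body).strip()


entry = "@ Wiktionary\n: \\ˈwɪkʃən(ə)ɹi\\, \\ˈwɪkʃənɹɪ\\ \n<html><p>Blend</p>"
print(parse_df_entry(entry))
```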

Feb 01 '22 12:02 lasconic

I filed an issue https://github.com/ilius/pyglossary/issues/358

Feb 01 '22 12:02 lasconic

For the colon, we output it correctly in the df file. The colon is necessary there! See https://pgaskin.net/dictutil/dictgen/

Good catch! So that’s what puts the pronunciation next to the title… (looks odd to me).

Feb 01 '22 13:02 Moonbase59

So that’s what puts the pronunciation next to the title… (looks odd to me).

The default dictionary on Kobo does that.

Feb 03 '22 13:02 lasconic

I filed an issue https://github.com/ilius/pyglossary/issues/358

Please let me know if it's fixed.

Feb 04 '22 10:02 ilius
