pyglossary Adding an option for kindle output to work around the bugs in their dictionary lookup algorithm

Open Vuizur opened this issue 2 years ago • 7 comments

Is your feature request related to a problem? Please describe.

Kindle's lookup algorithm is implemented very badly. The first problem is that you cannot turn off fuzzy lookup: the docs say that you can, but this applies only to inflections, not to headwords. The second (even worse) problem is that the algorithm stops searching as soon as it finds a result among the headwords. This combination leads to stupid behaviour. For example, if you look up the word "osó", a past-tense form of the Spanish verb "osar" ("to dare"), you only get the dictionary entry for "oso", which means "bear", and nothing else, even though your dictionary correctly contains "osó" as an inflection of "osar". A related problem is that if a word is an inflection of two headwords, only the first headword is returned and the second is ignored, which is also annoying.

Describe the solution you'd like

I have lost hope that Amazon will ever fix these bugs, as they have apparently existed for more than 11 years. It is possible to create a dictionary that works around them: for each inflection that might conflict with another inflection or with another headword, you simply create a new headword with a duplicated definition. So for "osó", we create the headword "osó", set the headword HTML to (bolded) "osar", and simply copy the definition content from "osar".

The result is that we get a dictionary that is not really slower (as far as I could tell), but always finds all relevant headwords and is simply a much better experience.
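The workaround described above can be sketched roughly as follows. This is a minimal illustration and not the code from the linked function: the `entries` structure, the `fold` helper, and `add_workaround_entries` are hypothetical names, and `fold` only approximates Kindle's fuzzy matching by lowercasing and dropping combining marks.

```python
import unicodedata
from collections import defaultdict


def fold(word: str) -> str:
    """Assumed stand-in for Kindle's fuzzy matching: casefold, drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def add_workaround_entries(entries):
    """Duplicate definitions for inflections that collide with other words.

    entries maps headword -> (definition, list of inflections).
    Returns a list of (lookup_word, displayed_headword, definition) tuples.
    """
    # Group every headword and inflection by its folded form,
    # remembering which lemma each form belongs to.
    by_key = defaultdict(set)
    for headword, (_definition, inflections) in entries.items():
        by_key[fold(headword)].add((headword, headword))
        for form in inflections:
            by_key[fold(form)].add((form, headword))

    # Start with the normal entries, then add one extra headword per
    # conflicting inflection, showing the lemma and copying its definition.
    result = [(hw, hw, definition) for hw, (definition, _) in entries.items()]
    for group in by_key.values():
        lemmas = {lemma for _form, lemma in group}
        if len(lemmas) < 2:
            continue  # only one lemma behind this folded form: no conflict
        for form, lemma in sorted(group):
            if form != lemma:
                result.append((form, lemma, entries[lemma][0]))
    return result


entries = {
    "oso": ("bear", ["osos"]),
    "osar": ("to dare", ["osó", "osamos"]),
}
for lookup_word, shown, definition in add_workaround_entries(entries):
    print(lookup_word, "->", shown, ":", definition)
```

The same grouping also covers a form that is an inflection of two different headwords: both lemmas end up in one group, so the form gets one extra entry per lemma.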

I made an attempt to implement this in my function here. This solution works really well for the Spanish dictionary I generated. It currently uses unidecode, but that is a bad idea for languages other than Spanish, so it would have to be replaced by a generic function that simply removes all diacritics from a Unicode string.
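One common way to write such a generic function uses only the standard library's `unicodedata` module: decompose the string (NFD), drop the combining marks, and recompose. The name `remove_diacritics` is just an illustration, and whether this folding matches what Kindle does internally is an assumption.

```python
import unicodedata


def remove_diacritics(text: str) -> str:
    """Remove combining marks (accents, diaeresis, ...) but keep the script."""
    # NFD splits e.g. "ó" into "o" + U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Recompose whatever is left into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


print(remove_diacritics("osó"))    # Latin
print(remove_diacritics("ёлка"))   # Cyrillic stays Cyrillic
```

Unlike unidecode, this does not transliterate: Cyrillic or Greek text stays in its own script. Letters without a canonical decomposition, such as "ø" or "ł", are left unchanged, so they would need separate handling if Kindle treats them as equivalent to "o" or "l".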

It also uses a replaced version of a Pyglossary function to support the setting of the headword HTML independently of the headword, but this "patching" is of course a hacky solution, so I don't know how one would properly model this to fit into the PyGlossary architecture. So if you give me some pointers I could also try to open a Pull Request.

May 16 '22 10:05 Vuizur

> It also uses a replaced version of a Pyglossary function to support the setting of the headword HTML independently of the headword

Where can I see your changes to PyGlossary? I can't find any fork on your account.

May 18 '22 04:05 ilius

It is in the function I pasted in above my own code.

May 18 '22 08:05 Vuizur

You don't use this function in your repo. And you seem to have changed GROUP_XHTML_WORD_DEFINITION_TEMPLATE, but again not in that repo.

May 20 '22 07:05 ilius

I took the format_group_content function I pasted in the linked file and used it to replace the version in the site-packages folder of my venv (I know, this is probably quite stupid, but at least it worked for me locally).

~~I didn't change anything else I think. I checked GROUP_XHTML_WORD_DEFINITION_TEMPLATE and it is the same on my end as in the pyglossary current repo.~~

May 20 '22 08:05 Vuizur

Oops, sorry, I actually did change it:

	GROUP_XHTML_WORD_DEFINITION_TEMPLATE = """<idx:entry \
scriptable="yes"{spellcheck_str}>
<idx:orth{headword_hide}>{headword_html}{infl}
</idx:orth>
<br/>{definition}
</idx:entry>
<hr/>"""
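As a rough illustration of what this changed template produces for the "osó" example, it can be filled with `str.format`. The values passed below are assumptions made for this sketch, not the values the patched `format_group_content` necessarily passes; in particular, filling `headword_hide` with a `value="..."` attribute on `idx:orth` is a guess at how the lookup word is kept separate from the displayed HTML.

```python
from html import escape

# Template as posted above (the backslash joins the first two lines).
GROUP_XHTML_WORD_DEFINITION_TEMPLATE = """<idx:entry \
scriptable="yes"{spellcheck_str}>
<idx:orth{headword_hide}>{headword_html}{infl}
</idx:orth>
<br/>{definition}
</idx:entry>
<hr/>"""

lookup_word = "osó"  # the word Kindle should match on (assumed usage)
entry_xhtml = GROUP_XHTML_WORD_DEFINITION_TEMPLATE.format(
    spellcheck_str="",
    # Assumption: the lookup word goes into the value attribute.
    headword_hide=f' value="{escape(lookup_word, quote=True)}"',
    headword_html="<b>osar</b>",  # what the reader sees as the headword
    infl="",                      # no extra inflections for this entry
    definition="to dare",         # copied from the entry for "osar"
)
print(entry_xhtml)
```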

May 20 '22 11:05 Vuizur

I worked more on this and created a fork with the changes that allow setting a completely separate HTML for each word: https://github.com/Vuizur/pyglossary. This fork should be fully compatible with normal usage. Only if a lemma/inflection list begins with the string "HTML_HEAD" is the HTML that follows it displayed, and the next entry is then used as the value_headword.
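One possible reading of this convention, sketched as a small parser. This is not taken from the fork: `ParsedWords` and `parse_word_list` are hypothetical names, and the sketch assumes the marker, the display HTML, and the lookup headword are three separate list items.

```python
from typing import List, NamedTuple, Optional

HTML_HEAD_MARKER = "HTML_HEAD"


class ParsedWords(NamedTuple):
    display_html: Optional[str]  # None means: render the headword as usual
    value_headword: str          # the word used for lookup
    inflections: List[str]


def parse_word_list(words: List[str]) -> ParsedWords:
    """Split a lemma/inflection list, honouring the assumed HTML_HEAD layout."""
    if words and words[0] == HTML_HEAD_MARKER:
        if len(words) < 3:
            raise ValueError("HTML_HEAD needs the HTML and a headword after it")
        _marker, html, headword, *rest = words
        return ParsedWords(html, headword, rest)
    # Normal usage: first item is the headword, the rest are inflections.
    headword, *rest = words
    return ParsedWords(None, headword, rest)


print(parse_word_list(["HTML_HEAD", "<b>osar</b>", "osó"]))
print(parse_word_list(["osar", "osó", "osamos"]))
```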

The steps that would be left to get it working are finding a way to add a kindle generation option like "fix_kindle_not_finding_inflections" and then executing a function that converts the input glossary to an intermediate "fixed" glossary, as is done here: https://github.com/Vuizur/pyglossary-kindle-test/blob/master/pyglossary_kindle_test/edit_dictionary.py#L38 (I just have not found a way to iterate over the glossary data itself, so I used a list which has essentially the same structure.) The project at https://github.com/Vuizur/pyglossary-kindle-test shows how to convert a tabfile to a fixed kindle dictionary.

Jul 26 '22 13:07 Vuizur

Can you create a pull request?

Jul 30 '22 15:07 ilius
