Parse Wiktionary's HTML directly instead of using the API

Open johnfactotum opened this issue 1 year ago • 0 comments

Wiktionary's definition API is buggy and lacking in features. Parsing the HTML seems like a better option, and AFAICT the API itself is just parsing the HTML anyway.

The main problem is how to identify the language for each section. The API actually does a poor job of this. It's obvious that it's relying on a small, hardcoded name -> language code table, which would explain why uncommon languages (such as Old Norse) are not given the proper code but instead get dumped in a field called other (see for example https://en.wiktionary.org/api/rest_v1/page/definition/rannsaka). Since we already know the target language code, it would be better to get a display name from the Intl API and match that against the headings. Another option would be to look for elements with the .headword class and the desired lang attribute, but this won't work in cases where the headword template is absent (e.g. the example in https://github.com/johnfactotum/quick-lookup/issues/14#issuecomment-745098811, which only has a single {{see-ja}} template and nothing else).
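A minimal sketch of the display-name approach, assuming the headings have already been extracted from the page into plain objects with a `text` property. The function name `findHeadingForLanguage` and the shape of `headings` are hypothetical; only `Intl.DisplayNames` is a real API here, and whether its names always agree with Wiktionary's heading text is untested.

```javascript
// Hypothetical helper: pick the heading whose text matches the display name
// of the target language code. `headings` is assumed to be an array of
// objects with a `text` property, extracted from the page beforehand.
function findHeadingForLanguage(headings, langCode, locale = "en") {
  // Intl.DisplayNames maps a language code to a human-readable name
  // in the given locale. Note that .of() throws a RangeError if the
  // code is not structurally valid.
  const names = new Intl.DisplayNames([locale], { type: "language" });
  const wanted = names.of(langCode);
  if (!wanted) return undefined;

  // Compare case-insensitively and ignore surrounding whitespace.
  const normalize = (s) => s.trim().toLowerCase();
  return headings.find((h) => normalize(h.text) === normalize(wanted));
}

// Usage: look up the section for a known target language code.
const headings = [{ text: "English" }, { text: "French" }];
console.log(findHeadingForLanguage(headings, "fr"));
```

Exact string matching is the simplest choice. If the names from Intl and the headings on Wiktionary turn out to differ for some languages, a small override table would be needed on top of this.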

The above only applies to the English instance of Wiktionary. It would be good to support other Wiktionaries as well. The good news is that French and Spanish Wiktionary both use templates (https://fr.wiktionary.org/wiki/Mod%C3%A8le:langue and https://es.wiktionary.org/wiki/Plantilla:lengua) for the language headings, and they have ID attributes set to the language code.
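For the Wiktionaries that expose the language code as an ID, the lookup could be as simple as the sketch below. It assumes a DOM `Document` has already been built from the page's HTML. The function name is hypothetical, and the step from the ID-bearing element up to an enclosing heading is an assumption about the markup, not something verified against either site.

```javascript
// Hypothetical helper for Wiktionaries (per the above, French and Spanish)
// whose language headings carry the language code as an ID attribute.
// `doc` is assumed to be a DOM Document parsed from the page's HTML.
function findLanguageHeadingById(doc, langCode) {
  const anchor = doc.getElementById(langCode);
  if (!anchor) return null;

  // Assumption: the ID may sit on an element nested inside the heading,
  // so prefer the enclosing heading if there is one; otherwise use the
  // element itself.
  return anchor.closest("h1, h2, h3, h4, h5, h6") ?? anchor;
}
```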

Russian, Japanese, and Esperanto Wiktionary also use templates for the language headings, but the rendered HTML does not contain the language code, so that doesn't help at all. One needs to either copy the language names from the template's source code or use the Intl display name API.

After mapping each heading to a language, one only has to associate every other sibling element with the nearest preceding heading, and the rest should be relatively straightforward.
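The grouping step can be sketched as a single pass over the sibling nodes. The names `groupByHeading` and `languageOf` are hypothetical: `languageOf` stands for whichever heading-to-language mapping is used, and is assumed to return a language code for a language heading and `null` for any other node.

```javascript
// Hypothetical helper: walk a flat list of sibling nodes in document order
// and attach each non-heading node to the nearest preceding language heading.
// `languageOf(node)` is assumed to return a language code for language
// headings and null for everything else.
function groupByHeading(nodes, languageOf) {
  const sections = [];
  let current = null;
  for (const node of nodes) {
    const lang = languageOf(node);
    if (lang !== null) {
      // A new language heading starts a new section.
      current = { lang, heading: node, content: [] };
      sections.push(current);
    } else if (current) {
      // Anything else belongs to the most recent heading.
      current.content.push(node);
    }
    // Nodes that appear before the first language heading are skipped.
  }
  return sections;
}

// Usage with plain objects standing in for DOM elements.
const nodes = [
  { tag: "h2", lang: "en" },
  { tag: "p" },
  { tag: "h2", lang: "non" },
  { tag: "ol" },
];
const sections = groupByHeading(nodes, (n) => n.lang ?? null);
console.log(sections.map((s) => [s.lang, s.content.length]));
```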

johnfactotum · Nov 15 '22 18:11