
[website] Inconsistent pagination

Open daxida opened this issue 1 month ago • 5 comments

I have a simple function that builds Kaikki links from a word, an edition and a target language. I just realized that it does not work, because the pagination (the language name in the URL path) is sometimes (most of the time) dependent on the edition's language.

Consider:

  • https://kaikki.org/dictionary/German/meaning/R/Ro/Rock.html
  • https://kaikki.org/dictionary/English/meaning/R/Ro/Rock.html
  • https://kaikki.org/elwiktionary/Greek/meaning/τ/τρ/τρέχω.html
  • https://kaikki.org/dictionary/Greek/meaning/τ/τρ/τρέχω.html

but, and this is what happens for every edition I tried other than those two:

  • https://kaikki.org/dewiktionary/Deutsch/meaning/R/Ro/Rock.html
  • https://kaikki.org/dewiktionary/Neugriechisch/meaning/τ/τρ/τρίτος.html

instead of German/Greek.

It makes sense for the English edition, but why does the Greek edition have the language names in English?

I would rather see English used everywhere instead of changing Greek to Ελληνικά in the Greek edition, but I am not entirely unbiased. Let me know what you think. It should be possible; otherwise I don't know how the Greek edition does it...

I am aware that I can use "All%20languages%20combined" at that position, but it adds noise when debugging.
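
For reference, here is a minimal sketch of the kind of link builder described above. The function name and parameters are hypothetical, and the path layout (first character, first two characters, then the word) is only inferred from the example URLs in this issue, not from any Kaikki documentation. The caller has to pass the language name exactly as the edition spells it, which is the inconsistency being reported.

```python
from urllib.parse import quote

BASE_URL = "https://kaikki.org"


def kaikki_url(word: str, edition: str, lang_name: str) -> str:
    """Build a Kaikki word-page URL (hypothetical helper).

    edition:   Wiktionary edition code, e.g. "en", "el", "de".
    lang_name: language name as it appears in that edition's paths,
               e.g. "German" for "en" but "Deutsch" for "de".
    """
    # The English edition lives under "dictionary"; the others under
    # "<code>wiktionary" (as seen in the URLs quoted in this issue).
    root = "dictionary" if edition == "en" else f"{edition}wiktionary"
    # Assumption: the two prefix directories are the first one and
    # first two characters of the word.
    prefix1, prefix2 = word[:1], word[:2]
    # Percent-encode the language segment so names with spaces
    # ("All languages combined") produce a valid path.
    lang_segment = quote(lang_name, safe="")
    return f"{BASE_URL}/{root}/{lang_segment}/meaning/{prefix1}/{prefix2}/{word}.html"
```

With this sketch, kaikki_url("Rock", "en", "German") and kaikki_url("Rock", "de", "Deutsch") reproduce the first and fifth links above, but only because the caller already knows which spelling each edition expects.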

It would also be nice to change dictionary to enwiktionary but iirc that was rejected in some other issue.

daxida avatar Nov 21 '25 09:11 daxida

lang_name here: https://github.com/tatuylonen/wiktextract/blob/01fc53eff7d40fa7187e656439d58bed1692d32e/src/wiktextract/extractor/el/page.py#L117

should use the language section title text, or this line: https://github.com/tatuylonen/wiktextract/blob/01fc53eff7d40fa7187e656439d58bed1692d32e/src/wiktextract/extractor/el/page.py#L311

should be changed to code_to_name(lang_code, "el")

xxyzz avatar Nov 21 '25 09:11 xxyzz

Or the other editions should have used the English language names instead of native language names. Either way, I don't think it's too late to change the Greek one (instead of changing a bunch of other extractors).

kristian-clausal avatar Nov 21 '25 10:11 kristian-clausal

Ok, it turns out that changing the Greek edition to use Greek names is more annoying than I thought; I'll do it later when I have time.

EDIT: Or for consistency with the original edition we could change the data (which is presented in English) into English.

kristian-clausal avatar Nov 21 '25 10:11 kristian-clausal

This field is defined to hold the localized name... https://github.com/tatuylonen/wiktextract/blob/01fc53eff7d40fa7187e656439d58bed1692d32e/src/wiktextract/extractor/el/models.py#L234

I guess I saved the original localized name to the lang field probably because the language code is converted from the language name or a template argument. Some language names cannot be converted to a code, and keeping the original text is slightly better than an "unknown" value.
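
A rough sketch of that fallback idea, not the actual extractor code: the lookup table and function name below are hypothetical stand-ins for whatever name-to-code conversion the extractor really uses.

```python
# Hypothetical, deliberately tiny name-to-code table for illustration.
NAME_TO_CODE = {
    "Γερμανικά": "de",  # German
    "Αγγλικά": "en",    # English
    "Ελληνικά": "el",   # Greek
}


def resolve_language(section_title: str) -> tuple[str, str]:
    """Return (lang, lang_code) for a language section title.

    The original localized text is always kept as `lang`, so an entry
    whose name cannot be converted still carries something readable
    next to the "unknown" code.
    """
    lang = section_title.strip()
    lang_code = NAME_TO_CODE.get(lang, "unknown")
    return lang, lang_code
```

The design point is simply that the name is preserved verbatim, while only the code falls back to a placeholder.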

xxyzz avatar Nov 21 '25 10:11 xxyzz

Yeah, I copy-pasted that from your extractors, so the description was left in. Of course, I didn't notice that at the time or register what it would mean.

kristian-clausal avatar Nov 21 '25 10:11 kristian-clausal