2020-12-15: Japanese glosario page is not in alphabetical order

Open masamiy opened this issue 3 years ago • 8 comments

https://carpentries.github.io/glosario/ja/ lists Japanese entries by the first character of each entry. This means the entries are not ordered by the Japanese alphabet (nor by the English alphabet), but by character. The last entry, 'function', should be at the top of the current list, as it is read as 'kansuu', if terms are ordered by the Japanese alphabet. As there are 46+ characters in the Japanese alphabet, I feel we need some indexing strategy.

masamiy avatar Dec 15 '20 06:12 masamiy

@masamiy

The order of the entries is determined by a sort function on line 16 of _includes/glossary.html, which operates on individual characters. It may be that for languages such as Japanese we need to find a different solution entirely. The sort function currently being used is a Liquid one, and I very much doubt Liquid has a different one that will sort Japanese correctly. I am familiar with the website infrastructure and the code, but I don't know the Japanese alphabet, so this isn't something I can fix on my own.

I can see two options for solving it:

  1. We find or write a function in something like Ruby or Python to sort Japanese (and any other language that has this problem), based on an input list of the alphabet, if need be.
  2. We move to a different system for storing definitions other than a YAML file so that the sorting can take place at a slightly different step. An example would be an SQLite database which can export its contents, or part of its contents, as a YAML or other config-type file. This involves more changes to the infrastructure, though.
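A minimal sketch of what option 1 could look like in Python, assuming the alphabet is supplied as an ordered list of characters. The names `make_sort_key` and `HIRAGANA_SAMPLE` are hypothetical, and the sample covers only a fragment of the hiragana order, so this illustrates the idea rather than a finished sorter:

```python
def make_sort_key(alphabet):
    """Build a sort key from an explicit, ordered list of characters."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    unknown = len(alphabet)

    def key(term):
        # Characters missing from the alphabet sort after all known ones,
        # falling back to code-point order among themselves.
        return [(rank.get(ch, unknown), ord(ch)) for ch in term]

    return key


# Illustrative fragment only (assumption): not a complete Japanese alphabet.
HIRAGANA_SAMPLE = list("あいうえおかきくけこさしすせそ")

terms = ["すうじ", "かんすう", "あたい", "こうぞう"]
sorted_terms = sorted(terms, key=make_sort_key(HIRAGANA_SAMPLE))
print(sorted_terms)
```

The same key builder would work for any language whose order can be written down as a list, which is why the alphabet is passed in rather than hard-coded.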

Perhaps @fmichonneau or @gvwilson will have another idea?

baileythegreen avatar Dec 15 '20 10:12 baileythegreen

It looks like option 1 is going to be the way to go. From a quick search, I saw MeCab being mentioned regularly, but that's Japanese-specific and wouldn't work for Arabic, Hebrew, Amharic, etc. From my limited understanding of this, I think the ICU library would order the characters correctly. In R, it's implemented by the stringi/stringr packages; in Python, by PyICU.

fmichonneau avatar Dec 15 '20 15:12 fmichonneau

I think option 1 is certainly easier to implement in the short term. I can take a stab at writing Python code to do this, though I may need someone to verify the output in those languages.

If I do this, unless someone has an objection, I'll probably try to remove the sort logic from _includes/glossary.html entirely and use one script to do all alphabetising, rather than have it happen in different places based on the language in question.

baileythegreen avatar Dec 15 '20 15:12 baileythegreen

Hi @baileythegreen @fmichonneau, thank you for your attention and suggestions. New sort logic will definitely help for non-alphabetic languages. I am happy to check the Japanese output. Please let me know if there is anything I can help with.

masamiy avatar Dec 15 '20 23:12 masamiy

@masamiy It'll probably take me a couple of days to get to it because I have some deadlines coming up, but I'll tag you when I do, unless @fmichonneau beats me to it.

baileythegreen avatar Dec 15 '20 23:12 baileythegreen

Take your time :)

masamiy avatar Dec 16 '20 00:12 masamiy

@masamiy I think the issue is the mixture of Romaji, Katakana, and Kanji in the terms defined. It's sorting them correctly (as expected for a character-based sort).
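A small illustration of why mixed scripts end up grouped this way, assuming the site's sort compares strings by code point as Python's default `sorted` does (the Liquid filter itself is not shown here, and the term list is made up for the example):

```python
# Hypothetical terms mixing Latin letters, hiragana, katakana, and kanji.
terms = ["関数", "バグ", "API", "へんすう"]

by_code_point = sorted(terms)
for term in by_code_point:
    # Show the code point of the first character, which decides the order here.
    print(term, hex(ord(term[0])))
```

Latin letters come first, then hiragana, then katakana, then kanji, regardless of how each term is read aloud.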

I see two solutions:

  1. Give the terms in Hiragana first and it will sort by them. This could make searching them difficult (do the packages support partial matches?).

  2. Write a custom script that sorts differently depending on the language (as proposed above). There should be existing solutions for sorting Japanese characters, but I think it's working as expected now.

Either way, furigana (kanji readings) would need to be supported and added for each entry in order to sort by them (for option 2 this would need a new slot, I think).

You cannot parse furigana from Kanji automatically (although some databases already exist). I think it is easier to specify the intended reading for each entry.
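A rough sketch of sorting by a specified reading, assuming each entry carries a hypothetical `reading` field written in kana. Katakana is folded to hiragana so both scripts sort together. Plain code-point order of hiragana only approximates dictionary order (voiced marks, small kana, and the long-vowel mark are not treated specially), so a real implementation would likely need a proper collation library:

```python
def katakana_to_hiragana(text):
    """Fold katakana to hiragana so readings in either script compare alike."""
    # Katakana U+30A1..U+30F6 sit exactly 0x60 above their hiragana counterparts.
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


# Hypothetical entries; the `reading` slot is an assumption, not an existing field.
entries = [
    {"term": "関数", "reading": "かんすう"},
    {"term": "バグ", "reading": "バグ"},
    {"term": "変数", "reading": "へんすう"},
    {"term": "引数", "reading": "ひきすう"},
]

by_reading = sorted(entries, key=lambda e: katakana_to_hiragana(e["reading"]))
print([e["term"] for e in by_reading])
```

With readings supplied by hand, 関数 ('kansuu') sorts ahead of the other terms instead of landing at the end of the list.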

TomKellyGenetics avatar Dec 16 '20 02:12 TomKellyGenetics

Regarding the order of the entries, the languages on the homepage may need to be changed as well (this is done manually as I understand it).

Sorry, this may need its own issue. (See #259)

TomKellyGenetics avatar Dec 16 '20 03:12 TomKellyGenetics