
Index page doesn't apply collation for Latin diacritics

Open ThomasWaldmann opened this issue 10 years ago • 5 comments

Original report by fabrice salvaire (Bitbucket: fabricesalvaire, GitHub: fabricesalvaire).


Pages starting with È are placed at the end of the index instead of under E.

ThomasWaldmann, Dec 20 '15 15:12

Original comment by RogerHaase (Bitbucket: RogerHaase, GitHub: RogerHaase).


Maybe some useful ideas here:

From Paul Boddie, Today at 3:48 AM:

On Monday 4. September 2017 11.27.42 Lars Kruse wrote:

On Mon, 04 Sep 2017 09:22:01 +0200, Volker Wysk <post@volker-wysk.de> wrote:

I mean, could one locale be fitting for a different one too, as far as sorting is concerned?

As far as I understand locales: no. (if someone knows better: please correct me)

I'm not sure whether I really understand locales better, but here are a few things that might help. Firstly, you can get the default locale as follows:

import locale
locale.setlocale(locale.LC_ALL, "") # returns the locale string

This has to be done to make the process's locale information available. It is possible that something does this already in Moin, but as mentioned before, it is questionable that the process's locale is relevant for a user of a Web application. Now you can get the locale details more conveniently.

For example, to ask for the collation:

language, charset = locale.getlocale(locale.LC_COLLATE)
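The two calls above can be combined into one small helper. This is only a sketch: the function name `current_collation` is made up for illustration, and the fallback for an uninstalled environment locale is an assumption about how one might want to handle that case.

```python
import locale


def current_collation():
    """Adopt the process's default locale, then report its collation setting."""
    try:
        # An empty string means "take the locale from the environment"
        # (LANG, LC_ALL, and so on).
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale that is not installed;
        # keep whatever locale is currently active.
        pass
    # Returns a (language, encoding) tuple; both are None for the "C" locale.
    return locale.getlocale(locale.LC_COLLATE)


language, charset = current_collation()
print(language, charset)
```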

I would think that the collation is the most pertinent locale setting when it comes to sorting things. So, it might be more interesting to set this based on any details about the user provided by Moin. The MoinMoin.user.User object has a language attribute that could work in principle, but I'm not convinced that this is enough by itself. More on that in a moment.

Anyway, you can set the collation as follows:

locale.setlocale(locale.LC_COLLATE, "no_NO") # something I just tested

And you can apply the locale sorting as follows:

names.sort(cmp=locale.strcoll)

This will correctly sort a sequence of names where Norwegian letters are used. It seems that Unicode will work, too.
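Note that `sort(cmp=...)` only exists in Python 2. Under Python 3 the same idea would probably look like the sketch below; the function names are made up for illustration. Both variants sort by whatever collation locale is currently active.

```python
import functools
import locale


def sort_by_strxfrm(names):
    """Sort names using the collation rules of the currently active locale."""
    # strxfrm() transforms each string once into a form whose ordinary
    # comparison gives the same order as strcoll().
    return sorted(names, key=locale.strxfrm)


def sort_by_strcoll(names):
    """Same result, as a direct port of the Python 2 cmp= call."""
    # cmp_to_key() wraps the two-argument comparison so it can be used as a key.
    return sorted(names, key=functools.cmp_to_key(locale.strcoll))
```

The `strxfrm` variant is usually preferable for long lists, because each name is transformed once instead of being compared pairwise through a Python-level wrapper.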

The reason I don't think the Moin language code is enough is that the locale system is rather particular about what you ask it for. However, it seems that you can get a proper locale from the Moin language as follows:

language = request.user.language # will probably work given a request
localename = locale.normalize(language)

For me, this yielded "no_NO.ISO8859-1" from "no".
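As a standalone illustration of that lookup, here is a sketch without a Moin request object. The helper name `locale_name_for` is hypothetical, and the language codes are passed in directly rather than read from `request.user.language`.

```python
import locale


def locale_name_for(language):
    """Expand a short language code such as "no" into a full locale name."""
    # normalize() only consults Python's built-in alias table. It does not
    # check whether the resulting locale is installed on this system.
    return locale.normalize(language)


for code in ("no", "de", "fr"):
    print(code, "->", locale_name_for(code))
```

The result often names a legacy encoding such as ISO8859-1, so it may still need adjusting (for example to a UTF-8 variant) before `setlocale` accepts it on a given system.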

A few problems emerge when using locale support for sorting. Firstly, you need to have the necessary locales installed for the functions to work. Secondly, switching locales affects the entire program, so you have to be careful not to cause side-effects, although this is less of a problem in a plain CGI environment.
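One way to limit the side-effects is to switch the collation locale only for the duration of a sort and restore it afterwards. The sketch below assumes that approach; the names `collation` and `_collate_lock` are made up for illustration.

```python
import contextlib
import locale
import threading

# Hypothetical module-level lock: serializes callers of collation() only.
_collate_lock = threading.RLock()


@contextlib.contextmanager
def collation(locale_name):
    """Temporarily switch LC_COLLATE, restoring the previous value on exit."""
    with _collate_lock:
        # Calling setlocale() without a second argument only queries the setting.
        previous = locale.setlocale(locale.LC_COLLATE)
        try:
            # Raises locale.Error if the requested locale is not installed.
            locale.setlocale(locale.LC_COLLATE, locale_name)
            yield
        finally:
            locale.setlocale(locale.LC_COLLATE, previous)


with collation("C"):
    print(sorted(["b", "a", "C"], key=locale.strxfrm))
```

The lock only protects code that goes through this helper. The locale is still a process-wide setting, so other threads that call `locale.strcoll` while the block is active will see the switched locale.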

Another thing noted earlier is that locales are language specific, so if your list of page names contains both German names and names using non-German characters, the sorting of those other characters may not be as desired. Libraries like ICU might try and reconcile different collations, but it is probably an open-ended problem. Bindings for Python are available here:

https://pypi.python.org/pypi/PyICU/
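For comparison, a sort using PyICU might look like the following sketch. It assumes PyICU is installed and importable as `icu`; the function name `icu_sorted` and the default locale are made up for illustration. Unlike the `locale` module, the collator is an ordinary object, so no process-wide state is changed.

```python
try:
    import icu  # third-party PyICU bindings; may not be installed
except ImportError:
    icu = None


def icu_sorted(names, locale_name="en_US"):
    """Sort names with an ICU collator for the given locale."""
    if icu is None:
        raise RuntimeError("PyICU is not installed")
    collator = icu.Collator.createInstance(icu.Locale(locale_name))
    # getSortKey() returns a byte string whose ordering matches the collation.
    return sorted(names, key=collator.getSortKey)
```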

The documentation for the locale functionality is found here:

https://docs.python.org/2.7/library/locale.html

Paul

ThomasWaldmann, Sep 04 '17 14:09

Setting the locale per request is likely not advisable as the stuff in locale module is not thread-safe.

But maybe we don't need to; see this:

>>> locale.setlocale(locale.LC_ALL, "en_US.UTF-8")
'en_US.UTF-8'

>>> sorted(l, cmp=locale.strcoll)
[u'a', u'\xe0', u'b', u'B']

>>> sorted(l)
[u'B', u'a', u'b', u'\xe0']

As you see, even with an en_US locale, the sorted result (based on locale.strcoll) is far more acceptable than the simple sorted result. The hex character \xe0 is an accented lowercase a (à).

ThomasWaldmann, Sep 06 '18 19:09

@fabricesalvaire what do you think, would that be good enough?

ThomasWaldmann, Sep 06 '18 19:09

Hmm, I tried setting LC_ALL and LANG to en_US.UTF-8, then started moin (with the builtin server).

I tried a modified PagenamesList macro, using sort(cmp=locale.strcoll), but it did not change the sort order in the expected way.

We use flask-babel, maybe there is some interference from that, but I didn't find anything about sorting in babel docs.
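If the locale-based approach keeps misbehaving, a locale-independent fallback is conceivable. The sketch below is a different technique from `locale.strcoll`: it strips combining marks with the standard `unicodedata` module so that È sorts next to E. It is not proper language-aware collation, and the function name `fold_key` is made up for illustration.

```python
import unicodedata


def fold_key(name):
    """Sort key that ignores diacritics and case, with the original as tie-breaker."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), name)


names = ["Zebra", "Èze", "Echo", "apple"]
print(sorted(names, key=fold_key))
```

This treats every accented letter as equal to its base letter, which is wrong for languages where such letters sort separately (Norwegian å, for instance), so it is a pragmatic compromise rather than a replacement for real collation.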

ThomasWaldmann, Sep 13 '18 20:09

https://stackoverflow.com/questions/11121636/sorting-list-of-string-with-specific-locale-in-python

ThomasWaldmann, Sep 13 '18 20:09