Accented characters sorting problem

Open s-k-y-l-i opened this issue 8 years ago • 18 comments

Now: ABCDEFGHIJKLMNOPQRSTUVWXYZÁÖÜ... It should be: AÁBCDEÉFGHIÍJKLMNOÓÖŐPQRSTUÚÜŰVWXYZ

s-k-y-l-i avatar Jun 15 '16 19:06 s-k-y-l-i

This has to do with the way UTF-8 works. :tired_face: For the basic characters UTF-8 is identical to ASCII (7 bits), so those sort in ASCII order; only when the first bit is 1 does something special happen. In that case the first byte is a header that shows how many bytes long the Unicode character is, so [1110xxxx] means that you get 3 bytes that together form a number. Because all accented (non-ASCII) characters have higher byte values than the ASCII symbols, they end up after them when sorting. I'm afraid this isn't really a Performous-specific bug, more a general consequence of comparing strings byte by byte. Anyway, this guy can explain UTF-8 better than me: https://www.youtube.com/watch?v=MijmeoH9LT4
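
A minimal Python sketch of the effect described above. The helper names and sample strings are illustrative assumptions, not Performous code. It prints the UTF-8 bytes of a few characters and shows that comparing raw bytes puts an accented name after every ASCII one.

```python
# Illustrative only: shows why byte-wise comparison of UTF-8 puts
# accented letters after all ASCII letters.

def utf8_bytes(text: str) -> list[str]:
    """Return the UTF-8 encoding of text as a list of 8-bit binary strings."""
    return [format(byte, "08b") for byte in text.encode("utf-8")]


def bytewise_sorted(names: list[str]) -> list[str]:
    """Sort names by their raw UTF-8 bytes, as a naive comparison would."""
    return sorted(names, key=lambda name: name.encode("utf-8"))


if __name__ == "__main__":
    print(utf8_bytes("A"))  # one byte, high bit is 0
    print(utf8_bytes("Á"))  # two bytes, lead byte starts with 110
    print(utf8_bytes("€"))  # three bytes, lead byte starts with 1110
    print(bytewise_sorted(["Zebra", "Álom", "Alma"]))
```

Every byte of a multi-byte sequence has its high bit set, so it compares greater than any ASCII byte, which is why "Álom" lands after "Zebra" here.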

nieknooijens avatar Jun 15 '16 19:06 nieknooijens

Still, note that there are conversion tables in Unicode from accented to unaccented characters, so the sorting could be performed using a hidden simplified name.

See e.g. https://pypi.python.org/pypi/Unidecode (for Python).

Incidentally, Spanish would benefit from this too.
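
As a rough illustration of the "hidden simplified name" idea, the sketch below uses Python's standard `unicodedata` module instead of Unidecode. The helper names `simplified` and `sort_key` and the sample artist names are assumptions made for this example. It decomposes each character (NFD), drops the combining marks, and sorts on the stripped, case-folded form with the original string as a tie-breaker.

```python
import unicodedata


def simplified(text: str) -> str:
    """Strip combining accents, so 'Á' becomes 'A' and 'ő' becomes 'o'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sort_key(text: str) -> tuple[str, str]:
    """Hidden key: accent-free, case-folded name first, original as tie-breaker."""
    return (simplified(text).casefold(), text)


if __name__ == "__main__":
    artists = ["Boulevard des airs", "Áoulevard des airs", "Aoulevard des airs"]
    print(sorted(artists, key=sort_key))
```

Characters without a canonical decomposition (for example "ø") pass through unchanged, so this only approximates what a full transliteration table would do.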

mosteo avatar Jun 15 '16 19:06 mosteo

Yeah... but our lazy byte-wise sorting doesn't take that into account 😅😅

nieknooijens avatar Jun 16 '16 07:06 nieknooijens

So what will we do then?

Baklap4 avatar Jan 22 '17 15:01 Baklap4

Is it worth the effort to include a Unicode conversion table?

nieknooijens avatar Jan 22 '17 16:01 nieknooijens

A lot of countries use them. Would be nice to have :)

Baklap4 avatar Jan 22 '17 16:01 Baklap4

Is this still an issue with current master (using ICU for text processing now)? @Lord-Kamina

Baklap4 avatar Mar 19 '18 22:03 Baklap4

ICU should actually fix this, so we should test it and then close this issue.

nieknooijens avatar Mar 20 '18 07:03 nieknooijens

I know it has the option to include accents in sorting but I don't think it does it by default. So we should test.

Also, I don't think we're currently using ICU in sorting.

Lord-Kamina avatar Mar 20 '18 10:03 Lord-Kamina

I just confirmed that sorting is not working, as explained above.

I have 3 songs with artist:

  • Aoulevard des airs
  • Boulevard des airs
  • Áoulevard des airs

When sorting by artist, they are sorted in the order listed above. Searching on the accented characters works, though (#162).

Baklap4 avatar Mar 29 '18 19:03 Baklap4

I'll look into how I can fix it later.

Lord-Kamina avatar Mar 29 '18 19:03 Lord-Kamina

I don't think this was ever looked into? @Lord-Kamina

Baklap4 avatar Apr 09 '20 19:04 Baklap4

Nope, but ICU has a collator that can be set according to different language rules. I can't look it up now, though, because I'm at work.

Lord-Kamina avatar Apr 09 '20 19:04 Lord-Kamina

So the basic idea is that you sort by keys that are plain ASCII and formed by a collate function out of the original UTF-8, so that they sort correctly using bytewise comparisons (e.g. using std::map with std::string collate keys). All very easy so far, and probably already implemented (for case-insensitive ordering in English). A bit of a problem is that all this should follow the locale that Performous is running in, because the order is different in different languages (e.g. in Finnish Ä and Ö come after Z and are not equal to A and O).
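
To make the locale dependence concrete, here is a toy Python sketch. It is not ICU and not Performous code: the alphabet tables, the `collation_key` helper, and the sample words are assumptions for illustration only. Each letter is mapped to its rank in the chosen language's alphabet, so the resulting byte keys can be compared directly.

```python
# Toy alphabets, assumed for illustration; real collation rules are richer
# (digraphs, accent and case levels, and so on).
ALPHABETS = {
    "hu": "aábcdeéfghiíjklmnoóöőpqrstuúüűvwxyz",
    "fi": "abcdefghijklmnopqrstuvwxyzåäö",
}


def collation_key(text: str, lang: str) -> bytes:
    """Build a byte key whose bytewise order follows the language's alphabet.

    Letters get rank 1..N; anything outside the table gets rank 0,
    so spaces, digits and punctuation sort before letters.
    """
    alphabet = ALPHABETS[lang]
    return bytes(alphabet.find(ch) + 1 for ch in text.casefold())


if __name__ == "__main__":
    words = ["Zorro", "Öljy", "Omena"]
    for lang in ("hu", "fi"):
        print(lang, sorted(words, key=lambda w: collation_key(w, lang)))
```

With the Hungarian table "Öljy" sorts between "Omena" and "Zorro", while with the Finnish table it sorts after "Zorro". A real implementation would presumably delegate this to a collation library such as ICU rather than hand-written tables.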

Tronic avatar Apr 10 '20 09:04 Tronic

I'll have to check the documentation, but I'm fairly sure it's already implemented.

Lord-Kamina avatar Apr 10 '20 12:04 Lord-Kamina

http://userguide.icu-project.org/collation

ICU has the API, we just need to implement it.

IMO, it should be part of #524

Lord-Kamina avatar May 17 '20 07:05 Lord-Kamina

I wouldn't include it in #524, since it's a completely new feature (toggling language).

Baklap4 avatar May 18 '20 11:05 Baklap4

> I wouldn't include it in #524 since it's a completely new feature, toggling language

I meant because, IMO at least, we should sort as per the language Performous is currently running in.

Lord-Kamina avatar May 18 '20 13:05 Lord-Kamina