hayagriva

CJK sorting is based on Unicode code points

Open quachpas opened this issue 1 year ago • 4 comments

When a CSL style requires author-date sorting (e.g., gb-7714-2015-author-date), CJK characters need to be romanized before sorting. Currently, names are sorted by Unicode code point instead.

let p8 = Person::from_strings(vec!["王", "一"]).unwrap();
let p8r = Person::from_strings(vec!["wang", "yi"]).unwrap();
let p9 = Person::from_strings(vec!["王", "二"]).unwrap();
let p9r = Person::from_strings(vec!["wang", "er"]).unwrap();

// 一 < 二
// yī > èr
// U+4E00 < U+4E8C
assert_eq!(Ordering::Less, p8.csl_cmp(&p9, LongShortForm::Long, false));
assert_eq!(Ordering::Greater, p8r.csl_cmp(&p9r, LongShortForm::Long, false));

// 大 < 安 < 晨 < 白
// dà, ān, chén, bǎi (romanized order would be ān < bǎi < chén < dà)
// U+5927 < U+5B89 < U+6668 < U+767D
let p1 = Person::from_strings(vec!["大", "大"]).unwrap();
let p2 = Person::from_strings(vec!["安", "安"]).unwrap();
let p3 = Person::from_strings(vec!["晨", "晨"]).unwrap();
let p4 = Person::from_strings(vec!["白", "白"]).unwrap();
assert_eq!(Ordering::Less, p1.csl_cmp(&p2, LongShortForm::Long, false));
assert_eq!(Ordering::Less, p2.csl_cmp(&p3, LongShortForm::Long, false));
assert_eq!(Ordering::Less, p3.csl_cmp(&p4, LongShortForm::Long, false));
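For comparison, here is a minimal standalone sketch of what "romanize before sorting" would mean. It does not use hayagriva at all: `romanize_char`, `sort_key`, and `cmp_romanized` are hypothetical helpers, and the lookup table only covers the characters from the snippet above (toneless pinyin). A real implementation would need a full romanization dataset or a proper collation library.

```rust
use std::cmp::Ordering;

/// Hypothetical, deliberately tiny romanization table (pinyin without tones).
/// It only covers the characters used in this issue.
fn romanize_char(c: char) -> Option<&'static str> {
    match c {
        '王' => Some("wang"),
        '一' => Some("yi"),
        '二' => Some("er"),
        '大' => Some("da"),
        '安' => Some("an"),
        '晨' => Some("chen"),
        '白' => Some("bai"),
        _ => None,
    }
}

/// Build a sort key: known characters become their romanization followed by
/// a space; anything else is kept as-is (lowercased).
fn sort_key(name: &str) -> String {
    let mut key = String::new();
    for c in name.chars() {
        match romanize_char(c) {
            Some(syllable) => {
                key.push_str(syllable);
                key.push(' ');
            }
            None => key.extend(c.to_lowercase()),
        }
    }
    key
}

/// Compare two names by their romanized sort keys.
fn cmp_romanized(a: &str, b: &str) -> Ordering {
    sort_key(a).cmp(&sort_key(b))
}

fn main() {
    // Plain string comparison follows code point order: U+4E00 < U+4E8C.
    assert_eq!("王一".cmp("王二"), Ordering::Less);
    // Romanized comparison: "wang yi " > "wang er ".
    assert_eq!(cmp_romanized("王一", "王二"), Ordering::Greater);

    let mut names = vec!["大大", "安安", "晨晨", "白白"];
    names.sort(); // code point order, unchanged from the input
    assert_eq!(names, ["大大", "安安", "晨晨", "白白"]);

    names.sort_by(|a, b| cmp_romanized(a, b));
    assert_eq!(names, ["安安", "白白", "晨晨", "大大"]);
}
```

The two orderings disagree on every pair in the second example, which is the behaviour difference this issue is about.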


Discord thread

EDIT: The same issue probably occurs for other non-Latin scripts.

quachpas, Dec 12 '24 13:12

The CSL spec doesn't seem to enforce a specific standard, although sorting by code point is probably a bad default for non-Latin scripts. An initial search reveals that there are multiple standards for sorting CJK characters (and also for Chinese, Japanese, and Korean characters separately). Romanization is one of them, though I've also seen mentions of sorting based on character form. I wonder if we should add some way to support those different sorting options, or if we could at least settle on an initial solution: change the default to one of them (e.g. romanization) and ensure you can still specify your own order (which can be done through the CSL style).

PgBiel, Feb 01 '25 18:02

I've discussed this with Chinese colleagues, and it seems the common thing to do is to romanize before sorting, so I'd consider the current behaviour a bug. (I could be mistaken, so it's probably best to check with other people.)

I think we can introduce Unicode sorting (and alternative sorting options) later on if there is demand.

quachpas, Feb 02 '25 07:02

This problem can be handled by the Unicode Collation Algorithm, and there should be several implementations in Rust, such as https://github.com/unicode-org/icu4x. The sorting methods for Chinese are defined in https://github.com/unicode-org/cldr/blob/main/common/collation/zh.xml.

zepinglee, Feb 02 '25 17:02

Workaround proposed by r

Prepend rare Unicode code points to the author names, and show them as none.
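The idea behind the workaround can be sketched in plain Rust, independent of hayagriva or Typst. This is only an illustration under assumptions: the original does not say which code points were used, so the sketch picks the BMP Private Use Area (U+E000 to U+F8FF), and `with_sort_prefix` and `strip_sort_prefix` are hypothetical helpers. The prefix encodes the desired rank, so a plain code point sort yields the intended order, and the prefix is stripped again for display.

```rust
/// Hypothetical helper: prefix a name with a Private Use Area character whose
/// code point encodes the desired rank. Returns None if the rank does not fit.
fn with_sort_prefix(rank: u32, name: &str) -> Option<String> {
    // Assumption: use U+E000..=U+F8FF (BMP Private Use Area) as the "rare" range.
    let code = 0xE000u32.checked_add(rank)?;
    if code > 0xF8FF {
        return None;
    }
    let prefix = char::from_u32(code)?;
    Some(format!("{prefix}{name}"))
}

/// Hypothetical helper: remove any leading Private Use Area characters,
/// mimicking "show them as none" at display time.
fn strip_sort_prefix(s: &str) -> &str {
    s.trim_start_matches(|c: char| ('\u{E000}'..='\u{F8FF}').contains(&c))
}

fn main() {
    // Ranks follow the romanized order: an (0), bai (1), chen (2), da (3).
    let mut authors = vec![
        with_sort_prefix(3, "大大").unwrap(),
        with_sort_prefix(0, "安安").unwrap(),
        with_sort_prefix(2, "晨晨").unwrap(),
        with_sort_prefix(1, "白白").unwrap(),
    ];

    // A plain code point sort now compares the rank prefixes first.
    authors.sort();

    let shown: Vec<&str> = authors.iter().map(|a| strip_sort_prefix(a)).collect();
    assert_eq!(shown, ["安安", "白白", "晨晨", "大大"]);
}
```

The obvious drawback is that the ranks have to be assigned and maintained by hand, which is why this is a workaround rather than a fix.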


Besides, #314 might fix this.

YDX-2147483647, Oct 16 '25 11:10