elasticlunr-rs icon indicating copy to clipboard operation
elasticlunr-rs copied to clipboard

Index created by elasticlunr-rs doesn't work with elasticlunr.js for characters that can't be represented by a single UTF-16 Code Unit

Open Sunshine40 opened this issue 8 months ago • 0 comments

https://github.com/mattico/elasticlunr-rs/blob/29d97e4c8e91bb0d1813716fb2d1575066344d76/src/inverted_index.rs#L40-L42

During index building, elasticlunr-rs iterates over the token &str's content in Unicode Scalar Values.

While the JS library does it in this way:

elasticlunr.InvertedIndex.prototype.addToken = function (token, tokenInfo, root) {
  var root = root || this.root,
      idx = 0;

  while (idx <= token.length - 1) {
    var key = token[idx];

The JS string is actually iterated in UTF-16 Code Units, which are entire characters for English, most alphabetic text, common Chinese characters; but not Emojis and rare Chinese characters.


Related issue with mdBook.

Sunshine40 avatar Jun 05 '24 08:06 Sunshine40