fuzzysearch icon indicating copy to clipboard operation
fuzzysearch copied to clipboard

Support unicode fuzzy search

Open lewispham opened this issue 7 years ago • 2 comments

I've just took a look at the source code and I saw str.charCodeAt is used instead of str.codePointAt. So fuzzy searching unicode characters (multibytes characters for example) is probably not supported by this algorithm.

lewispham avatar Jan 17 '18 05:01 lewispham

seems to work https://runkit.com/meltuhamy/5b1faf06fba94100126a5bbb

meltuhamy avatar Jun 12 '18 11:06 meltuhamy

It would become significantly slower with codePointAt while in it's current form (with charCodeAt) it'll work in most cases but can report false positives when the text contains UTF-16 surrogate pairs to encode unicode characters/codepoints larger than 0xFFFF.

Example that reports false positive:

needle = "𠀨";
haystack = "𠀧𡀨";

// returns true
fuzzysearch(needle, haystack)

Explanation of the above code:

The haystack string consists of the 0x20027 and 0x21028 unicode codepoints while needle contains only the 0x20028 codepoint. These 3 codepoints in UTF-16 are encoded to the following 16-bit values (surrogate pairs, because all 3 codepoints are larger than 0xFFFF):

codepoint 0x20027 == utf16: uint16[ 0xD840, 0xDC27 ]
codepoint 0x20028 == utf16: uint16[ 0xD840, 0xDC28 ]
codepoint 0x21028 == utf16: uint16[ 0xD844, 0xDC28 ]

haystack == utf16: uint16[ 0xD840, 0xDC27, 0xD844, 0xDC28 ]
needle == utf16: uint16[ 0xD840, 0xDC28 ] 

How to fix this: (without sacrificing performance)

In UTF16 when a unicode codepoint (an actual unicode character) is encoded using two uint16 values (utf16 surrogate pair) then the first uint16 is always in the inclusive range [0xD800-0xDBFF] (high surrogate) while the second uint16 (in case of valid utf-16) is always in the inclusive range [0xDC00-0DFFF] (low surrogate). These ranges are reserved for surrogate pairs and none of the values in those ranges are used alone as unicode characters/codepoints.

If a charcode in needle is in the range [0xD800-0xDBFF] then this charcode and the charcode that follows it have to be treated as as a single unicode character so they have to be matched together as a pair and there can't be a gap between the two matching charcodes in haystack either.

EDIT: A fixed version would look like this:

function fuzzysearch (needle, haystack) {
    var hlen = haystack.length;
    var nlen = needle.length;
    if (nlen > hlen) {
        return false;
    }
    if (nlen === hlen) {
        return needle === haystack;
    }
    outer: for (var i = 0, j = 0; i < nlen; i++) {
        var nch = needle.charCodeAt(i);

        // handling a utf-16 surrogate pair
        if (nch >= 0xD800 && nch <= 0xDBFF) {
            if (++i >= nlen || j >= hlen) {
                return false
            }
            var nch2 = needle.charCodeAt(i);
            var hch = haystack.charCodeAt(j++);
            while (j < hlen) {
                var hch2 = haystack.charCodeAt(j++);
                if (hch === nch && hch2 === nch2) {
                    continue outer;
                }
                hch = hch2;
            }
            return false
        }

        // no utf-16 surrogate pair
        while (j < hlen) {
            if (haystack.charCodeAt(j++) === nch) {
                continue outer;
            }
        }
        return false;
    }
    return true;
}

pasztorpisti avatar May 27 '19 02:05 pasztorpisti