fuzzysearch
fuzzysearch copied to clipboard
Support unicode fuzzy search
I've just took a look at the source code and I saw str.charCodeAt
is used instead of str.codePointAt
. So fuzzy searching unicode characters (multibytes characters for example) is probably not supported by this algorithm.
seems to work https://runkit.com/meltuhamy/5b1faf06fba94100126a5bbb
It would become significantly slower with codePointAt
while in it's current form (with charCodeAt
) it'll work in most cases but can report false positives when the text contains UTF-16 surrogate pairs to encode unicode characters/codepoints larger than 0xFFFF.
Example that reports false positive:
needle = "𠀨";
haystack = "𠀧𡀨";
// returns true
fuzzysearch(needle, haystack)
Explanation of the above code:
The haystack
string consists of the 0x20027
and 0x21028
unicode codepoints while needle
contains only the 0x20028
codepoint. These 3 codepoints in UTF-16 are encoded to the following 16-bit values (surrogate pairs, because all 3 codepoints are larger than 0xFFFF):
codepoint 0x20027 == utf16: uint16[ 0xD840, 0xDC27 ]
codepoint 0x20028 == utf16: uint16[ 0xD840, 0xDC28 ]
codepoint 0x21028 == utf16: uint16[ 0xD844, 0xDC28 ]
haystack == utf16: uint16[ 0xD840, 0xDC27, 0xD844, 0xDC28 ]
needle == utf16: uint16[ 0xD840, 0xDC28 ]
How to fix this: (without sacrificing performance)
In UTF16 when a unicode codepoint (an actual unicode character) is encoded using two uint16 values (utf16 surrogate pair) then the first uint16 is always in the inclusive range [0xD800-0xDBFF]
(high surrogate) while the second uint16 (in case of valid utf-16) is always in the inclusive range [0xDC00-0DFFF]
(low surrogate). These ranges are reserved for surrogate pairs and none of the values in those ranges are used alone as unicode characters/codepoints.
If a charcode in needle
is in the range [0xD800-0xDBFF]
then this charcode and the charcode that follows it have to be treated as as a single unicode character so they have to be matched together as a pair and there can't be a gap between the two matching charcodes in haystack
either.
EDIT: A fixed version would look like this:
function fuzzysearch (needle, haystack) {
var hlen = haystack.length;
var nlen = needle.length;
if (nlen > hlen) {
return false;
}
if (nlen === hlen) {
return needle === haystack;
}
outer: for (var i = 0, j = 0; i < nlen; i++) {
var nch = needle.charCodeAt(i);
// handling a utf-16 surrogate pair
if (nch >= 0xD800 && nch <= 0xDBFF) {
if (++i >= nlen || j >= hlen) {
return false
}
var nch2 = needle.charCodeAt(i);
var hch = haystack.charCodeAt(j++);
while (j < hlen) {
var hch2 = haystack.charCodeAt(j++);
if (hch === nch && hch2 === nch2) {
continue outer;
}
hch = hch2;
}
return false
}
// no utf-16 surrogate pair
while (j < hlen) {
if (haystack.charCodeAt(j++) === nch) {
continue outer;
}
}
return false;
}
return true;
}