App icon indicating copy to clipboard operation
App copied to clipboard

[$250] Search for emojis using a suffix tree instead of a trie to get smarter matches

Open roryabraham opened this issue 2 years ago β€’ 32 comments

If you haven’t already, check out our contributing guidelines for onboarding and email [email protected] to request to join our Slack channel!


Action Performed:

Start typing :smile

Expected Result:

The soon-to-be-implemented emoji suggestions UI should produce not only results that begin with smile, like smile and smiley, but also things that contain smile as a substring, such as sweat_smile.

Actual Result:

Only emoji that begin with smile will come up in the search results.

Workaround:

n/a

Platform:

All

View all open jobs on GitHub

roryabraham avatar Oct 27 '22 15:10 roryabraham

Triggered auto assignment to @adelekennedy (External), see https://stackoverflow.com/c/expensify/questions/8582 for more details.

melvin-bot[bot] avatar Oct 27 '22 15:10 melvin-bot[bot]

cc @stitesExpensify in case you want to be co-assigned on this one as well

roryabraham avatar Oct 27 '22 15:10 roryabraham

Triggered auto assignment to Contributor-plus team member for initial proposal review - @rushatgabhane (External)

melvin-bot[bot] avatar Oct 27 '22 15:10 melvin-bot[bot]

Current assignee @roryabraham is eligible for the External assigner, not assigning anyone new.

melvin-bot[bot] avatar Oct 27 '22 15:10 melvin-bot[bot]

This is hinted at in the issue title, but I think that a suffix tree will be the appropriate data structure to use for this search, since the emoji dataset is so large (and will likely continue to grow over time)

roryabraham avatar Oct 27 '22 15:10 roryabraham

Also, note that this does not necessarily have to be on HOLD for https://github.com/Expensify/App/issues/12188, because even though we don't have the UI yet, we should be able to verify the correct behavior with unit tests, which will be a hard requirement for any PR to implement this issue.

roryabraham avatar Oct 27 '22 15:10 roryabraham

@adelekennedy I think $500 is an appropriate starting value for this issue

roryabraham avatar Oct 27 '22 15:10 roryabraham

Not sure how to implement a suffix tree, but there are a few really good and fast fuzzy search libs out there. One that comes to mind is fuse.js

const list = [
  {
    "text": "smile",
    "emoji": "πŸ™‚"
  },
  {
    "text": "smiley",
    "emoji": "😁"
  },
  {
    "text": "sweat_smile",
    "emoji": "πŸ˜…"
  },
  {
    "text": "cheese_wedge",
    "emoji": "πŸ§€"
  },
]

const options = {
  includeScore: true,
  shouldSort: true,
  includeMatches: true,
  findAllMatches: true,
  threshold: 0.6,
  distance: 100,
  keys: ["text"]
};

const fuse = new Fuse(list, options);

// Change the pattern
const pattern = "smile"

return fuse.search(pattern)

Result:

[
  {
    "item": {
      "text": "smile",
      "emoji": "πŸ™‚"
    },
    "refIndex": 0,
    "matches": [
      {
        "indices": [
          [
            0,
            4
          ]
        ],
        "value": "smile",
        "key": "text"
      }
    ],
    "score": 2.220446049250313e-16
  },
  {
    "item": {
      "text": "smiley",
      "emoji": "😁"
    },
    "refIndex": 1,
    "matches": [
      {
        "indices": [
          [
            0,
            4
          ]
        ],
        "value": "smiley",
        "key": "text"
      }
    ],
    "score": 0.001
  },
  {
    "item": {
      "text": "sweat_smile",
      "emoji": "πŸ˜…"
    },
    "refIndex": 2,
    "matches": [
      {
        "indices": [
          [
            0,
            0
          ],
          [
            2,
            2
          ],
          [
            6,
            10
          ]
        ],
        "value": "sweat_smile",
        "key": "text"
      }
    ],
    "score": 0.06
  }
]

xyclos avatar Oct 29 '22 05:10 xyclos

there are a few really good and fast fuzzy search libs out there. One that comes to mind is fuse.js

i wouldn't call Fuse fast or good, exactly. but i've done a pretty broad comparison of fuzzy match libs in the course of building uFuzzy.

https://github.com/leeoniya/uFuzzy#performance

leeoniya avatar Oct 30 '22 14:10 leeoniya

@xyclos quick thoughts: I don't think filtering a list of emojis is a good usecase for an approximate string match algorithm.

We're searching on the keywords array. And fuse.js uses the bittap algorithm, which computes similarity between two strings using levenshtein distance (source).

This makes the result set have 522 results when entering smile (for eg: smirk is also in the result set because it has a levenshtein dist of 2 from smile).

Imo, 522 results would confuse the user. So something more predictable and exact would be nice. What do you think? I'm curious about your thoughts

rushatgabhane avatar Oct 30 '22 14:10 rushatgabhane

@leeoniya 1,030 ms for uFuzzy vs 35,600 ms for fuse.js

That's pretty impressive! ✨

image

rushatgabhane avatar Oct 30 '22 14:10 rushatgabhane

I agree with @rushatgabhane about not using a fuzzy search. For reference the Trie which is currently implemented finds emojis in 1-6ms.

I think a suffix tree as mentioned in the OP (which is very similar to a trie) is the correct data structure, and we should stick to that.

stitesExpensify avatar Oct 31 '22 19:10 stitesExpensify

:+1: excited to see what you folks come up with! i'd be interested to add a Ukkonen suffix tree implementation to the bench table as well. btw, uFuzzy is not based on string distance metrics, since that typically produces garbage results!

out of curiosity i loaded up all the emoji keywords from https://github.com/Expensify/App/blob/main/assets/emojis.js and got 0.3ms matches, or 0.7ms for out-of-order partial terms. this is with 0 mem overhead, since there's no index.

image

leeoniya avatar Oct 31 '22 22:10 leeoniya

@leeoniya those results are crazy fast! If you want to create a proposal, that sounds very promising!

I am a bit confused how there is 0 memory overhead since we would still have to create the data structure right? I'm also curious how long it takes to create that structure for our use case (seems like it should be fast using Ukkonen's algo).

stitesExpensify avatar Oct 31 '22 23:10 stitesExpensify

the data structure is just the haystack string[]. if you dont have this array in mem already and have to concat/extract the keywords arrays, there would be some minimal additional allocs up front, but uFuzzy itself doesnt do anything to your haystack once it's made. the lib is regexp based and just processes that haystack in a linear/dumb fashion. pretty wild how fast it turned out! 🀯

EDIT: to be clear, there is obviously some overhead during searching but it's the lowest i've measured of any lib (about 20mb when searching 162k phrases, that haystack is an additional/persistent 8mb in mem but it's not an extra 8mb cause you at minimum need those strings in memory regardless)

leeoniya avatar Oct 31 '22 23:10 leeoniya

@roryabraham we already have this feature because we are inserting the emoji name and its keywords into the Trie

i.e when inserting this emoji

    {
        code: 'πŸ˜€',
        keywords: [
            'face',
            'grin',
        ],
        name: 'grinning',
    },

we aren't inserting a single emoji {name:'grinning', {code: 'πŸ˜€'}}

actually we are inserting something like this

{name:'grinning', {code: 'πŸ˜€'}}
{name:'face', {code: '', suggestions: [name:'grinning', {code: 'πŸ˜€'}]}
{name:'grin', {code: '', suggestions: [name:'grinning', {code: 'πŸ˜€'}]}

so when typing :fa in the composer we will see the grinning emoji

https://user-images.githubusercontent.com/108357004/199466664-d8475d43-95f9-4351-9fc4-389b739dfd9b.mov

Karim-30 avatar Nov 02 '22 10:11 Karim-30

@adelekennedy I'm going to snag this one from you if you don't mind, as I'd like to help with the broader tracking issue.

JmillsExpensify avatar Nov 03 '22 03:11 JmillsExpensify

i grabbed your current Trie implementation to bench it against single terms: https://github.com/leeoniya/uFuzzy/tree/emojis-trie

i think @stitesExpensify's statement that your current code takes 1-6ms was likely measuring more than just the core search algorithm.

  • the trie construction takes ~20ms (vs 0.5ms to boot uFuzzy)
  • the trie retained ram use is ~2.9mb (vs 1.9mb for uFuzzy)
  • the trie bench time is ~30ms (vs 110ms for uFuzzy)

so the trie is definitely faster than a regex match, at the cost of more mem and more index build time, as expected! when the overall numbers for a single match are this low in both cases (sub 1ms), the perf difference can probably be disregarded. no one will notice a 20ms boot time either. this may change significantly for much larger datasets since the trie size and init will grow more than linearly with dataset size.

the big difference of course is match quality. the trie only handles exact prefix matches and cannot handle multiple/combined partial terms, out of order matching, with minor misspellings. for that you need something more sophisticated. maybe that's not needed for this case, and that's a perfectly valid choice as well :+1:

EDIT: to be fair, you can actually implement out-of-order and multi-term matches by running several searches over the trie (one for each term/prefix in the needle) and intersecting those results.

you can run the bench locally on that branch via these urls after running http-server (from npm) in the repo root dir:

http://localhost:8080/demos/compare.html?libs=uFuzzy&lists=emojis&bench http://localhost:8080/demos/compare.html?libs=EmojisTrie&haystack=emojis&bench

to see the matches displayed, just remove &bench

leeoniya avatar Nov 03 '22 03:11 leeoniya

@leeoniya Am I correct that a fuzzy search will get slower with larger datasets (which will certainly happen when we allow users to add custom emojis), whereas a suffix tree would not see any impact due to the search time being linear to the size of the search string rather than the size of the dataset?

I also wanted to mention that the suffix tree solution we recommended in the OP is both faster and takes less space than the trie we currently have implemented. You are correct that it still takes linear time to create though, so when the dataset grows that will be a bit slower.

stitesExpensify avatar Nov 03 '22 15:11 stitesExpensify

As to your point about match quality it does seem like a fuzzy search could be the way to go (and similar to what slack does).
2022-11-03_09-42-21

stitesExpensify avatar Nov 03 '22 15:11 stitesExpensify

cc @quinthar in case you're interested, since I believe you wrote (or helped write?) our concierge fuzzy search.

stitesExpensify avatar Nov 03 '22 15:11 stitesExpensify

I think fuzzy search makes sense, so typing things with small typos still gives you the match you're looking for, like @stitesExpensify last example

cead22 avatar Nov 03 '22 15:11 cead22

I'm less concerned about initialization time than search time, because that happens only once, on app initialization, and our current benchmarks for that are quite fast (~50ms). If if ever became a problem, we could also lazy-load the EmojiTrie so as not to block any UI except the emoji selector on its initialization.

roryabraham avatar Nov 03 '22 16:11 roryabraham

quick thoughts: I don't think filtering a list of emojis is a good usecase for an approximate string match algorithm.

We're searching on the keywords array. And fuse.js uses the bittap algorithm, which computes similarity between two strings using levenshtein distance (source).

This makes the result set have 522 results when entering smile (for eg: smirk is also in the result set because it has a levenshtein dist of 2 from smile).

Imo, 522 results would confuse the user. So something more predictable and exact would be nice. What do you think? I'm curious about your thoughts

If we're deciding whether or not to use fuzzy search, we should give counter arguments to this comment first.

rushatgabhane avatar Nov 03 '22 17:11 rushatgabhane

@leeoniya Am I correct that a fuzzy search will get slower with larger datasets (which will certainly happen when we allow users to add custom emojis), whereas a suffix tree would not see any impact due to the search time being linear to the size of the search string rather than the size of the dataset?

yeah, i would expect a trie or suffix tree to have better runtime scaling than fuzzy search, though by how much depends on how the fuzzy search is implemented (e.g. levenshtein in JS will scale worse than a regex that delegates to a C or asm backend).

If we're deciding whether or not to use fuzzy search, we should give counter arguments to https://github.com/Expensify/App/issues/12189#issuecomment-1296272527 first.

tl;dr; fuzzy !== levenshtein. relying on distance metrics alone for fuzzy search is guaranteed to produce poor results though it can do okay for single words with a distance of 1 or 2. there are many better ways to do it.

I'm less concerned about initialization time than search time, because that happens only once, on app initialization, and our current benchmarks for that are quite fast (~50ms). If if ever became a problem, we could also lazy-load the EmojiTrie so as not to block any UI except the emoji selector on its initialization.

tried with an emoji count of ~20k and get ~3.4ms per search, with out-of-order partial terms. but yes, a couple passes over a suffix tree would be faster than this. maybe in the range of 1-2ms?. not sure how much that matters in practice vs the other trade-offs.

image

leeoniya avatar Nov 03 '22 18:11 leeoniya

tl;dr; fuzzy !== levenshtein. relying on distance metrics alone for fuzzy search is guaranteed to produce poor results though it can do okay for single words with a distance of 1 or 2. there are many better ways to do it

ooo im interested in how it works. does it look for *s*m*i*l*e* ?

rushatgabhane avatar Nov 07 '22 05:11 rushatgabhane

Still working through the ideal solution. @stitesExpensify for clarity, I'd imagine this is still not a priority until we close out the LHN API refactors?

JmillsExpensify avatar Nov 08 '22 19:11 JmillsExpensify

That's correct. I'll get back to this ASAP

stitesExpensify avatar Nov 08 '22 19:11 stitesExpensify

Was chatting a bit with @JmillsExpensify and I think it might be a good idea to put this on HOLD so we can have a broader discussion about search features. I know emojis are the case-study here, but it's probably wise to take a step back here, evaluate our current and future needs, then design a performant, hopefully future-proof solution for search that will scale to meet our needs beyond just emojis.

roryabraham avatar Nov 11 '22 01:11 roryabraham

@stitesExpensify This is probably perfect timing for you, now that the LHN API refactor is wrapped up. There isn't any immediate urgency for emoji search, so it'd be good to make sure where ever we land approximates the use cases we'll need across chats, requests, card transactions, attachments and similar biz/consumer flows.

JmillsExpensify avatar Nov 11 '22 02:11 JmillsExpensify