Look into keeping memoized score matrix around between key presses

wincent opened this issue 8 years ago

When we compute the score for a path, we make a memoization matrix (a 2-dimensional array) of size needle-length * haystack-length. These are stack-allocated, so they're cheap and short-lived.
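
As a point of reference, here is roughly what such a per-call matrix looks like in C. This is only a sketch of the idea as described above, not the actual command-t source; `UNSET` and the placeholder return value are stand-ins for the real scoring recurrence:

```c
#include <stddef.h>

#define UNSET -1.0 /* sentinel: cell not computed yet */

/* Sketch only (not the real command-t matcher): a per-call memoization
 * matrix, allocated on the stack as a C99 variable-length array. It is
 * cheap to create but vanishes as soon as the call returns, so nothing
 * survives to the next keypress. */
double score_path(const char *needle, size_t needle_len,
                  const char *haystack, size_t haystack_len)
{
    double memo[needle_len][haystack_len]; /* needle-length x haystack-length */

    for (size_t i = 0; i < needle_len; i++)
        for (size_t j = 0; j < haystack_len; j++)
            memo[i][j] = UNSET;

    /* ...the recursive scoring would fill memo[i][j] on demand here... */

    (void)needle;
    (void)haystack;
    return memo[needle_len - 1][haystack_len - 1]; /* placeholder return */
}
```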

If we could keep the matrix around between key presses, we could do a cheap incremental scoring operation for each new key. For example, say we have a matrix for the path "foo/bar/baz" and the search string "foo":

   | f | o | o | / | b | a | r | / | b | a | z
---|---|---|---|---|---|---|---|---|---|---|---
 f | . | . | . | . | . | . | . | . | . | . | .
---|---|---|---|---|---|---|---|---|---|---|---
 o | . | . | . | . | . | . | . | . | . | . | .
---|---|---|---|---|---|---|---|---|---|---|---
 o | . | . | . | . | . | . | . | . | . | . | .

If we now want to search for "foob", we could reuse the existing matrix and just add one row:

   | f | o | o | / | b | a | r | / | b | a | z
---|---|---|---|---|---|---|---|---|---|---|---
 f | . | . | . | . | . | . | . | . | . | . | .
---|---|---|---|---|---|---|---|---|---|---|---
 o | . | . | . | . | . | . | . | . | . | . | .
---|---|---|---|---|---|---|---|---|---|---|---
 o | . | . | . | . | . | . | . | . | . | . | .
---|---|---|---|---|---|---|---|---|---|---|---
 b | . | . | . | . | . | . | . | . | . | . | .
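
A minimal sketch of what that incremental step could look like, assuming we had such a cached structure per path. Every name here is hypothetical, and the "recurrence" in `fill_row` is a stand-in just to make the sketch self-contained; the real scoring rules are more involved:

```c
#include <stdlib.h>

/* Hypothetical per-path cache that survives between keypresses. */
typedef struct {
    size_t  rows;   /* number of needle characters scored so far */
    size_t  cols;   /* haystack (path) length */
    double *cells;  /* rows x cols, row-major */
} cached_matrix_t;

/* Placeholder recurrence; the real scorer applies very different rules. */
static void fill_row(cached_matrix_t *m, size_t row,
                     const char *needle, const char *haystack)
{
    for (size_t col = 0; col < m->cols; col++) {
        double above = row > 0 ? m->cells[(row - 1) * m->cols + col] : 0.0;
        m->cells[row * m->cols + col] =
            above + (needle[row] == haystack[col] ? 1.0 : 0.0);
    }
}

/* When the user types one more character, only the new row needs to be
 * computed; everything filled in for the previous needle is reused. */
void append_row(cached_matrix_t *m, const char *needle, const char *haystack)
{
    m->cells = realloc(m->cells, (m->rows + 1) * m->cols * sizeof *m->cells);
    fill_row(m, m->rows, needle, haystack);
    m->rows++;
}
```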

In the crude version of this idea, we just keep every matrix around between key presses, which is viable only on small repos. Unfortunately, small repos don't need the optimization because they are already fast enough, and on big repos it's not viable to keep that much in memory: for example, consider 1,000,000 files with an average path length of 100 characters and a search query of 10 characters; that's 1,000,000 × 100 × 10 = 10^9 matrix cells, or a gigabyte of memory even at a single byte per cell.

We can probably cut down on that in a few ways:

  • Don't keep matrices around for non-matching paths (huge wins once the search query grows beyond a certain length, but not so big when the query is short, which is unfortunately precisely when we need the speed boost from incremental scoring, because the search space is still large).
  • Don't keep matrices around for matching paths beyond the display limit (enormous wins, but not guaranteed to produce valid results; for example, typing additional characters may cause an entry that previously did not score within the display limit to move up and displace one of the displayed results).
  • Only do this for repos near some threshold size; e.g. 100,000 files, which is about the smallest size at which this might be worth doing, but not so large that it would take a prohibitively large amount of memory (still a lot, on the order of 100 MB by the same arithmetic as above); with such a narrow window of viability, this approach is probably not worth exploring.
  • It's possible that we don't actually need to keep the full matrix around for each path: maybe we could "flatten" each one down to a single array that records, for each letter in the search string, the maximum cumulative score and the haystack index it corresponds to (see the sketch after this list); that's only O(n) storage with a reasonable constant factor, and we're already keeping the paths around, which is O(n) too. It's not clear whether this approximation would be good enough, or whether it would preclude us from computing higher-scoring matches that we might have found with a different index; likewise, it could prevent us from finding a valid match at all in some cases.
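
To make that last flattening idea a bit more concrete, one possible shape for the per-path record might be the following (hypothetical names; whether this retains enough information to reproduce correct scores is exactly the open question raised above):

```c
#include <stddef.h>

/* Sketch only: instead of the full needle x haystack matrix, keep one entry
 * per needle character: the best cumulative score achieved by matching that
 * character, and the haystack index where it was matched. That is
 * O(needle_len) per path rather than O(needle_len * haystack_len). */
typedef struct {
    double best_score;     /* best cumulative score up to this needle char */
    size_t haystack_index; /* haystack position that produced it */
} flattened_entry_t;

typedef struct {
    size_t            len;       /* needle length when last scored */
    flattened_entry_t entries[]; /* one entry per needle character */
} flattened_matrix_t;
```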

Other gotchas:

  • Maximum score per character is related to needle length (edit: double-check this; it's possible that it's only related to haystack length, which is constant), which means that the contents of the memoization matrix may not be validly reusable in an incremental computation without some kind of transformation which may itself be as expensive as just re-doing the calculation.
  • Dealing with growth of the memoization matrices as we append rows to them either requires us to over-allocate or to spend expensive cycles reallocating; we'd have to explore that balance pretty carefully.
  • In general, dealing with lots of small allocations is likely to be very costly, and we'd probably have to allocate a slab of our own and parcel out small slices of it in a way that was deeply integrated into the scoring algorithm (a minimal sketch follows below).
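
For the slab idea, a bare-bones bump allocator along these lines is the sort of thing that would be needed (sketch only; real code would also have to grow the slab, respect alignment, and so on):

```c
#include <stdlib.h>

/* Sketch only: matrix rows for all paths get carved out of one big block,
 * so extending matrices between keypresses doesn't become millions of tiny
 * malloc/realloc calls. Freeing is wholesale: reset `used` (or discard the
 * slab) when the search is cleared. */
typedef struct {
    char  *base;
    size_t used;
    size_t capacity;
} slab_t;

static slab_t slab_create(size_t capacity)
{
    slab_t s = { malloc(capacity), 0, capacity };
    return s;
}

static void *slab_alloc(slab_t *s, size_t bytes)
{
    if (s->used + bytes > s->capacity)
        return NULL; /* real code would grow or chain another slab */
    void *p = s->base + s->used;
    s->used += bytes;
    return p;
}
```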

In any case, need to think some more about the idea, but wanted to get some thoughts into an issue before I forget them.

wincent commented on Apr 19, 2017

Confirmed, maximum score per character is related to needle length, and haystack length too:

m.max_score_per_char    = (1.0 / m.haystack_len + 1.0 / m.needle_len) / 2;
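
For concreteness, using the earlier example and taking this formula at face value: with haystack "foo/bar/baz" (length 11) and needle "foo" (length 3), max_score_per_char is (1/11 + 1/3) / 2 ≈ 0.212, but after typing "foob" (length 4) it becomes (1/11 + 1/4) / 2 ≈ 0.170. Every cached cell was therefore computed against a per-character cap that no longer applies, and would need rescaling before it could be reused.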

Could revisit that design but it would be a big change.

wincent commented on Apr 19, 2017

Given the big rewrite for v6.0.x, I'm closing all older issues as there is unlikely to be anything significant happening on the 5-x-devel branch from here on[^patches]. Feedback issue for 6.0.x is here:

  • https://github.com/wincent/command-t/issues/393

[^patches]: Patches and PRs would be welcome, but my personal efforts are going to be directed towards main.

wincent commented on Aug 26, 2022