
Improve matching performance.

kevincox opened this issue 8 years ago • 15 comments

Note: This is based on #253 and #258.

This improves matching performance by using a trie to store the paths. Hit-or-miss matching is then done iteratively, allowing the pruning of subtrees and requiring less work to identify matches.

Additionally, this structure usually uses much less memory, as common prefixes of paths are only stored once.
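
To make the idea concrete, here is a rough sketch of the structure and the matching walk (illustrative only: the names are invented, and scoring, case folding, and threading are all omitted):

```c
/* Illustrative sketch (invented names, scoring omitted): paths are stored
 * as a trie keyed on path components, so work done on a shared prefix is
 * done once per subtree rather than once per path, and when the needle is
 * exhausted the whole subtree is accepted as a hit without further work. */
#include <stddef.h>

typedef struct trie_node {
    const char *component;        /* one path component, e.g. "src" */
    struct trie_node **children;  /* NULL-terminated; NULL for a leaf */
    size_t path_count;            /* number of full paths in this subtree */
} trie_node_t;

/* Advance through the needle for each component character that matches in
 * order (plain case-sensitive subsequence matching, for brevity). */
static size_t consume(const char *component, const char *needle, size_t pos)
{
    for (const char *c = component; *c != '\0' && needle[pos] != '\0'; c++) {
        if (*c == needle[pos]) {
            pos++;
        }
    }
    return pos;
}

/* Count matching paths under `node`. */
static size_t count_matches(const trie_node_t *node,
                            const char *needle, size_t pos)
{
    pos = consume(node->component, needle, pos);
    if (needle[pos] == '\0') {
        return node->path_count;  /* hit: accept the subtree wholesale */
    }
    if (node->children == NULL) {
        return 0;                 /* miss: needle never fully consumed */
    }
    size_t total = 0;
    for (trie_node_t **child = node->children; *child != NULL; child++) {
        total += count_matches(*child, needle, pos);
    }
    return total;
}
```

The real patch does more than this, but the shared-prefix traversal and the whole-subtree hit case are where the wins come from.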

Testing on benchmarks and real use cases shows a 2-10x performance improvement in common scenarios. In some edge cases the speedup is effectively unbounded, because huge numbers of paths can be pruned at once; this is particularly common when there are many hidden paths.

One side note: I kept the threading feature in, although in practice it seems to provide only a very small improvement, and only on very large sets (>1M). Future work will be to investigate why there is so little benefit and either fix or remove the threading. (Removing it provides a small performance boost, as some checks can be removed.)

kevincox avatar Jun 18 '17 17:06 kevincox

Can't really review this on my phone but looks pretty epic. The trie makes sense: trade a bit of overhead for its construction in exchange for cheaper search space pruning during matching. Great stuff. Thanks for contributing it.

wincent avatar Jun 18 '17 18:06 wincent

@kevincox: would you be able to rebase this against "master"?

wincent avatar Jun 22 '17 14:06 wincent

If you are interested in merging, I gladly will. However, it is dependent on the C scanner infrastructure, so that will have to be merged first. If that sounds good let me know; I should have some free time this weekend.

kevincox avatar Jun 22 '17 15:06 kevincox

Yeah, definitely interested. I just want to confirm your benchmark results on top of current master before going too far.

wincent avatar Jun 22 '17 16:06 wincent

Well, I'll try to start porting to current master. Also, I realized the reason I wasn't seeing much parallel speedup was that the benchmarks had no limit. With a limit added, the multithreading is quite effective.

kevincox avatar Jun 22 '17 18:06 kevincox

The rebase was simpler than expected. This is now based on master.

kevincox avatar Jun 24 '17 10:06 kevincox

Thanks, @kevincox. So I had a chance to play around with this. The "big" matcher benchmark is indeed a bit faster, but I found in practice that on a large repo (1.5m files) this is significantly slower overall. How large a repo have you been testing this on? I think the approach here shows promise, but it will need to meet and then raise the bar on large repos in order to be useful (the truth is, performance doesn't matter on small repos, because anything is fast enough there).

wincent avatar Jun 27 '17 01:06 wincent

That's odd. I was trying it on a set of ~4M files (my real-world use case) and it was significantly faster. Can you explain exactly what your setup is?

  • What do the files look like? Are they a shallow set or deeply nested?
  • Are you using threads? How many cores does your system have?
    • What is the difference with and without threads?
  • Are you setting a limit on the number of results? How large is it?
  • What OS are you on?
  • What compiler and optimization level are you using?
  • Could you give me concrete numbers? I'm wondering whether both before and after are fast, or both slow.
  • What does your query look like? How much of the set is it actually matching, and how selective is the mask?

kevincox avatar Jun 27 '17 06:06 kevincox

Hmm, it actually appears like there was a regression in the cleanup. I'll look into it.

However, even so, the performance is better than master for me with ~2M paths and a real query. The numbers below are from running the matcher benchmark as it exists in the latest commit on this branch (except that master had to pass the paths differently), on a Linux system with 6 physical (12 logical) cores.

# master
                     best      avg       sd   +/- p   (best)      (avg)      (sd)   +/- p
     pathological   1.13000   1.22750 0.06418  ?     (1.12612)  (1.22599) (0.06424)  ?   
        command-t   0.89000   0.90750 0.03031  ?     (0.88601)  (0.90683) (0.03032)  ?   
chromium (subset)   3.79000   3.80500 0.01118  ?     (1.03332)  (1.07116) (0.04665)  ?   
 chromium (whole)   4.29000   4.37250 0.07361  ?     (0.58969)  (0.64701) (0.05316)  ?   
       big (400k)   6.66000   6.67750 0.01479  ?     (0.90809)  (0.96113) (0.05408)  ?   
             real 385.38000 386.36750 0.70892  ?    (46.41973) (47.30632) (0.54375)  ?   
# faster-filter
                     best      avg       sd    +/- p   (best)      (avg)      (sd)   +/- p
     pathological   1.75000   2.10500  0.27409  ?     (1.74928)  (2.10546) (0.27541)  ?   
        command-t   1.39000   1.49500  0.07921  ?     (1.38612)  (1.49320) (0.07835)  ?   
chromium (subset)   2.83000   3.15750  0.23317  ?     (2.82870)  (3.14835) (0.22487)  ?   
 chromium (whole)   4.22000   4.45000  0.33786  ?     (0.60953)  (0.67107) (0.06295)  ?   
       big (400k)   5.28000   5.73500  0.38817  ?     (0.84286)  (0.93130) (0.09734)  ?   
             real 336.87000 367.88250 30.48526  ?    (42.95834) (46.65727) (3.38012)  ?   

kevincox avatar Jun 27 '17 09:06 kevincox

Sorry for the delay, life gets in the way.

I looked into it, and it appears that the reason the benchmarks look very similar is that this new branch is slightly slower when everything matches. That is because every item still gets scored then sorted, and scoring is expensive. The old method scores right away, so it is doing the minimal amount of work (not counting cache differences); however, it lacks the fast path.

Once your query has any selectivity at all, the new version is significantly faster, as only matching entries are scored. The benchmark.rb currently benchmarks every prefix of the query, so it spends most of its time on the short, non-selective queries.

I suspect that this isn't too useful a measure, because I tend to type a number of characters before I wait for the result. So while the first couple of searches are roughly the same performance, it doesn't matter to me, because I only start using the results once I have typed a number of characters and the result set is significantly reduced.
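
To make the trade-off concrete, the two pipelines differ roughly like this (illustrative stubs with invented names, not the real API):

```c
/* Illustrative stubs (invented names) contrasting the two pipelines. */
#include <stddef.h>

typedef struct { const char *path; double score; } result_t;

double score(const char *path, const char *needle);        /* expensive */
size_t trie_filter(const char *needle, const char **out);  /* cheap pruning */

/* master: score every path immediately, selective query or not. */
void match_all(const char **paths, size_t n, const char *needle,
               result_t *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i].path = paths[i];
        out[i].score = score(paths[i], needle);  /* always runs n times */
    }
}

/* this branch: prune first, then score only the survivors. With a
 * non-selective needle the filter passes almost everything through, so we
 * pay the filtering overhead and still score nearly every path; hence the
 * slight loss when everything matches and the big win once the query is
 * selective. */
void match_filtered(const char *needle, const char **scratch, result_t *out)
{
    size_t n = trie_filter(needle, scratch);
    for (size_t i = 0; i < n; i++) {
        out[i].path = scratch[i];
        out[i].score = score(scratch[i], needle);  /* runs only on matches */
    }
}
```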

Benchmark just using the full query:

# master
                    best    avg      sd   +/- p   (best)    (avg)      (sd)   +/- p
     pathological 0.05000 0.05733 0.00573  ?    (0.05428) (0.05838) (0.00570)  ?   
        command-t 0.07000 0.07467 0.00618  ?    (0.07070) (0.07511) (0.00282)  ?   
chromium (subset) 0.14000 0.15400 0.00712  ?    (0.03793) (0.04515) (0.00383)  ?   
 chromium (whole) 0.30000 0.31333 0.00699  ?    (0.05481) (0.05879) (0.00315)  ?   
       big (400k) 0.44000 0.46200 0.01046  ?    (0.07632) (0.08399) (0.00327)  ?   
# faster-filter
Summary of cpu time and (wall-clock time):
                    best    avg      sd   +/- p   (best)    (avg)      (sd)   +/- p
     pathological 0.10000 0.18800 0.03563  ?    (0.09133) (0.18863) (0.03951)  ?   
        command-t 0.06000 0.06600 0.00490  ?    (0.06305) (0.06872) (0.00324)  ?   
chromium (subset) 0.04000 0.06333 0.00869  ?    (0.04379) (0.05980) (0.00785)  ?   
 chromium (whole) 0.02000 0.03733 0.00772  ?    (0.00724) (0.00854) (0.00077)  ?   
       big (400k) 0.06000 0.06400 0.00490  ?    (0.01074) (0.01278) (0.00100)  ?   

kevincox avatar Aug 06 '17 21:08 kevincox

Interesting. Thanks for the analysis.

The benchmark.rb currently benchmarks every prefix of the query

Yep, this is pretty intentional because it reflects what I think is a fairly common use pattern. (Although I totally take your point about your preference for typing in bursts, and relying on debouncing to prevent intermediate searches from being dispatched.)

A couple of thoughts:

  • I wonder whether it would make sense to have a hybrid strategy that uses the method that is faster for "short non-selective queries", then cuts over to the other strategy (which your benchmarks seem to show is faster for long, selective queries) at some point. Obviously, it would add some complexity, and the pay-off may not sufficiently justify it.
  • You mention that "every item still gets scored then sorted, and scoring is expensive". There is some nuance to that. Scoring is costly, but sorting is too (we got a big speed-up from reducing the sorting workload by switching to a heap-based strategy in 03aedeec18f961a6926f0245, and then another substantial bump when we avoided some unnecessary heap operations in 34b03d5ba47d9). Commit ee97325119c7fe5933e was another big bump because it makes scoring basically free after the first non-match: that is, if you search for "abc" and it doesn't match, then when you add another character and search for "abcd" we don't bother scoring at all, since we know it can't possibly match. That means we can cheaply prune the search space much like you are doing with the trie, at least for this case. The heap idea is sketched below this list.
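
For the curious, the heap trick in miniature (an illustrative sketch, not the plugin's actual code):

```c
/* Sketch of the heap idea (invented names): keep only the `limit`
 * best-scoring entries in a min-heap, so the sort-related work for n
 * candidates is O(n log limit) instead of O(n log n). */
#include <stddef.h>

typedef struct { double score; const char *path; } entry_t;

static void swap(entry_t *a, entry_t *b)
{
    entry_t tmp = *a; *a = *b; *b = tmp;
}

static void sift_up(entry_t *heap, size_t i)
{
    while (i > 0 && heap[(i - 1) / 2].score > heap[i].score) {
        swap(&heap[(i - 1) / 2], &heap[i]);
        i = (i - 1) / 2;
    }
}

static void sift_down(entry_t *heap, size_t size, size_t i)
{
    for (;;) {
        size_t min = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < size && heap[l].score < heap[min].score) min = l;
        if (r < size && heap[r].score < heap[min].score) min = r;
        if (min == i) return;
        swap(&heap[min], &heap[i]);
        i = min;
    }
}

/* Offer one scored candidate. The root is always the worst survivor, so
 * once the heap is full a candidate must beat the root to get in at all. */
static void offer(entry_t *heap, size_t *size, size_t limit, entry_t e)
{
    if (*size < limit) {
        heap[*size] = e;
        sift_up(heap, *size);
        (*size)++;
    } else if (e.score > heap[0].score) {
        heap[0] = e;
        sift_down(heap, limit, 0);
    }
}
```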

Thanks for your work and analysis on this. I'm not going to pull the trigger on this yet, because I do think we need to solve the sluggish feel on large repos when not using g:CommandTInputDebounce before we merge it.

wincent avatar Aug 18 '17 18:08 wincent

The benchmark.rb currently benchmarks every prefix of the query

Yep, this is pretty intentional...

I figured it was, and I knew about it, but it slipped my mind. I think in the future it would be beneficial to have a couple of different benchmarks so that it is easier to identify where the performance changes, but this is probably sufficient.

I wonder whether it would make sense to have a hybrid strategy...

I'm not sure it would be. I very rarely use non-selective queries. Also, I suspect that the less selective your query, the more you are relying on scoring to surface an interesting result. Searching a large corpus with a non-selective query doesn't seem particularly useful otherwise.

There is some low-hanging fruit in this change: make the empty string use only one thread and just read off the first limit items. Currently it spawns all the threads and scans everything. However, since the scoring has a short circuit for the empty needle, it is still fairly fast (~30% of time is in scoring), so I didn't implement it. (It's probably enough to just set sort = false and threads = 1 and this will work.)
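
Something like the following would do (invented names; the point is just to bypass threading, scoring, and sorting entirely for the empty needle):

```c
/* Sketch (invented names): with an empty needle every path matches and all
 * scores are equal, so just hand back the first `limit` paths unscored, on
 * a single thread, with no sort. */
#include <stddef.h>

static size_t match_empty_needle(const char **paths, size_t num_paths,
                                 const char **out, size_t limit)
{
    size_t n = num_paths < limit ? num_paths : limit;
    for (size_t i = 0; i < n; i++) {
        out[i] = paths[i];
    }
    return n;
}
```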

Scoring is costly, but sorting is too.

The first two changes are preserved, but I haven't looked into the partial-results one. It could probably save some cycles in the incremental case, so I'll look into it. It is unclear whether it will be as clear a win here, because we already do some pruning; however, more pruning is unlikely to hurt.

the sluggish feel on large repos when not using g:CommandTInputDebounce

I haven't tried this; I'll play with the setting and see what effect it has.

kevincox avatar Aug 18 '17 19:08 kevincox

I'm not sure it would be. I very rarely use non-selective queries. Also, I suspect that the less selective your query, the more you are relying on scoring to surface an interesting result. Searching a large corpus with a non-selective query doesn't seem particularly useful otherwise.

In order to actually find the needle in the haystack you're generally going to need something pretty "selective", as you say, but that doesn't mean that the performance of "non-selective" queries doesn't matter. The benchmarks don't tell the full story; this is why I commented above "I found in practice that on a large repo (1.5m files) this is significantly slower overall": it felt like treacle, not a subtle effect at all. I didn't need to run the benchmarks to see the regression, and I bet other people working in huge repos would have seen it too. This isn't a true edge case like the "pathological" dataset in the benchmarks; it's the really common starting-to-search case, so it's pretty important that we don't regress it.

[Edit: is → isn't]

wincent avatar Aug 18 '17 20:08 wincent

I found in practice that on a large repo (1.5m files) this is significantly slower overall

I wonder what makes your large-repo test slow, because I tested with 0.5M, 2M, and 4M files and saw only minor differences (<20% for empty queries). If you can figure out why your use case is so slow, it might be possible to optimize it.

I have been using this since before I originally opened the request, first with the 2M repo and later with one cut down to 0.5M, and it feels significantly faster to me in both cases.

it's pretty important that we don't regress it.

For sure. I think it should be possible to make this change with no, or only very minor, regressions in the worst cases.

kevincox avatar Aug 18 '17 21:08 kevincox

I'll add a little checklist for planned improvements:

  • [ ] Cache the results of the previous search. This will allow faster pruning when the new search is strictly more selective. It restricts us to a single concurrent reader, but that is probably fine. (A rough sketch follows this list.)
  • [ ] Separate benchmarks for the empty, non-selective, selective, and extending cases.
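
For the first item, the rough shape I have in mind (speculative, with invented names): if the previous needle is a prefix of the new one, every new match must already be among the old matches, so we only need to rematch within the cached set.

```c
/* Speculative sketch (invented names): remember the previous needle and its
 * matches; if the new needle extends the old one, rematch only within the
 * cached result set instead of walking the whole trie again. */
#include <stddef.h>
#include <string.h>

size_t full_search(const char *needle, const char **out);
size_t search_within(const char **candidates, size_t n,
                     const char *needle, const char **out);

typedef struct {
    char needle[256];
    const char **matches;
    size_t num_matches;
} search_cache_t;

size_t cached_search(search_cache_t *cache, const char *needle,
                     const char **out)
{
    size_t n;
    if (cache->matches != NULL && cache->needle[0] != '\0' &&
        strncmp(needle, cache->needle, strlen(cache->needle)) == 0) {
        /* Strictly more selective query: prune within the cached set. */
        n = search_within(cache->matches, cache->num_matches, needle, out);
    } else {
        n = full_search(needle, out);
    }
    strncpy(cache->needle, needle, sizeof cache->needle - 1);
    cache->needle[sizeof cache->needle - 1] = '\0';
    cache->matches = out;   /* single concurrent reader, as noted above */
    cache->num_matches = n;
    return n;
}
```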

kevincox avatar Aug 18 '17 21:08 kevincox

Given the big rewrite for v6.0.x, older PRs will need to be re-rolled against main and target the new version, so I'm closing this one out. Feedback issue for 6.0.x is here:

  • https://github.com/wincent/command-t/issues/393

wincent avatar Aug 26 '22 21:08 wincent