
fix: improve fuzzy-matching heuristics

Open · rish987 opened this pull request 3 years ago · 4 comments

Closes #1546. There were no bugs with the current implementation as far as I could tell, but I noticed three areas that could be improved here:

  1. Penalties should really only apply at the beginning of consecutive runs, and they shouldn't accumulate across namespaces. For example, with Lean.AVeryLongNamespace.AnotherVeryLongNamespace.SMap, every miss would incur an additional penalty, which would overwhelmingly negate the merit of a perfect match with the main identifier and give it no chance to show up in the top 100 results. So, rather than accumulating penalties for every miss, we apply a penalty once at the start of every consecutive run, indicating how "awkward" a place this is to start a match. This awkwardness factor increases within a namespace but is reset at the beginning of every new namespace. However, we still add a constant penalty for every namespace so that less nested names are preferred.
  2. Matches in the main identifier (after all of the namespaces) should get a bonus.
  3. Consecutive runs should be valued a lot more. What I've done is give every match a bonus proportional to the length of the current run, so the total bonus for a run grows quadratically with its length rather than linearly. Longer runs therefore score much higher than matches that are split up, which I think is better than a constant difference that could more easily be negated. (A toy sketch of the combined scoring follows this list.)
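
To make the interplay of these heuristics concrete, here is a toy scorer in Lean. It is a minimal sketch: the run-list representation, the names (scoreRuns, nsPenalty, mainBonus, compStartBefore), and all constants are invented for illustration and are not the actual Lean.FuzzyMatching code from this PR.

```lean
-- Illustrative constants; the real implementation tunes its weights differently.
def nsPenalty : Int := 1  -- flat penalty per namespace component
def mainBonus : Int := 2  -- extra bonus for matches inside the main identifier

/-- Last index `i < n` with `chars[i] = '.'`, plus one; `0` if there is no dot.
    Used both for the main-identifier boundary and for namespace components. -/
def compStartBefore (chars : Array Char) (n : Nat) : Nat :=
  (((List.range n).filter fun i => chars[i]! == '.').getLast?.map (· + 1)).getD 0

/-- Scores a word against a list of matched runs, each given as
    `(start index, length)`. A run is a maximal stretch of consecutively
    matched characters. -/
def scoreRuns (word : String) (runs : List (Nat × Nat)) : Int := Id.run do
  let chars := word.toList.toArray
  let mainStart := chars.size |> compStartBefore chars  -- start of main identifier
  let mut score : Int := 0
  -- constant penalty per namespace, so less nested names are preferred
  score := score - nsPenalty * Int.ofNat (chars.filter (· == '.')).size
  for (start, len) in runs do
    -- heuristic 1: one penalty per run, sized by "awkwardness" = distance from
    -- the start of the current namespace component (resets after every '.')
    score := score - Int.ofNat (start - compStartBefore chars start)
    for k in List.range len do
      -- heuristic 3: per-character bonus proportional to the run length so far
      let mut bonus : Int := Int.ofNat (k + 1)
      -- heuristic 2: extra bonus for characters inside the main identifier
      if start + k ≥ mainStart then
        bonus := bonus + mainBonus
      score := score + bonus
  return score

-- A perfect match of the main identifier only pays the flat namespace penalty:
#eval scoreRuns "Lean.AVeryLongNamespace.SMap" [(24, 4)]  -- (3+4+5+6) - 2 = 16
```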

LMK what you think of the above heuristics. After some (very brief) testing it looks like it's working great, but if someone could test-drive this and give me feedback it would be much appreciated!

rish987 · Oct 09 '22 14:10

Thanks for your contribution! Please make sure to follow our Commit Convention.

github-actions[bot] · Oct 09 '22 14:10

@rish987 Could you rebase against master for benchmarking? Locally it looks like there is some overhead of up to 25%, unfortunately.

13:0: 
160678 decls
104155 matches

13:0: 
interpretation of _eval took 1.46s

14:0: 
160678 decls
143691 matches

14:0: 
interpretation of _eval took 3.18s

15:0: 
160678 decls
9890 matches

15:0: 
interpretation of _eval took 1.05s

16:0: 
160678 decls
334 matches

16:0: 
interpretation of _eval took 546ms

Kha · Oct 13 '22 19:10
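
For orientation: the 13:0:-style prefixes above appear to be the source positions of the individual #eval queries, and each query reports its match count out of 160678 declarations plus the interpreter time. The sketch below is a hypothetical reconstruction of such a query, not the actual script (which isn't shown in the thread); it assumes the existing Lean.FuzzyMatching.fuzzyMatch entry point taking a pattern and a word (optional arguments elided), and "smap" is a placeholder pattern.

```lean
import Lean
open Lean Lean.FuzzyMatching

-- Hypothetical reconstruction: count how many declarations in the current
-- environment fuzzy-match a placeholder pattern, mirroring the
-- "<n> decls" / "<m> matches" lines in the logs above.
#eval show CoreM (Nat × Nat) from do
  let decls := (← getEnv).constants.toList
  let hits  := decls.filter fun (name, _) => fuzzyMatch "smap" name.toString
  return (decls.length, hits.length)
```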

!bench

Kha · Oct 13 '22 19:10

Here are the benchmark results for commit 2bbdd624dc9c127acdf46c563d5569790350a7c7. There were significant changes against commit 828aea48f4a4f4a2e09eb422a1fd94236a3d7b07:

  Benchmark          Metric         Change
  ===================================================
- workspaceSymbols   branches          16% (2831.8 σ)
- workspaceSymbols   instructions      18% (4354.7 σ)
- workspaceSymbols   task-clock        21%   (72.0 σ)
- workspaceSymbols   wall-clock        21%   (71.7 σ)

leanprover-bot · Oct 13 '22 20:10

On the other hand, this is still fast enough that we can test-drive it and then think about further behavior/performance refinements. The improvements should be worth it.

Kha · Oct 14 '22 14:10

At first glance I'm a bit puzzled that it's slower, since I actually made the result table that we populate with scores smaller. So there might be some expensive operation somewhere that I've introduced, or maybe it's just the overhead of the new runLengths table that we populate in parallel with result. I'll investigate later today.

rish987 · Oct 14 '22 14:10
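
For readers following along: the result and runLengths tables mentioned here refer to the matcher's dynamic-programming formulation over (pattern index, word index) cells. The skeleton below is an invented illustration of that data layout and of how a parallel run-length table can feed the run-proportional bonus from the PR description; the names (MatchState, step) and scoring details are hypothetical, not the PR's actual code.

```lean
-- Hypothetical sketch of the two parallel DP tables being discussed:
-- `result` holds the best score for each (pattern index, word index) cell,
-- and `runLengths` holds the length of the consecutive run ending there.
structure MatchState where
  result     : Array (Array (Option Int))  -- best score ending at (i, j)
  runLengths : Array (Array Nat)           -- consecutive-run length at (i, j)

def MatchState.init (patternLen wordLen : Nat) : MatchState where
  result     := Array.mkArray patternLen (Array.mkArray wordLen none)
  runLengths := Array.mkArray patternLen (Array.mkArray wordLen 0)

/-- One cell update (sketch): when the characters at pattern index `i` and
    word index `j` match, extend the diagonal run and award a bonus that
    grows with the run length. -/
def MatchState.step (s : MatchState) (i j : Nat) (charsMatch : Bool) : MatchState :=
  if charsMatch then
    let prevRun   := if i > 0 ∧ j > 0 then s.runLengths[i-1]![j-1]! else 0
    let prevScore := if i > 0 ∧ j > 0 then (s.result[i-1]![j-1]!).getD 0 else 0
    let run := prevRun + 1
    { s with
      result     := s.result.modify i (·.set! j (some (prevScore + Int.ofNat run)))
      runLengths := s.runLengths.modify i (·.set! j run) }
  else
    s
```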