lean4
fix: improve fuzzy-matching heuristics
Closes #1546. There were no bugs with the current implementation as far as I could tell, but I noticed three areas that could be improved here:
- Penalties should really only apply to the beginning of consecutive runs, and they shouldn't accumulate across namespaces. For example, with `Lean.AVeryLongNamespace.AnotherVeryLongNamespace.SMap`, every miss would incur an additional penalty, which would overwhelmingly negate the merit of a perfect match with the main identifier and give it no chance to show up in the top 100 results. So, rather than accumulating a penalty for every miss, we apply a penalty once at the start of every consecutive run to indicate how "awkward" a place this is to start a match. This awkwardness factor increases within a namespace but is reset at the beginning of every new namespace. However, we still add a constant penalty for every namespace so that we prefer less deeply nested names.
- Matches in the main identifier (after all of the namespaces) should get a bonus.
- Consecutive runs should be valued a lot more. What I've done is add a bonus for every match that is proportional to the length of the current run, so longer runs score markedly higher than matches that are split up, which I think is better than a constant difference that could more easily be negated. A rough sketch of the combined scoring follows this list.
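To make the above concrete, here is a minimal Lean sketch of the kind of scoring these heuristics suggest, assuming the matched positions have already been computed as a boolean mask over the word. The name `sketchScore`, the mask representation, and all constants are purely illustrative and not the implementation in this PR:

```lean
/-- Illustrative only: score a `word` given a mask `hit` marking which of its
    characters were matched by the pattern. Names and constants are hypothetical. -/
def sketchScore (word : String) (hit : Array Bool) : Int := Id.run do
  let chars := word.toList.toArray
  -- Find where the main identifier starts (after the last `.`).
  let mut mainStart : Nat := 0
  for i in [0:chars.size] do
    if chars[i]! == '.' then
      mainStart := i + 1
  let mut score : Int := 0
  let mut run : Nat := 0      -- length of the current consecutive run of matches
  let mut awkward : Nat := 0  -- how "awkward" a starting point within this namespace is
  for i in [0:chars.size] do
    if chars[i]! == '.' then
      -- constant penalty per namespace (prefer less nested names); awkwardness resets
      score := score - 2
      awkward := 0
      run := 0
    else if hit.getD i false then
      if run == 0 then
        -- penalty only at the *start* of a consecutive run, scaled by awkwardness
        score := score - Int.ofNat awkward
      run := run + 1
      -- bonus proportional to the current run length, so long runs snowball
      score := score + Int.ofNat run
      -- extra bonus for matches in the main identifier
      if i ≥ mainStart then
        score := score + 1
    else
      run := 0
      awkward := awkward + 1
  return score
```

With this shape, `sketchScore "Lean.Meta.whnf" mask` rewards a mask whose hits form one long run inside `whnf` far more than the same number of hits scattered across the namespaces.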
LMK what you think of the above heuristics. After some (very brief) testing it looks like it's working great, but if someone could test-drive this and give me feedback it would be much appreciated!
Thanks for your contribution! Please make sure to follow our Commit Convention.
@rish987 Could you rebase against master for benchmarking? Locally it looks like there is some overhead of up to 25%, unfortunately.
| Position | decls | matches | interpretation of `_eval` |
| --- | --- | --- | --- |
| 13:0 | 160678 | 104155 | 1.46s |
| 14:0 | 160678 | 143691 | 3.18s |
| 15:0 | 160678 | 9890 | 1.05s |
| 16:0 | 160678 | 334 | 546ms |
!bench
Here are the benchmark results for commit 2bbdd624dc9c127acdf46c563d5569790350a7c7. There were significant changes against commit 828aea48f4a4f4a2e09eb422a1fd94236a3d7b07:
| Benchmark | Metric | Change |
| --- | --- | --- |
| workspaceSymbols | branches | 16% (2831.8 σ) |
| workspaceSymbols | instructions | 18% (4354.7 σ) |
| workspaceSymbols | task-clock | 21% (72.0 σ) |
| workspaceSymbols | wall-clock | 21% (71.7 σ) |
On the other hand, this is still fast enough that we can test-drive it and then think about further behavior/performance refinements. The improvements should be worth it.
At first glance I'm a bit puzzled that it's slower, since I actually made the `result` table that we have to populate with scores smaller. There might be some expensive operation somewhere that I've introduced, or maybe it's just the overhead of the new `runLengths` table that we're populating in parallel to `result`. I'll investigate later today.
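For context, the `runLengths` idea is just a second table indexed exactly like `result`, so the DP can look up how long the run ending at a given cell is. A stripped-down, hypothetical sketch of that shape (simplified placeholder names and logic, not the actual diff):

```lean
/-- Hypothetical illustration of filling a run-length table alongside the score
    table in a matching DP. This is a simplified stand-in, not the real matcher. -/
def fillTables (pattern word : String) : Array Int × Array Nat := Id.run do
  let p := pattern.toList.toArray
  let w := word.toList.toArray
  let cols := w.size + 1
  let size := (p.size + 1) * cols
  let mut result : Array Int := Array.mkArray size 0      -- best score per cell
  let mut runLengths : Array Nat := Array.mkArray size 0  -- run ending at this cell
  for i in [1:p.size + 1] do
    for j in [1:w.size + 1] do
      if p[i-1]! == w[j-1]! then
        -- extend the diagonal run and reward proportionally to its length
        let run := runLengths[(i-1) * cols + (j-1)]! + 1
        runLengths := runLengths.set! (i * cols + j) run
        result := result.set! (i * cols + j)
          (result[(i-1) * cols + (j-1)]! + Int.ofNat run)
      else
        -- a miss: carry over the best neighbouring score, run length stays 0
        result := result.set! (i * cols + j)
          (max (result[(i-1) * cols + j]!) (result[i * cols + (j-1)]!))
  return (result, runLengths)
```

Since both arrays are sized and indexed identically, every matching cell now involves an extra read and write, which seems like one plausible source of the overhead.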