perf: [kernel] switch defeq order in infer_app
(This issue was discovered while debugging lean4lean.) The kernel `is_def_eq` function is almost commutative in its arguments, so the argument order usually makes little difference, but it has one non-commutative case:
https://github.com/leanprover/lean4/blob/fb0d0245db9bcb9ff649794f57522c52cb80c993/stage0/src/kernel/type_checker.cpp#L1006-L1013
This handles defeq problems of the form `t =?= true` but not `true =?= s`, with the intent that it will be used for proofs by reflection, in which a definition is executed in the kernel by proving it is equal to `true`.
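The shape of that fast path is roughly the following (a minimal model written in Lean against the `Meta` API, in the spirit of lean4lean; the name `fastDefEqTrue` is hypothetical, and this is a sketch of the behavior, not the actual kernel C++):

```lean
import Lean
open Lean Meta

/-- Sketch of the fast path described above: when the right-hand side is
    literally `Bool.true`, reduce the left-hand side to weak head normal
    form and compare. Note that it does *not* handle the mirrored
    problem `true =?= s`. -/
def fastDefEqTrue (t s : Expr) : MetaM Bool := do
  if s.isConstOf ``Bool.true then
    -- `t =?= true`: evaluate `t` and check that it reduces to `true`
    return (← whnf t).isConstOf ``Bool.true
  -- `true =?= s` (and everything else) falls through to the general procedure
  return false
```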
But the way this is normally written, including in the case of `simp_arith`, is that you have a theorem of the form `f x = true → stuff` and apply it to the argument `Eq.refl true`, so that typechecking the application forces this defeq check. But then `f_type` is `f x = true → stuff` and `a_type` is `true = true`, so we actually end up with the defeq goal `(true = true) =?= (f x = true)`, which does not trigger the fast path. Here we resolve the issue by changing the argument order of the `is_def_eq` check to `(f x = true) =?= (true = true)` instead, which also avoids an unnecessary commutation, since the `f` argument is on the left.
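A minimal concrete instance of this pattern uses core's `of_decide_eq_true : decide p = true → p` (the lemma shape behind `decide`-style proofs; `simp_arith` uses its own lemmas, but the shape is the same):

```lean
-- The final argument has type `true = true`, while the expected domain is
-- `decide (2 + 2 = 4) = true`, so the kernel faces the defeq problem
-- `(true = true) =?= (decide (2 + 2 = 4) = true)` described above.
example : 2 + 2 = 4 := of_decide_eq_true (Eq.refl true)
```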
An alternative fix would be to add the symmetric version of the check, although that seems like a bit of a performance issue, since the branch is almost never taken and yet every single defeq check would involve this test of equality to `true` on both sides.
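For concreteness, the symmetric version would look roughly like this (again a hypothetical sketch in the style of the model above, not the actual C++):

```lean
import Lean
open Lean Meta

/-- Hypothetical symmetric variant: handles both `t =?= true` and
    `true =?= s`. The second test runs on every defeq problem, even
    though that branch is almost never taken. -/
def fastDefEqTrueSymm (t s : Expr) : MetaM Bool := do
  if s.isConstOf ``Bool.true then
    return (← whnf t).isConstOf ``Bool.true
  if t.isConstOf ``Bool.true then
    return (← whnf s).isConstOf ``Bool.true
  return false
```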
Can someone !bench this?
!bench
Here are the benchmark results for commit 68e032402076a2d1ea897018668d2ec8ef52e2c9. There were no significant changes against commit 3e79ddda27c299a0e66fc996d52fa15fcf4421d8.
Also, `!bench` for the control group (removing the optimization), since I suspect it is not getting hit in the first place, at least in the benchmark suite.
!bench
Here are the benchmark results for commit 89dce3e2aff118683c89ef948cfc1d7b454ac276. There were no significant changes against commit 3e79ddda27c299a0e66fc996d52fa15fcf4421d8.
- ✅ Mathlib branch lean-pr-testing-2704 has successfully built against this PR. (2023-10-17 07:35:36) View Log
- ✅ Mathlib branch lean-pr-testing-2704 has successfully built against this PR. (2023-10-17 09:07:12) View Log
- 💥 Mathlib branch lean-pr-testing-2704 build failed against this PR. (2023-10-24 23:37:16) View Log
@semorrison Do you know if it is possible to run a benchmark comparison on mathlib? (I would like to have the above two benchmarks on mathlib: the defeq switch and the control group.)
@digama0 can we just have two different PRs for the two things you want to test? Then we will have two different toolchains, and we can make two separate Mathlib PRs to benchmark.
@semorrison Okay, I've removed the "control group" test from this PR and moved it to #2748.
@semorrison What is the status of this PR? I am surprised it breaks Mathlib. I have another PR on `infer_app`, but I want to merge this one first.
I'm pretty sure the breakage is spurious; a lot of PRs are "breaking mathlib" because the bot merges master into the release PR, without compensating for this with the bump PRs in nightly-testing. IIUC @semorrison is aware of this problem, but the number of simultaneous breaking changes in flight is making it difficult to keep everything working.
Yes, given we had a prior success against Mathlib, I think we're safe in terms of regressions at Mathlib. There are some ideas for improving the Mathlib CI at #2785.
I'm not sure whether we need further benchmarking against Mathlib, however -- @digama0, can you remind me if there is more to do here?
I think the most helpful test for this and other kernel `perf:` PRs is `lean4checker --fresh Mathlib`, measured in wall-clock time / instruction count / other perf metrics. Unfortunately I don't think we have this set up as a benchmark in velcom, but a manual measurement of any of the above would definitely help.
There is indeed some "unfinished business" in the form of a request from me to have this PR and #2748 compared against the baseline, but the Mathlib breakage is getting in the way of the test, so I still don't have any numbers on whether this difference is measurable at all.
> I think the most helpful test for this and other kernel `perf:` PRs is `lean4checker --fresh Mathlib`, measured in wall-clock time / instruction count / other perf metrics. Unfortunately I don't think we have this set up as a benchmark in velcom, but a manual measurement of any of the above would definitely help.
Big +1
@digama0, adding `lean4checker --fresh Mathlib` is probably too expensive (20 minutes!) for Mathlib benchmarking. It seems that running it without `--fresh` would be just as good for this purpose, no? (Even that would be pretty expensive.)
Yes, it would also work without `--fresh`, although that adds multithreading and contention effects, so it may make the benchmarks noisier. We could also consider testing only selected portions of Mathlib (and other projects!), taking advantage of the fact that lean4checker / lean4lean can check specified modules without rechecking all of their dependencies when run without `--fresh`.
@digama0 I think this is an important change. It would be great to include a benchmark that exposes the problem. Could you please add it?
@leodemoura I'll do my best, but as indicated above I don't even have evidence that the optimization does anything at all. That is, even in the best case, when the optimization triggers, it's not clear to me that it will be much faster than without it, or what kind of example would exaggerate the difference between the optimized and regular code paths.
I originally noticed this issue by logging the kernel behavior and observing that the optimization code path is not triggered even in the cases it was written for, i.e. "white box testing". I did not notice any particular slowdown as a result (I was seeing lots of other unrelated slowdowns at the time related to bugs in lean4lean, which is what I was trying to debug in the first place).
I have managed to confirm that on master, the last two examples trigger the fast path, and on this PR the first two trigger the fast path, which is as intended:
```lean
noncomputable def bla (n : Nat) : Bool := true -- insert complicated function here
theorem test (x) (_ : bla x = true) : True := sorry
theorem test2 (x) (_ : true = bla x) : True := sorry
-- the first two trigger the fast path on this PR:
kdef : True := test 7000 (trustme Eq.refl true)
kdef : True := test2 7000 (trustme Eq.refl true)
-- the last two trigger the fast path on master:
kdef : True := test 7000 (trustme Eq.refl (bla 7000))
kdef : True := test2 7000 (trustme Eq.refl (bla 7000))
```
What I have not been able to confirm is that the "fast path" is in fact faster. Instead, I modified the code so that the fast path triggers a failure, and observed the failure in the specified cases; performance-wise, however, I have not been able to see any difference, i.e. #2748 is perf-neutral. It is definitely doing different things, so it is possible that an example exists where the performance differs, but I think in order to really see the difference we need a better fast path than just calling `whnf(t)`, which is basically what we end up doing anyway.