repl
repl copied to clipboard
[BUG] REPL accepts incorrect proofs
I found a peculiar bug in REPL, where it can accept any proof that applies the theorem itself.
Example:
{"cmd": "theorem test (p q : Prop) (hp : p) (hq : q) : p ∧ q ∧ p := by sorry"}
{"sorries":
[{"proofState": 0,
"pos": {"line": 1, "column": 62},
"goal": "p q : Prop\nhp : p\nhq : q\n⊢ p ∧ q ∧ p",
"endPos": {"line": 1, "column": 67}}],
"messages":
[{"severity": "warning",
"pos": {"line": 1, "column": 8},
"endPos": {"line": 1, "column": 12},
"data": "declaration uses 'sorry'"}],
"env": 0}
{"tactic": "apply test", "proofState": 0}
{"proofState": 1,
"goals":
["case hp\np q : Prop\nhp : p\nhq : q\n⊢ p",
"case hq\np q : Prop\nhp : p\nhq : q\n⊢ q"]}
In the above example, we cannot complete the proof by applying itself. However, REPL does not raise any error messages. However, on VS Code IDE for Lean 4, we get the error message for the proof below:
theorem test (p q : Prop) (hp : p) (hq : q) : p ∧ q ∧ p := by
apply test
exact hp
exact hq
Expected Error message:
fail to show termination for
Lean4Proj1.test
with errors
structural recursion cannot be used
Could not find a decreasing measure.
The arguments relate at each recursive call as follows:
(<, ≤, =: relation proved, ? all proofs failed, _: no proof attempted)
hp hq
1) 5:8-12 _ _
Please use `termination_by` to specify a decreasing measure.
Your two examples are not the same. Did you try attempting to complete the proof using exact hp; exact hq in the REPL?
It is expected that calling apply test does not result in an immediate error. This is a valid proof strategy for proofs by induction, and the termination check only runs once the definition or proof is complete.
I forgot to paste the full output
{"cmd": "theorem test (p q : Prop) (hp : p) (hq : q) : p ∧ q ∧ p := by sorry"}
{"sorries":
[{"proofState": 0,
"pos": {"line": 1, "column": 62},
"goal": "p q : Prop\nhp : p\nhq : q\n⊢ p ∧ q ∧ p",
"endPos": {"line": 1, "column": 67}}],
"messages":
[{"severity": "warning",
"pos": {"line": 1, "column": 8},
"endPos": {"line": 1, "column": 12},
"data": "declaration uses 'sorry'"}],
"env": 0}
{"tactic": "apply test", "proofState": 0}
{"proofState": 1,
"goals":
["case hp\np q : Prop\nhp : p\nhq : q\n⊢ p",
"case hq\np q : Prop\nhp : p\nhq : q\n⊢ q"]}
{"tactic": "exact hp", "proofState": 1}
{"proofState": 2, "goals": ["case hq\np q : Prop\nhp : p\nhq : q\n⊢ q"]}
{"tactic": "exact hq", "proofState": 2}
{"proofState": 3, "goals": []}
As you can there is no error message here.
Also NOTE: The VS Code IDE shows the error message as soon as we use apply test tactic.
Also exact hp; exact hq won't work in REPL because it doesn't support tactical (;)
Is this actually a complete proof though? I'm not super familiar with the REPL API but I would expect some kind of qed-equivalent to complete the proof once all the goals are finished. It may well be that this is a (design) bug in the REPL in that it has no way to express errors that occur when entering the definition into the environment.
PS: ; isn't a tactical, it is just a list of tactics. You have to pass them separately with the REPL API since it takes single tactics, or enclose it in parentheses like (exact hp; exact hq).
I just followed the example shown in the README (https://github.com/leanprover-community/repl/blob/master/README.md), I may be wrong, but it doesn't look like there is something like qed from the examples.
At present there is nothing you can do with a completed proof state: we would like to extend this so that you can replace the original sorry with your tactic script, and obtain the resulting Environment.
So apparently QED is not implemented yet. If it was, this is the step that would give the error.
Hi, Happy new year!
So, unfortunately, in this context, it seems that this makes the exact? tactic slightly dangerous to use as it will "prove" anything:
{"cmd": "theorem ex : False := sorry"}
{"sorries":
[{"proofState": 0,
"pos": {"line": 1, "column": 22},
"goal": "⊢ False",
"endPos": {"line": 1, "column": 27}}],
"messages":
[{"severity": "warning",
"pos": {"line": 1, "column": 8},
"endPos": {"line": 1, "column": 10},
"data": "declaration uses 'sorry'"}],
"env": 0}
{"proofState": 0, "tactic": "exact?"}
{"proofState": 1,
"messages":
[{"severity": "info",
"pos": {"line": 0, "column": 0},
"endPos": {"line": 0, "column": 0},
"data": "Try this: exact _root_.ex"}],
"goals": []}
exact? is referring to the theorem itself to prove it as suggested by "Try this: exact _root_.ex".
The exact? case is handled here:
https://github.com/leanprover-community/repl/commit/66515025d9ccffa2be29718d31b816edd0675202
However, it only works if the theorem is initialized in an existing environment, e.g.
{"cmd": "#eval 1 + 1"}
{"cmd": "theorem ex : False := sorry", "env": 0}
{"proofState": 0, "tactic": "exact?"}
exact ex still works here. I guess for recursion it resolves from somewhere else than ctx.env (resolveLocalName it seems?)
~The question is, why does the InfoTree from elaborating the definition have as its root a ContextInfo with an Environment that includes in constants the definition itself, if it is not necessary for recursion and messes with exact?? It becomes an issue when the definition creates e.g. foo and foo.match_1 and you want only foo.match_1 to be in the env of the sorry. As it is currently, https://github.com/leanprover-community/repl/commit/66515025d9ccffa2be29718d31b816edd0675202 ensures that neither foo nor foo.match_1 are in the env.~ #66
Interesting, thanks!
Having the definition in constants is probably useful/necessary for the pickle feature.
I found another example where the REPL accepts an incorrect proof:
amc12a_2002_p12 from minif2f.
import Mathlib
theorem amc12a_2002_p12 (f : ℝ → ℝ) (k : ℝ) (a b : ℕ) (h₀ : ∀ x, f x = x ^ 2 - 63 * x + k)
(h₁ : f a = 0 ∧ f b = 0) (h₂ : a ≠ b) (h₃ : Nat.Prime a ∧ Nat.Prime b) : k = 122 := by
apply (mul_right_inj' (sub_ne_zero.2 ?_)).1
ring_nf at h₁ h₂ ⊢
This can be reproduced with the following REPL commands:
{"cmd": "import Mathlib"}
{"cmd": "theorem amc12a_2002_p12 (f : ℝ → ℝ) (k : ℝ) (a b : ℕ) (h₀ : ∀ x, f x = x ^ 2 - 63 * x + k)\n (h₁ : f a = 0 ∧ f b = 0) (h₂ : a ≠ b) (h₃ : Nat.Prime a ∧ Nat.Prime b) : k = 122 := by sorry", "env": 0}
{"tactic": "apply (mul_right_inj' (sub_ne_zero.2 ?_)).1", "proofState": 0}
{"tactic": "ring_nf at h₁ h₂ ⊢", "proofState": 1}
I've tested on PR #63, but it still fails to catch the invalid proofs.
I found another example where the REPL accepts an incorrect proof:
amc12a_2002_p12 from minif2f.
import Mathlib theorem amc12a_2002_p12 (f : ℝ → ℝ) (k : ℝ) (a b : ℕ) (h₀ : ∀ x, f x = x ^ 2 - 63 * x + k) (h₁ : f a = 0 ∧ f b = 0) (h₂ : a ≠ b) (h₃ : Nat.Prime a ∧ Nat.Prime b) : k = 122 := by apply (mul_right_inj' (sub_ne_zero.2 ?_)).1 ring_nf at h₁ h₂ ⊢This can be reproduced with the following REPL commands:
{"cmd": "import Mathlib"} {"cmd": "theorem amc12a_2002_p12 (f : ℝ → ℝ) (k : ℝ) (a b : ℕ) (h₀ : ∀ x, f x = x ^ 2 - 63 * x + k)\n (h₁ : f a = 0 ∧ f b = 0) (h₂ : a ≠ b) (h₃ : Nat.Prime a ∧ Nat.Prime b) : k = 122 := by sorry", "env": 0} {"tactic": "apply (mul_right_inj' (sub_ne_zero.2 ?_)).1", "proofState": 0} {"tactic": "ring_nf at h₁ h₂ ⊢", "proofState": 1}I've tested on PR #63, but it still fails to catch the invalid proofs.
Hello, I have encountered the same issue. Have you resolved it now?
You can use our tool: https://github.com/trishullab/itp-interface
It is based on REPL but doesn't accept wrong proofs. It also allows parallel proof execution, handling memory issues which might arise when you use too many tactics in your proof while using REPL, and a lot of other fixes. It is in Python and can run parallely on ray clusters.
You can use our tool: https://github.com/trishullab/itp-interface
It is based on REPL but doesn't accept wrong proofs. It also allows parallel proof execution, handling memory issues which might arise when you use too many tactics in your proof while using REPL, and a lot of other fixes. It is in Python and can run parallely on ray clusters.
Hi, @amit9oct , I run the repl at https://github.com/amit9oct/repl with lean4 version v4.16.0. The commads is: {"cmd": "import Mathlib"}
{"cmd": "theorem amc12a_2002_p12 (f : ℝ → ℝ) (k : ℝ) (a b : ℕ) (h₀ : ∀ x, f x = x ^ 2 - 63 * x + k)\n (h₁ : f a = 0 ∧ f b = 0) (h₂ : a ≠ b) (h₃ : Nat.Prime a ∧ Nat.Prime b) : k = 122 := by sorry", "env": 0}
{"tactic": "apply (mul_right_inj' (sub_ne_zero.2 ?_)).1", "proofState": 0}
{"tactic": "ring_nf at h₁ h₂ ⊢", "proofState": 1}
but it still accept wrong proofs.
No use the python interface and not the native REPL to run your code. See the README https://github.com/trishullab/itp-interface (we have also tested our code on miniF2F and it works). The key is to not use tactic mode at all
Our python code uses REPL but only for barebones Lean 4 support, we don't use pickling or tactic mode which might be prone to bugs
To summarize, there are four bug cases in tactic mode.
Accept False Proof
1. Application Type Mismatch
import Mathlib
theorem demo (x : ℕ) : ∃x, ∃ g : (∀ n : ℕ, ¬Nat.Prime x), ¬x ≤ x := by
use x
have g6 : 2 % 2 < 2 := by
refine Nat.mod_lt 2 ?h
simp
2. Placeholder Synthesis
theorem amc12a_2002_p12 (f : ℝ → ℝ) (k : ℝ) (a b : ℕ)
(h₀ : ∀ x, f x = x ^ 2 - 63 * x + k)
(h₁ : f a = 0 ∧ f b = 0) (h₂ : a ≠ b)
(h₃ : Nat.Prime a ∧ Nat.Prime b) : k = 122 := by
apply (mul_right_inj' (sub_ne_zero.2 ?_)).1
ring_nf at h₁ h₂ ⊢
3. Self-Application
theorem imo2024_p2 :
{(a, b) | 0 < a ∧ 0 < b ∧
∃ g N, 0 < g ∧ 0 < N ∧
∀ n ≥ N, Nat.gcd (a ^ n + b) (b ^ n + a) = g}
= {(1, 1)} := by
ext n
symm
rw [imo2024_p2]
Reject Valid Proof
4. Set Option Failures
import Mathlib
set_option maxHeartbeats 0
set_option maxRecDepth 1000000
theorem mathd_numbertheory_314 (r n : ℕ)
(h₀ : r = 1342 % 13) (h₁ : 0 < n)
(h₂ : 1342 ∣ n) (h₃ : n % 13 < r) :
6710 ≤ n := by
contrapose! h₃
interval_cases n <;> simp_all (config := {decide := true})
Here are some minimal examples of the "Accept False Proof" bugs reported by @RexWzh, together with errors shown in Lean. All of these proofs pass as valid in REPL's tactic mode.
example : 1 = 0 := by
cases 1
rfl
apply ?succ
Lean error:
(kernel) declaration has metavariables '_example'
example : 1 = 0 := by
cases 1
rfl
have: true := by
apply ?succ
trivial
Lean error:
application type mismatch
let_fun this := of_decide_eq_true (id (Eq.refl true));
of_decide_eq_true (id (Eq.refl true))
argument has type
true = true → true = true
but function has type
(∀ (x : true = true), (fun this ↦ n✝ + 1 = 0) x) → (fun this ↦ n✝ + 1 = 0) ⋯
example : 1 = 0 := by
cases' exists_ne 2 with n h
simpa [] using h _
Lean error:
don't know how to synthesize placeholder
context:
n : ℕ
h : n ≠ 2
⊢ n = 2
theorem self_application : 1 = 0 := by
rw [self_application]
Lean error:
fail to show termination for
self_application
with errors
failed to infer structural recursion:
no parameters suitable for structural recursion
well-founded recursion cannot be used, 'self_application' does not take any (non-fixed) arguments
Here are some minimal examples of the "Accept False Proof" bugs reported by @RexWzh, together with errors shown in Lean. All of these proofs pass as valid in REPL's tactic mode
@Kripner Consider using our tool: https://github.com/trishullab/itp-interface
We recently introduced enforce done mode that ensures that even with bugs in interaction framework a wrong proof doesn't get accepted.
Tests corresponding to the various minimal cases mentioned in @Kripner message have been added to #63. So far, it seems that only the following self-application example is not caught:
theorem ex : False := by
exact?
If you find other issues, feel free to share them here. Any help is welcome if you have any idea on how to solve this remaining case.
@amit9oct itp-interface uses REPL's command mode to avoid false-positives, which is resource-intensive in terms of memory usage. It is necessary to fix potential bugs in tactic mode.
Consider this example from MiniF2F:
theorem amc12a_2003_p1 (u v : ℕ → ℕ) (h₀ : ∀ n, u n = 2 _n + 2) (h₁ : ∀ n, v n = 2_ n + 1) :
((∑ k in Finset.range 2003, u k) - ∑ k in Finset.range 2003, v k) = 2003 := by
simp only [h₀, h₁, Finset.sum_range_succ, Finset.sum_range_zero]
rfl
Command Mode: It consumes about 170GB of memory and takes 1,236 seconds to run.
{"cmd": "import Mathlib\nset_option maxHeartbeats 0\nset_option maxRecDepth 100000"}
{"cmd": "theorem amc12a_2003_p1 (u v : ℕ → ℕ) (h₀ : ∀ n, u n = 2 _n + 2) (h₁ : ∀ n, v n = 2_ n + 1) :\n ((∑ k in Finset.range 2003, u k) - ∑ k in Finset.range 2003, v k) = 2003 := by", "env": 0}
{"cmd": "theorem amc12a_2003_p1 (u v : ℕ → ℕ) (h₀ : ∀ n, u n = 2 _n + 2) (h₁ : ∀ n, v n = 2_ n + 1) :\n ((∑ k in Finset.range 2003, u k) - ∑ k in Finset.range 2003, v k) = 2003 := by\nsimp only [h₀, h₁, Finset.sum_range_succ, Finset.sum_range_zero]", "env": 0}
{"cmd": "theorem amc12a_2003_p1 (u v : ℕ → ℕ) (h₀ : ∀ n, u n = 2 * n + 2) (h₁ : ∀ n, v n = 2 * n + 1) :\n ((∑ k in Finset.range 2003, u k) - ∑ k in Finset.range 2003, v k) = 2003 := by\nsimp only [h₀, h₁, Finset.sum_range_succ, Finset.sum_range_zero]\nrfl", "env": 0}
Tactic Mode:
However, the same problem is solved in seconds using tactic mode.
{"cmd": "import Mathlib\nset_option maxHeartbeats 0\nset_option maxRecDepth 100000"}
{"cmd": "theorem amc12a_2003_p1 (u v : ℕ → ℕ) (h₀ : ∀ n, u n = 2 _n + 2) (h₁ : ∀ n, v n = 2_ n + 1) :\n ((∑ k in Finset.range 2003, u k) - ∑ k in Finset.range 2003, v k) = 2003 := by sorry", "env": 0}
{"tactic": "simp only [h₀, h₁, Finset.sum_range_succ, Finset.sum_range_zero]", "proofState": 0}
{"tactic": "rfl", "proofState": 1}
@RexWzh this example is interesting, and #63 incorrectly rejects it during kernel check. It seems related to the other problem you mentioned earlier in this discussion about set options that are not taken into account in tactic mode.
I decided to try your problem in Lean on live.lean-lang.org. It turns out that the kernel check is what takes time, hence the 1,236s in the command mode. In tactic mode, the kernel check is almost instant in this case because it doesn't seem to take into account set_option maxHeartbeats 0\nset_option maxRecDepth 100000.
For the "set options" part, it appears that maxHeartbeats is not working as expected, while maxRecDepth works correctly. You can check the behavior in this example.
As for "Placeholder Synthesis," Pantograph handles this well with their approach:
When Lean LSP says “don’t know how to synthesize placeholder”, this indicates the human operator needs to manually move the cursor to the placeholder and type in the correct expression. This error therefore should not halt the proof process, and the placeholder should be turned into a goal.
This approach makes more sense, though I haven't workout how to port this functionality to REPL.
@amit9oct
itp-interfaceuses REPL's command mode to avoid false-positives, which is resource-intensive in terms of memory usage. It is necessary to fix potential bugs in tactic mode.
@RexWzh Once you start using tactic mode for longer the memory will bloat eventually. In fact, my first attempt of making a python tool based on REPL was using tactic mode. The high memory is because the way REPL keeps clones of env to enable backtracking and has nothing to do with itp-interface. In fact, in ITP interface you can reduce the memory bloat by choosing to reduce the number of steps a REPL process will be kept alive in itp-interface. There are other things built in which allow you to parallely run proof search. There is a slight cost to pay for getting ray actor which I think is bearable. However, since proof search and training happen at scale, I don't think that is a big cost to pay. REPL as a light weight scalable JRPC is a good starting point, but it might be difficult to bake in all complexity which comes in when you start writing proofs, specially knowing when the proof finishes. IMO, use of local machines like laptops to do proof search is not a very pressing usecase (even for mathematicians). Someone rather run something powerful on a server and get the proofs. Also itp-interface loads all definitions and imports from the file till the point of the theorem you are interested and it does it once, so the comparison you are making is not apples to apples (because you can potentially run multiple theorem on the same instance). Your comparison is between vanilla cmd and tactic mode, not itp-interface and equivalent wrapper written in tactic mode.
However, by no means the bugs should not be fixed. I never suggested bugs should not be fixed (not sure where you got that from). In fact, I was the first one who noted these problems years ago and reported them (even opened this issue). Based on my understanding, individual tactics with so many possible semantics involved (thanks to the beautiful Lean language), I don't think the fix is easy. I'm just suggesting people an alternative which can be helpful meanwhile this gets fixed (trying to save some time for people working on these problems in the community).
Sorry for my previous phrasing. I presented the example merely to illustrate the importance of tactic mode.
ITP-interface is an excellent project that I've been exploring for a while. Thank you for providing detailed explanations whenever I had questions about the project.
Command mode typically operates in O(N) time, while tactic mode operates in O(1), making tactic mode potentially crucial for longer proofs. While command mode does offer advantages in terms of parallel processing capabilities, both modes have their distinct benefits.
Command mode typically operates in O(N) time, while tactic mode operates in O(1), making tactic mode potentially crucial for longer proofs. While command mode does offer advantages in terms of parallel processing capabilities, both modes have their distinct benefits.
I completely agree with you, there can be usecases where it is helpful (specially if someone is operating with a low compute budget).
I think it is also important to keep in mind that Lean REPL is a general tool. While most of us use it in an AI4Math context, there are other use cases such as lean4_jupyter. In such projects, people are probably interested in having an efficient tool capable of running on laptops.
@RexWzh this example is interesting, and #63 incorrectly rejects it during kernel check. It seems related to the other problem you mentioned earlier in this discussion about set options that are not taken into account in tactic mode. I decided to try your problem in Lean on live.lean-lang.org. It turns out that the kernel check is what takes time, hence the 1,236s in the command mode. In tactic mode, the kernel check is almost instant in this case because it doesn't seem to take into account
set_option maxHeartbeats 0\nset_option maxRecDepth 100000.
@augustepoiroux Actually, I don't think the lean web can run this proof, as it requires nearly 170G of memory! 😆 While this isn't a good proof example, it serves as an interesting illustration.
You can observe this behavior by running lake env lean test.lean in your shell. The memory usage grows significantly:
I'm particularly curious about why meta-programming approaches generally tend to be more efficient and consume less memory.
@RexWzh Just pushed a new PR #82 which should fix the set_config issues. Your miniF2F example makes the memory explodes just like in command mode.
Just found a false negative proof of tactic mode:
example : true := by
let p : ∃ x : Nat, x > 0 := ⟨1, Nat.one_pos⟩
let ⟨n, hn⟩ := p
rfl
Commands to reproduce:
{"cmd": "example : true := by sorry"}
{"tactic": "let p : ∃ x : Nat, x > 0 := ⟨1, Nat.one_pos⟩", "proofState": 0}
{"tactic": "let ⟨n, hn⟩ := p", "proofState": 1}
Error message: "auxiliary declaration cannot be created when declaration name is not available"
@RexWzh Which commit did you use for this example? On the master branch, when I try your example:
{"cmd": "example : true := by sorry"}
{"tactic": "let p : ∃ x : Nat, x > 0 := ⟨1, Nat.one_pos⟩", "proofState": 0}
{"tactic": "let ⟨n, hn⟩ := p", "proofState": 1}
{"tactic": "rfl", "proofState": 2}
I get a correct output:
{"sorries":
[{"proofState": 0,
"pos": {"line": 1, "column": 21},
"goal": "⊢ true = true",
"endPos": {"line": 1, "column": 26}}],
"messages":
[{"severity": "warning",
"pos": {"line": 1, "column": 0},
"endPos": {"line": 1, "column": 7},
"data": "declaration uses 'sorry'"}],
"env": 0}
{"proofStatus": "Incomplete: open goals remain",
"proofState": 1,
"goals": ["p : ∃ x, x > 0 := Exists.intro 1 Nat.one_pos\n⊢ true = true"]}
{"proofStatus": "Incomplete: open goals remain",
"proofState": 2,
"goals":
["p : ∃ x, x > 0 := Exists.intro 1 Nat.one_pos\nn : Nat\nhn : n > 0\n⊢ true = true"]}
{"proofStatus": "Completed", "proofState": 3, "goals": []}