repl [BUG] REPL accepts incorrect proofs

trafficstars

I found a peculiar bug in REPL, where it can accept any proof that applies the theorem itself.

Example:

{"cmd": "theorem test (p q : Prop) (hp : p) (hq : q) : p ∧ q ∧ p := by sorry"}

{"sorries":
 [{"proofState": 0,
   "pos": {"line": 1, "column": 62},
   "goal": "p q : Prop\nhp : p\nhq : q\n⊢ p ∧ q ∧ p",
   "endPos": {"line": 1, "column": 67}}],
 "messages":
 [{"severity": "warning",
   "pos": {"line": 1, "column": 8},
   "endPos": {"line": 1, "column": 12},
   "data": "declaration uses 'sorry'"}],
 "env": 0}

{"tactic": "apply test", "proofState": 0}

{"proofState": 1,
 "goals":
 ["case hp\np q : Prop\nhp : p\nhq : q\n⊢ p",
  "case hq\np q : Prop\nhp : p\nhq : q\n⊢ q"]}

In the above example, we cannot complete the proof by applying itself. However, REPL does not raise any error messages. However, on VS Code IDE for Lean 4, we get the error message for the proof below:

theorem test (p q : Prop) (hp : p) (hq : q) : p ∧ q ∧ p := by
    apply test
    exact hp
    exact hq

Expected Error message:

fail to show termination for
  Lean4Proj1.test
with errors
structural recursion cannot be used

Could not find a decreasing measure.
The arguments relate at each recursive call as follows:
(<, ≤, =: relation proved, ? all proofs failed, _: no proof attempted)
          hp hq
1) 5:8-12  _  _
Please use `termination_by` to specify a decreasing measure.

May 29 '24 19:05 amit9oct

Your two examples are not the same. Did you try attempting to complete the proof using exact hp; exact hq in the REPL?

It is expected that calling apply test does not result in an immediate error. This is a valid proof strategy for proofs by induction, and the termination check only runs once the definition or proof is complete.

May 30 '24 05:05 digama0

I forgot to paste the full output

{"cmd": "theorem test (p q : Prop) (hp : p) (hq : q) : p ∧ q ∧ p := by sorry"}

{"sorries":
 [{"proofState": 0,
   "pos": {"line": 1, "column": 62},
   "goal": "p q : Prop\nhp : p\nhq : q\n⊢ p ∧ q ∧ p",
   "endPos": {"line": 1, "column": 67}}],
 "messages":
 [{"severity": "warning",
   "pos": {"line": 1, "column": 8},
   "endPos": {"line": 1, "column": 12},
   "data": "declaration uses 'sorry'"}],
 "env": 0}

{"tactic": "apply test", "proofState": 0}

{"proofState": 1,
 "goals":
 ["case hp\np q : Prop\nhp : p\nhq : q\n⊢ p",
  "case hq\np q : Prop\nhp : p\nhq : q\n⊢ q"]}

{"tactic": "exact hp", "proofState": 1}

{"proofState": 2, "goals": ["case hq\np q : Prop\nhp : p\nhq : q\n⊢ q"]}

{"tactic": "exact hq", "proofState": 2}

{"proofState": 3, "goals": []}

As you can there is no error message here. Also NOTE: The VS Code IDE shows the error message as soon as we use apply test tactic.

May 30 '24 05:05 amit9oct

Also exact hp; exact hq won't work in REPL because it doesn't support tactical (;)

May 30 '24 05:05 amit9oct

Is this actually a complete proof though? I'm not super familiar with the REPL API but I would expect some kind of qed-equivalent to complete the proof once all the goals are finished. It may well be that this is a (design) bug in the REPL in that it has no way to express errors that occur when entering the definition into the environment.

PS: ; isn't a tactical, it is just a list of tactics. You have to pass them separately with the REPL API since it takes single tactics, or enclose it in parentheses like (exact hp; exact hq).

May 30 '24 05:05 digama0

I just followed the example shown in the README (https://github.com/leanprover-community/repl/blob/master/README.md), I may be wrong, but it doesn't look like there is something like qed from the examples.

May 30 '24 06:05 amit9oct

At present there is nothing you can do with a completed proof state: we would like to extend this so that you can replace the original sorry with your tactic script, and obtain the resulting Environment.

So apparently QED is not implemented yet. If it was, this is the step that would give the error.

May 30 '24 06:05 digama0

Hi, Happy new year!

So, unfortunately, in this context, it seems that this makes the exact? tactic slightly dangerous to use as it will "prove" anything:

{"cmd": "theorem ex : False := sorry"}

{"sorries":
 [{"proofState": 0,
   "pos": {"line": 1, "column": 22},
   "goal": "⊢ False",
   "endPos": {"line": 1, "column": 27}}],
 "messages":
 [{"severity": "warning",
   "pos": {"line": 1, "column": 8},
   "endPos": {"line": 1, "column": 10},
   "data": "declaration uses 'sorry'"}],
 "env": 0}

{"proofState": 0, "tactic": "exact?"}

{"proofState": 1,
 "messages":
 [{"severity": "info",
   "pos": {"line": 0, "column": 0},
   "endPos": {"line": 0, "column": 0},
   "data": "Try this: exact _root_.ex"}],
 "goals": []}

exact? is referring to the theorem itself to prove it as suggested by "Try this: exact _root_.ex".

Jan 01 '25 13:01 augustepoiroux

The exact? case is handled here:

https://github.com/leanprover-community/repl/commit/66515025d9ccffa2be29718d31b816edd0675202

However, it only works if the theorem is initialized in an existing environment, e.g.

{"cmd": "#eval 1 + 1"}

{"cmd": "theorem ex : False := sorry", "env": 0}

{"proofState": 0, "tactic": "exact?"}

exact ex still works here. I guess for recursion it resolves from somewhere else than ctx.env (resolveLocalName it seems?)

~The question is, why does the InfoTree from elaborating the definition have as its root a ContextInfo with an Environment that includes in constants the definition itself, if it is not necessary for recursion and messes with exact?? It becomes an issue when the definition creates e.g. foo and foo.match_1 and you want only foo.match_1 to be in the env of the sorry. As it is currently, https://github.com/leanprover-community/repl/commit/66515025d9ccffa2be29718d31b816edd0675202 ensures that neither foo nor foo.match_1 are in the env.~ #66

Jan 08 '25 07:01 llllvvuu

Interesting, thanks! Having the definition in constants is probably useful/necessary for the pickle feature.

Jan 08 '25 09:01 augustepoiroux

I found another example where the REPL accepts an incorrect proof:

amc12a_2002_p12 from minif2f.

import Mathlib
theorem amc12a_2002_p12 (f : ℝ → ℝ) (k : ℝ) (a b : ℕ) (h₀ : ∀ x, f x = x ^ 2 - 63 * x + k)
    (h₁ : f a = 0 ∧ f b = 0) (h₂ : a ≠ b) (h₃ : Nat.Prime a ∧ Nat.Prime b) : k = 122 := by
  apply (mul_right_inj' (sub_ne_zero.2 ?_)).1
  ring_nf at h₁ h₂ ⊢

This can be reproduced with the following REPL commands:

{"cmd": "import Mathlib"}

{"cmd": "theorem amc12a_2002_p12 (f : ℝ → ℝ) (k : ℝ) (a b : ℕ) (h₀ : ∀ x, f x = x ^ 2 - 63 * x + k)\n    (h₁ : f a = 0 ∧ f b = 0) (h₂ : a ≠ b) (h₃ : Nat.Prime a ∧ Nat.Prime b) : k = 122 := by sorry", "env": 0}

{"tactic": "apply (mul_right_inj' (sub_ne_zero.2 ?_)).1", "proofState": 0}

{"tactic": "ring_nf at h₁ h₂ ⊢", "proofState": 1}

I've tested on PR #63, but it still fails to catch the invalid proofs.

Feb 20 '25 08:02 RexWzh

I found another example where the REPL accepts an incorrect proof:

amc12a_2002_p12 from minif2f.

import Mathlib
theorem amc12a_2002_p12 (f : ℝ → ℝ) (k : ℝ) (a b : ℕ) (h₀ : ∀ x, f x = x ^ 2 - 63 * x + k)
    (h₁ : f a = 0 ∧ f b = 0) (h₂ : a ≠ b) (h₃ : Nat.Prime a ∧ Nat.Prime b) : k = 122 := by
  apply (mul_right_inj' (sub_ne_zero.2 ?_)).1
  ring_nf at h₁ h₂ ⊢

This can be reproduced with the following REPL commands:

{"cmd": "import Mathlib"}

{"cmd": "theorem amc12a_2002_p12 (f : ℝ → ℝ) (k : ℝ) (a b : ℕ) (h₀ : ∀ x, f x = x ^ 2 - 63 * x + k)\n    (h₁ : f a = 0 ∧ f b = 0) (h₂ : a ≠ b) (h₃ : Nat.Prime a ∧ Nat.Prime b) : k = 122 := by sorry", "env": 0}

{"tactic": "apply (mul_right_inj' (sub_ne_zero.2 ?_)).1", "proofState": 0}

{"tactic": "ring_nf at h₁ h₂ ⊢", "proofState": 1}

I've tested on PR #63, but it still fails to catch the invalid proofs.

Hello, I have encountered the same issue. Have you resolved it now?

Mar 03 '25 06:03 wuseguang

You can use our tool: https://github.com/trishullab/itp-interface

It is based on REPL but doesn't accept wrong proofs. It also allows parallel proof execution, handling memory issues which might arise when you use too many tactics in your proof while using REPL, and a lot of other fixes. It is in Python and can run parallely on ray clusters.

Mar 03 '25 06:03 amit9oct

You can use our tool: https://github.com/trishullab/itp-interface

It is based on REPL but doesn't accept wrong proofs. It also allows parallel proof execution, handling memory issues which might arise when you use too many tactics in your proof while using REPL, and a lot of other fixes. It is in Python and can run parallely on ray clusters.

Hi, @amit9oct , I run the repl at https://github.com/amit9oct/repl with lean4 version v4.16.0. The commads is: {"cmd": "import Mathlib"}

{"cmd": "theorem amc12a_2002_p12 (f : ℝ → ℝ) (k : ℝ) (a b : ℕ) (h₀ : ∀ x, f x = x ^ 2 - 63 * x + k)\n (h₁ : f a = 0 ∧ f b = 0) (h₂ : a ≠ b) (h₃ : Nat.Prime a ∧ Nat.Prime b) : k = 122 := by sorry", "env": 0}

{"tactic": "apply (mul_right_inj' (sub_ne_zero.2 ?_)).1", "proofState": 0}

{"tactic": "ring_nf at h₁ h₂ ⊢", "proofState": 1}

but it still accept wrong proofs.

Mar 03 '25 08:03 wuseguang

No use the python interface and not the native REPL to run your code. See the README https://github.com/trishullab/itp-interface (we have also tested our code on miniF2F and it works). The key is to not use tactic mode at all

Mar 03 '25 08:03 amit9oct

Our python code uses REPL but only for barebones Lean 4 support, we don't use pickling or tactic mode which might be prone to bugs

Mar 03 '25 08:03 amit9oct

To summarize, there are four bug cases in tactic mode.

Accept False Proof

1. Application Type Mismatch

import Mathlib
theorem demo (x : ℕ) : ∃x, ∃ g : (∀ n : ℕ, ¬Nat.Prime x), ¬x ≤ x := by
use x
have g6 : 2 % 2 < 2 := by 
    refine Nat.mod_lt 2 ?h
    simp

2. Placeholder Synthesis

theorem amc12a_2002_p12 (f : ℝ → ℝ) (k : ℝ) (a b : ℕ) 
    (h₀ : ∀ x, f x = x ^ 2 - 63 * x + k)
    (h₁ : f a = 0 ∧ f b = 0) (h₂ : a ≠ b) 
    (h₃ : Nat.Prime a ∧ Nat.Prime b) : k = 122 := by
  apply (mul_right_inj' (sub_ne_zero.2 ?_)).1
  ring_nf at h₁ h₂ ⊢

3. Self-Application

theorem imo2024_p2 :
    {(a, b) | 0 < a ∧ 0 < b ∧ 
     ∃ g N, 0 < g ∧ 0 < N ∧ 
     ∀ n ≥ N, Nat.gcd (a ^ n + b) (b ^ n + a) = g}
    = {(1, 1)} := by 
  ext n
  symm
  rw [imo2024_p2]

Reject Valid Proof

4. Set Option Failures

import Mathlib
set_option maxHeartbeats 0
set_option maxRecDepth 1000000
theorem mathd_numbertheory_314 (r n : ℕ) 
    (h₀ : r = 1342 % 13) (h₁ : 0 < n) 
    (h₂ : 1342 ∣ n) (h₃ : n % 13 < r) : 
    6710 ≤ n := by
contrapose! h₃
interval_cases n <;> simp_all (config := {decide := true})

Mar 03 '25 15:03 RexWzh

Here are some minimal examples of the "Accept False Proof" bugs reported by @RexWzh, together with errors shown in Lean. All of these proofs pass as valid in REPL's tactic mode.

example : 1 = 0 := by
  cases 1
  rfl
  apply ?succ

Lean error:

(kernel) declaration has metavariables '_example'

example : 1 = 0 := by
  cases 1
  rfl
  have: true := by
    apply ?succ
    trivial

Lean error:

application type mismatch
  let_fun this := of_decide_eq_true (id (Eq.refl true));
  of_decide_eq_true (id (Eq.refl true))
argument has type
  true = true → true = true
but function has type
  (∀ (x : true = true), (fun this ↦ n✝ + 1 = 0) x) → (fun this ↦ n✝ + 1 = 0) ⋯

example : 1 = 0 := by
  cases' exists_ne 2 with n h
  simpa [] using h _

Lean error:

don't know how to synthesize placeholder
context:
n : ℕ
h : n ≠ 2
⊢ n = 2

theorem self_application : 1 = 0 := by
  rw [self_application]

Lean error:

fail to show termination for
  self_application
with errors
failed to infer structural recursion:
no parameters suitable for structural recursion

well-founded recursion cannot be used, 'self_application' does not take any (non-fixed) arguments

Apr 09 '25 11:04 Kripner

Here are some minimal examples of the "Accept False Proof" bugs reported by @RexWzh, together with errors shown in Lean. All of these proofs pass as valid in REPL's tactic mode

@Kripner Consider using our tool: https://github.com/trishullab/itp-interface

We recently introduced enforce done mode that ensures that even with bugs in interaction framework a wrong proof doesn't get accepted.

Apr 09 '25 14:04 amit9oct

Tests corresponding to the various minimal cases mentioned in @Kripner message have been added to #63. So far, it seems that only the following self-application example is not caught:

theorem ex : False := by
   exact?

If you find other issues, feel free to share them here. Any help is welcome if you have any idea on how to solve this remaining case.

Apr 09 '25 15:04 augustepoiroux

@amit9oct itp-interface uses REPL's command mode to avoid false-positives, which is resource-intensive in terms of memory usage. It is necessary to fix potential bugs in tactic mode.

Consider this example from MiniF2F:

theorem amc12a_2003_p1 (u v : ℕ → ℕ) (h₀ : ∀ n, u n = 2 _n + 2) (h₁ : ∀ n, v n = 2_ n + 1) :
    ((∑ k in Finset.range 2003, u k) - ∑ k in Finset.range 2003, v k) = 2003 := by
simp only [h₀, h₁, Finset.sum_range_succ, Finset.sum_range_zero]
rfl

Command Mode: It consumes about 170GB of memory and takes 1,236 seconds to run.

{"cmd": "import Mathlib\nset_option maxHeartbeats 0\nset_option maxRecDepth 100000"}

{"cmd": "theorem amc12a_2003_p1 (u v : ℕ → ℕ) (h₀ : ∀ n, u n = 2 _n + 2) (h₁ : ∀ n, v n = 2_ n + 1) :\n    ((∑ k in Finset.range 2003, u k) - ∑ k in Finset.range 2003, v k) = 2003 := by", "env": 0}

{"cmd": "theorem amc12a_2003_p1 (u v : ℕ → ℕ) (h₀ : ∀ n, u n = 2 _n + 2) (h₁ : ∀ n, v n = 2_ n + 1) :\n    ((∑ k in Finset.range 2003, u k) - ∑ k in Finset.range 2003, v k) = 2003 := by\nsimp only [h₀, h₁, Finset.sum_range_succ, Finset.sum_range_zero]", "env": 0}

{"cmd": "theorem amc12a_2003_p1 (u v : ℕ → ℕ) (h₀ : ∀ n, u n = 2 * n + 2) (h₁ : ∀ n, v n = 2 * n + 1) :\n    ((∑ k in Finset.range 2003, u k) - ∑ k in Finset.range 2003, v k) = 2003 := by\nsimp only [h₀, h₁, Finset.sum_range_succ, Finset.sum_range_zero]\nrfl", "env": 0}

Tactic Mode:

However, the same problem is solved in seconds using tactic mode.

{"cmd": "import Mathlib\nset_option maxHeartbeats 0\nset_option maxRecDepth 100000"}

{"cmd": "theorem amc12a_2003_p1 (u v : ℕ → ℕ) (h₀ : ∀ n, u n = 2 _n + 2) (h₁ : ∀ n, v n = 2_ n + 1) :\n    ((∑ k in Finset.range 2003, u k) - ∑ k in Finset.range 2003, v k) = 2003 := by sorry", "env": 0}

{"tactic": "simp only [h₀, h₁, Finset.sum_range_succ, Finset.sum_range_zero]", "proofState": 0}

{"tactic": "rfl", "proofState": 1}

Apr 10 '25 08:04 RexWzh

@RexWzh this example is interesting, and #63 incorrectly rejects it during kernel check. It seems related to the other problem you mentioned earlier in this discussion about set options that are not taken into account in tactic mode. I decided to try your problem in Lean on live.lean-lang.org. It turns out that the kernel check is what takes time, hence the 1,236s in the command mode. In tactic mode, the kernel check is almost instant in this case because it doesn't seem to take into account set_option maxHeartbeats 0\nset_option maxRecDepth 100000.

Apr 10 '25 08:04 augustepoiroux

For the "set options" part, it appears that maxHeartbeats is not working as expected, while maxRecDepth works correctly. You can check the behavior in this example.

As for "Placeholder Synthesis," Pantograph handles this well with their approach:

When Lean LSP says “don’t know how to synthesize placeholder”, this indicates the human operator needs to manually move the cursor to the placeholder and type in the correct expression. This error therefore should not halt the proof process, and the placeholder should be turned into a goal.

This approach makes more sense, though I haven't workout how to port this functionality to REPL.

Apr 10 '25 08:04 RexWzh

@amit9oct itp-interface uses REPL's command mode to avoid false-positives, which is resource-intensive in terms of memory usage. It is necessary to fix potential bugs in tactic mode.

@RexWzh Once you start using tactic mode for longer the memory will bloat eventually. In fact, my first attempt of making a python tool based on REPL was using tactic mode. The high memory is because the way REPL keeps clones of env to enable backtracking and has nothing to do with itp-interface. In fact, in ITP interface you can reduce the memory bloat by choosing to reduce the number of steps a REPL process will be kept alive in itp-interface. There are other things built in which allow you to parallely run proof search. There is a slight cost to pay for getting ray actor which I think is bearable. However, since proof search and training happen at scale, I don't think that is a big cost to pay. REPL as a light weight scalable JRPC is a good starting point, but it might be difficult to bake in all complexity which comes in when you start writing proofs, specially knowing when the proof finishes. IMO, use of local machines like laptops to do proof search is not a very pressing usecase (even for mathematicians). Someone rather run something powerful on a server and get the proofs. Also itp-interface loads all definitions and imports from the file till the point of the theorem you are interested and it does it once, so the comparison you are making is not apples to apples (because you can potentially run multiple theorem on the same instance). Your comparison is between vanilla cmd and tactic mode, not itp-interface and equivalent wrapper written in tactic mode.

However, by no means the bugs should not be fixed. I never suggested bugs should not be fixed (not sure where you got that from). In fact, I was the first one who noted these problems years ago and reported them (even opened this issue). Based on my understanding, individual tactics with so many possible semantics involved (thanks to the beautiful Lean language), I don't think the fix is easy. I'm just suggesting people an alternative which can be helpful meanwhile this gets fixed (trying to save some time for people working on these problems in the community).

Apr 10 '25 08:04 amit9oct

Sorry for my previous phrasing. I presented the example merely to illustrate the importance of tactic mode.

ITP-interface is an excellent project that I've been exploring for a while. Thank you for providing detailed explanations whenever I had questions about the project.

Command mode typically operates in O(N) time, while tactic mode operates in O(1), making tactic mode potentially crucial for longer proofs. While command mode does offer advantages in terms of parallel processing capabilities, both modes have their distinct benefits.

Apr 10 '25 09:04 RexWzh

Command mode typically operates in O(N) time, while tactic mode operates in O(1), making tactic mode potentially crucial for longer proofs. While command mode does offer advantages in terms of parallel processing capabilities, both modes have their distinct benefits.

I completely agree with you, there can be usecases where it is helpful (specially if someone is operating with a low compute budget).

Apr 10 '25 09:04 amit9oct

I think it is also important to keep in mind that Lean REPL is a general tool. While most of us use it in an AI4Math context, there are other use cases such as lean4_jupyter. In such projects, people are probably interested in having an efficient tool capable of running on laptops.

Apr 10 '25 09:04 augustepoiroux

@RexWzh this example is interesting, and #63 incorrectly rejects it during kernel check. It seems related to the other problem you mentioned earlier in this discussion about set options that are not taken into account in tactic mode. I decided to try your problem in Lean on live.lean-lang.org. It turns out that the kernel check is what takes time, hence the 1,236s in the command mode. In tactic mode, the kernel check is almost instant in this case because it doesn't seem to take into account set_option maxHeartbeats 0\nset_option maxRecDepth 100000.

@augustepoiroux Actually, I don't think the lean web can run this proof, as it requires nearly 170G of memory! 😆 While this isn't a good proof example, it serves as an interesting illustration.

You can observe this behavior by running lake env lean test.lean in your shell. The memory usage grows significantly:

I'm particularly curious about why meta-programming approaches generally tend to be more efficient and consume less memory.

Apr 10 '25 09:04 RexWzh

@RexWzh Just pushed a new PR #82 which should fix the set_config issues. Your miniF2F example makes the memory explodes just like in command mode.

Apr 10 '25 14:04 augustepoiroux

Just found a false negative proof of tactic mode:

example : true := by
  let p : ∃ x : Nat, x > 0 := ⟨1, Nat.one_pos⟩  
  let ⟨n, hn⟩ := p
  rfl

Commands to reproduce:

{"cmd": "example : true := by sorry"}

{"tactic": "let p : ∃ x : Nat, x > 0 := ⟨1, Nat.one_pos⟩", "proofState": 0}

{"tactic": "let ⟨n, hn⟩ := p", "proofState": 1}

Error message: "auxiliary declaration cannot be created when declaration name is not available"

Apr 17 '25 21:04 RexWzh

@RexWzh Which commit did you use for this example? On the master branch, when I try your example:

{"cmd": "example : true := by sorry"}

{"tactic": "let p : ∃ x : Nat, x > 0 := ⟨1, Nat.one_pos⟩", "proofState": 0}

{"tactic": "let ⟨n, hn⟩ := p", "proofState": 1}

{"tactic": "rfl", "proofState": 2}

I get a correct output:

{"sorries":
 [{"proofState": 0,
   "pos": {"line": 1, "column": 21},
   "goal": "⊢ true = true",
   "endPos": {"line": 1, "column": 26}}],
 "messages":
 [{"severity": "warning",
   "pos": {"line": 1, "column": 0},
   "endPos": {"line": 1, "column": 7},
   "data": "declaration uses 'sorry'"}],
 "env": 0}

{"proofStatus": "Incomplete: open goals remain",
 "proofState": 1,
 "goals": ["p : ∃ x, x > 0 := Exists.intro 1 Nat.one_pos\n⊢ true = true"]}

{"proofStatus": "Incomplete: open goals remain",
 "proofState": 2,
 "goals":
 ["p : ∃ x, x > 0 := Exists.intro 1 Nat.one_pos\nn : Nat\nhn : n > 0\n⊢ true = true"]}

{"proofStatus": "Completed", "proofState": 3, "goals": []}

Apr 21 '25 12:04 augustepoiroux

repl repl copied to clipboard

[BUG] REPL accepts incorrect proofs

Accept False Proof

1. Application Type Mismatch

2. Placeholder Synthesis

3. Self-Application

Reject Valid Proof

4. Set Option Failures

repl
repl copied to clipboard