OpenAI o1 or DeepSeek R1 can solve it; not sure GPT-4 can
This puzzle was made back when neither OpenAI o1 nor DeepSeek R1 existed, and it is meant to be solved with GPT-4. The author claims to have solved it with GPT-4, but I'm not sure how that's possible: GPT-4 can't even reliably solve the easy game (it's correct in only ~50% of runs). It's been a year and 9+ months since the puzzle was published, and the leaderboard is still empty.
Meanwhile, both DeepSeek R1 (I ran the prompt manually) and OpenAI o1 (model: "o1-preview") solved the first three games (easy, medium, hard) on the first attempt. DeepSeek actually took quite a lot of thinking time:
- Easy: 43 seconds
- Medium: 119 seconds
- Hard: 196 seconds
Unfortunately, neither model could solve the "Evil" game. R1 thought for 508 seconds and in its final thoughts admitted "I'm not certain". I tried to explain BFS to R1, but it only took my suggestion half seriously and juggled BFS attempts with its own brute-force approaches for another 393 seconds, and then (after I insisted on using BFS) for another 292 seconds, still without being able to fully commit to BFS ("Let's try a different approach. Let me think again.").
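In case it's useful to anyone trying the same thing, this is the kind of state-space BFS I was trying to get R1 to commit to. Everything in this sketch is made up for illustration (the grid, the symbols, the key/door mechanic); the actual game's state and rules are more involved, but the idea of searching over (position, has-key) states is the same:

```python
from collections import deque

# Toy maze for illustration only: S = start, E = exit, K = key,
# D = locked door, '#' = wall. The real puzzle's grid and rules differ.
GRID = [
    "#########",
    "#S..#..K#",
    "#.#.#.#.#",
    "#.#...#.#",
    "#.#####.#",
    "#.......#",
    "####D####",
    "#..E....#",
    "#########",
]

def solve(grid):
    rows, cols = len(grid), len(grid[0])
    find = lambda ch: next((r, c) for r in range(rows)
                           for c in range(cols) if grid[r][c] == ch)
    (sr, sc), goal = find("S"), find("E")
    # A state is (row, col, has_key); BFS over states finds the shortest
    # move sequence, which is exactly what I wanted R1 to execute.
    queue = deque([(sr, sc, False, "")])
    seen = {(sr, sc, False)}
    while queue:
        r, c, key, path = queue.popleft()
        if (r, c) == goal:
            return path
        for dr, dc, move in ((-1, 0, "U"), (1, 0, "D"), (0, -1, "L"), (0, 1, "R")):
            nr, nc = r + dr, c + dc
            cell = grid[nr][nc]
            if cell == "#" or (cell == "D" and not key):
                continue  # blocked by a wall, or by the door without the key
            state = (nr, nc, key or cell == "K")
            if state not in seen:
                seen.add(state)
                queue.append((*state, path + move))
    return None  # no solution

print(solve(GRID))  # prints the shortest move sequence, or None if unsolvable
```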
My token usage on "Hard" with o1 was:
- Input Tokens: 765
- Output Tokens: 272
although it's not a direct comparison with GPT-4, of course.
I'm probably done with this puzzle, but I'm very curious: what is the author's prompt that managed to get it working on GPT-4?
P.S.: for those who want to try it: be prepared to dig through the author's code (it needs adjustments to work reasonably) and to top up $6 of OpenAI API credit (you can't even access the GPT-4o model for free through the API).
Sidenote: seeing DeepSeek R1 try various "shortcuts" on the "Evil" game and fail miserably, instead of simply and methodically executing BFS, reminds me of how I do a similar thing in real life: trying various semi-random "shortcuts" instead of focusing methodically on one specific method/direction.
Nice! Yes, this is exactly where I got to as well. I was pretty surprised that current models fail on the evil puzzle. I thought this was the kind of thing that "thinking" models should do well on.
The code rotted a bit over the last year as all the APIs changed. Feel free to fork it if you want to clean it up.
GPT-4 solved it pretty stochastically, but it was able to do Hard about 1 in 4 times.
Thanks for the response! I was wondering if there is some kind of "secret prompting technique" that got GPT-4 to work. Having it solve Hard 1/4 of the time is still very surprising! Do you still have that prompt?
I just re-ran GPT-4 on Medium ~10 times and it couldn't solve it a single time (although it does often pick up the key). I thought that adding the updated position as a comment on each line of the example (so the model does the same, and the comments indirectly act as a kind of memory for "non-thinking" models) would help, but it didn't (at least, not enough).
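For illustration, the annotated example format I tried looked roughly like this (the move syntax and coordinates here are made up; the real puzzle's format is different):

```
move right   # now at (2, 3)
move right   # now at (2, 4)
move up      # now at (1, 4), picked up key
move up      # now at (0, 4)
```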
I will try to find the prompt. It was very involved, with a lot of examples. Nothing secret.
One thing that was really cool: when I tried it with Gemini Flash, the CoT was literally a search exploration path. Does R1 do that too? I'd love to see your raw output on expert if you have it.