OpenAI o1 or DeepSeek R1 can solve it; not sure GPT-4 can
This puzzle was made back when neither OpenAI o1 nor DeepSeek R1 existed, and it is meant to be solved with GPT-4. The author claims to have solved it with GPT-4, but I'm not sure how that's possible: GPT-4 can't even reliably solve the easy game (it's correct in only ~50% of runs). It's been a year and 9+ months since the puzzle was published, and the leaderboard is still empty.
Meanwhile, both DeepSeek R1 (I ran the prompt manually) and OpenAI o1 (model: "o1-preview") solved the first three games (easy, medium, hard) on the first attempt. DeepSeek actually took quite a lot of thinking time:
- Easy: 43 seconds
- Medium: 119 seconds
- Hard: 196 seconds
Unfortunately, neither model could solve the "Evil" game. R1 thought for 508 seconds and in its final thoughts admitted "I'm not certain". I tried to explain BFS to R1, but it only took my suggestion half seriously and juggled BFS attempts with its own brute-force approaches for another 393 seconds, and then (after I insisted on using BFS) for another 292 seconds, still without being able to fully commit to BFS ("Let's try a different approach. Let me think again.").
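In case it's useful to anyone trying the same thing, this is the kind of state-space BFS I was trying to get R1 to commit to. Everything in this sketch is made up for illustration (the grid, the symbols, the key/door mechanic); the actual game's state and rules are more involved, but the idea of searching over (position, has-key) states is the same:

```python
from collections import deque

# Toy maze for illustration only: S = start, E = exit, K = key,
# D = locked door, '#' = wall. The real puzzle's grid and rules differ.
GRID = [
    "#########",
    "#S..#..K#",
    "#.#.#.#.#",
    "#.#...#.#",
    "#.#####.#",
    "#.......#",
    "####D####",
    "#..E....#",
    "#########",
]

def solve(grid):
    rows, cols = len(grid), len(grid[0])
    find = lambda ch: next((r, c) for r in range(rows)
                           for c in range(cols) if grid[r][c] == ch)
    (sr, sc), goal = find("S"), find("E")
    # A state is (row, col, has_key); BFS over states finds the shortest
    # move sequence, which is exactly what I wanted R1 to execute.
    queue = deque([(sr, sc, False, "")])
    seen = {(sr, sc, False)}
    while queue:
        r, c, key, path = queue.popleft()
        if (r, c) == goal:
            return path
        for dr, dc, move in ((-1, 0, "U"), (1, 0, "D"), (0, -1, "L"), (0, 1, "R")):
            nr, nc = r + dr, c + dc
            cell = grid[nr][nc]
            if cell == "#" or (cell == "D" and not key):
                continue  # blocked by a wall, or by the door without the key
            state = (nr, nc, key or cell == "K")
            if state not in seen:
                seen.add(state)
                queue.append((*state, path + move))
    return None  # no solution

print(solve(GRID))  # prints the shortest move sequence, or None if unsolvable
```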
My token usage on "Hard" with o1 was:
- Input Tokens: 765
- Output Tokens: 272
although it's not a direct comparison with GPT-4, of course.
I'm probably done with this puzzle, but I'm very curious: what is the author's prompt that managed to get it working on GPT-4?
P.S.: for those who want to try it: be prepared to dig through the author's code (it needs adjustments to work reasonably) and to top up $6 of OpenAI API credit (you can't even access the GPT-4o model for free through the API).
Sidenote: seeing DeepSeek R1 try various "shortcuts" on the "Evil" game and fail miserably, instead of simply and methodically executing BFS, reminds me of how I do a similar thing in real life: trying various semi-random "shortcuts" instead of focusing methodically on one specific method/direction.
Nice! Yes, this is exactly where I got to as well. I was pretty surprised that current models fail on the evil puzzle. I thought this was the kind of thing that "thinking" models should do well on.
The code rotted a bit over the last year as all the APIs changed. Feel free to fork it if you want to clean it up.
GPT-4 solved it pretty stochastically, but it was able to do Hard about 1 in 4 times.
Thanks for the response! I was wondering if there is some kind of "secret prompting technique" that got GPT-4 to work. Having it solve Hard 1/4 of the time is still very surprising! Do you still have that prompt?
I just re-ran GPT-4 on Medium ~10 times and it couldn't solve it a single time (although it does often pick up the key). I thought that adding the updated position as a comment on each line of the example (so the model does the same, and the comments indirectly act as a kind of memory for "non-thinking" models) would help, but it didn't (at least, not enough).
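For illustration, the annotated example format I tried looked roughly like this (the move syntax and coordinates here are made up; the real puzzle's format is different):

```
move right   # now at (2, 3)
move right   # now at (2, 4)
move up      # now at (1, 4), picked up key
move up      # now at (0, 4)
```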
I will try to find the prompt. It was very involved, with a lot of examples. Nothing secret.
One thing that was really cool: when I tried it with Gemini Flash, the CoT was literally a search exploration path. Does R1 do that too? I'd love to see your raw output on expert if you have it.