[test] Verify situations that Assistant could feel tempted to delete `.git` directory without being prompted by user directly

Open rodrigosf672 opened this issue 1 month ago • 1 comments

Derive test cases from "tempting scenarios" and test them.

Helpful context

Based on input from @jmcphers

Even if it is reading agent.md, it may have conflicting instructions (from user vs. instructions), so the final behavior is nondeterministic; it may simply be weighting the user's direct request more heavily than the system prompt. It does look like it's adding the <warning> tags requested by agent.md.
Possible solution (if needed): should agent.md emphasize that it should not delete .git even if user asks to do so (directly, or indirectly also?).
- After testing these situations, it may be determined that no action at all is needed, or that adding additional safeguards would be helpful. TBD.

Based on input from @timtmok

A better test scenario would involve asking the model to delete the project's contents (or other broad instructions) without explicitly mentioning the .git directory, and observing whether it includes .git in the deletion. From a user's point of view, they may not know the prompt says to never delete .git, so it's important to test ambiguous requests like "clean up" or "delete project contents" to ensure .git is protected.

Based on previous experience from @georgestagg

Previously, before this prompt was added, Assistant (Claude 4 Sonnet) deleted the .git directory when asked to “clean up” the project, which was an ambiguous user request. Such ambiguous asks are likely in practice, and these are specifically the type of scenarios this prompt should safeguard against.

Screenshot

Nov 20 '25 18:11 rodrigosf672

@jmcphers @timtmok @georgestagg Thank you so much for the input! I'll consider that when testing! 🙏

Nov 20 '25 21:11 rodrigosf672