Add Terminal-Bench sandbox example environment
Summary
- add a configurable SandboxEnv wrapper for Terminal-Bench tasks that stages assets in a fresh sandbox and runs the official tests during post-rollout
- cache the test outcome so a simple rubric can score it, and ship a single-example dataset for the requested task ID
- document the environment and declare its dependencies
Testing
- `uv run ruff check environments/terminal_bench/terminal_bench.py`
https://chatgpt.com/codex/tasks/task_e_68ea1a482a388326932b1fb2fa1d666d