Thomas Broadley
One potential problem: I don't think PM2 can be configured to send SIGINT to the old instance of the process, start a new instance, then stop tracking the old instance....
Part of the problem here is long-running API requests and background processes, which mean it can take minutes or hours for `pm2 restart/reload` to finish. I think the main mitigation...
OK yeah I agree, if we could allow Vivaria processes (both servers and background process runners) to live forever, then that would solve these issues, too. So yeah we could...
I'm seeing how Cursor agent mode handles this task
Before implementing this, I'll spend some time thinking about whether it would be better to read the required data directly from eval log files and store it in a new...
## Context for RE-Bench Data Requirements

RE-Bench is a benchmark released by METR for evaluating frontier AI R&D capabilities: https://arxiv.org/abs/2411.15114

### Data Pipeline

The RE-Bench analysis uses a Python/DVC data...
@mentatbot Please solve this.
I asked Claude Sonnet 4 to help me collect more data about which columns are used by which queries.

| Table | Field | fetch_agent_runs.sql | fetch_human_runs.sql | fetch_agent_runs_cost_and_latency.sql |...
Stuff to check:

- [x] How do our RE-Bench Inspect ports indicate intermediate scores?
  - The agent calls a `score` tool
- [x] Does Vivaria import intermediate scores correctly from...
I've updated my comment above with the results of my investigation. Things look good! We shouldn't need to make any more Vivaria changes to handle importing RE-Bench Inspect runs. TODOs:...