Thomas Broadley comments

Results: 109 comments of Thomas Broadley

If Vivaria restarts during a run's initial intermediate scoring, then run gets a fatal error

One potential problem is, I don't think PM2 can be configured to send the old instance of the process SIGINT, start a new instance, then stop tracking the old instance....

If Vivaria restarts during a run's initial intermediate scoring, then run gets a fatal error

Part of the problem here is long-running API requests and background processes, which mean that `pm2 restart/reload` can take minutes or hours to finish. I think the main mitigation...

If Vivaria restarts during a run's initial intermediate scoring, then run gets a fatal error

OK yeah I agree, if we could allow Vivaria processes (both servers and background process runners) to live forever, then that would solve these issues, too. So yeah we could...

If Vivaria restarts during a run's initial intermediate scoring, then run gets a fatal error

I'm seeing how Cursor agent mode handles this task

Ensure Inspect importer imports all data required for RE-Bench analysis pipeline

Before implementing this, I'll spend some time thinking about whether it would be better to read the required data directly from eval log files and store it in a new...

Ensure Inspect importer imports all data required for RE-Bench analysis pipeline

## Context for RE-Bench Data Requirements

RE-Bench is a benchmark released by METR for evaluating frontier AI R&D capabilities: https://arxiv.org/abs/2411.15114

### Data Pipeline

The RE-Bench analysis uses a Python/DVC data...

Ensure Inspect importer imports all data required for RE-Bench analysis pipeline

@mentatbot Please solve this.

Ensure Inspect importer imports all data required for RE-Bench analysis pipeline

Stuff to check:

- [x] How do our RE-Bench Inspect ports indicate intermediate scores?
  - The agent calls a `score` tool
- [x] Does Vivaria import intermediate scores correctly from...

Ensure Inspect importer imports all data required for RE-Bench analysis pipeline

I've updated my comment above with the results of my investigation. Things look good! We shouldn't need to make any more Vivaria changes to handle importing RE-Bench Inspect runs. TODOs:...