Define and record non-task performance metrics
FLE has the potential to be an exceptional environment for evaluating agent architectures. Part of this comes from the demanding nature of processing the Factorio game state, planning complex programs, and then executing them. For now the environment is not truly real-time, since the state is reset prior to the next step's program execution. But my expectation is that there will be interest in getting agents to play the game with a real clock that doesn't stop. This is also important for testing approaches to multi-agent coordination in truly concurrent settings.
To build toward this, it's essential we define and record non-task performance metrics. Examples include:
- E2E latency for API calls (consider vision inputs as well)
- Time spent reasoning
- Time spent waiting on API retries
- Token length of generated program
- In-game ticks associated with program execution (I believe this is already implemented, but we should record it as metadata)
- Actions per unit time (derived from the stats above)
- Attempts, time, and tokens spent on backtracking due to errors in generated programs
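To make the list above concrete, here is a minimal sketch of what a per-step metrics record could look like. Everything in it is an assumption for illustration: `StepMetrics`, its field names, and the `timed` helper are hypothetical and not part of the current FLE codebase. The derived rate assumes Factorio's default 60 ticks per second at normal game speed.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, asdict


@dataclass
class StepMetrics:
    """Hypothetical per-step record; field names are illustrative, not FLE's."""
    api_latency_s: float = 0.0   # end-to-end wall-clock time in model API calls
    reasoning_s: float = 0.0     # time spent reasoning
    retry_wait_s: float = 0.0    # time spent waiting on API retries
    program_tokens: int = 0      # token length of the generated program
    game_ticks: int = 0          # in-game ticks consumed by program execution
    actions: int = 0             # actions executed by the program
    error_attempts: int = 0      # failed programs before one succeeded
    error_tokens: int = 0        # tokens spent on those failed programs
    error_time_s: float = 0.0    # wall-clock time spent on those failures

    def actions_per_game_second(self, ticks_per_second: float = 60.0) -> float:
        """Derived stat: actions per in-game second (assumes 60 ticks/s by default)."""
        if self.game_ticks == 0:
            return 0.0
        return self.actions / (self.game_ticks / ticks_per_second)


@contextmanager
def timed(metrics: StepMetrics, field_name: str):
    """Add the wall-clock duration of the enclosed block to the named field."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        setattr(metrics, field_name, getattr(metrics, field_name) + elapsed)


# Usage: time a stand-in for an API call, then serialize the record as metadata.
m = StepMetrics(program_tokens=120, game_ticks=600, actions=5)
with timed(m, "api_latency_s"):
    time.sleep(0.01)  # placeholder for a real model API call
record = asdict(m)    # plain dict, ready to be logged alongside the step
```

A flat dataclass keeps the record easy to dump as JSON next to each step, and the context manager accumulates time, so several API calls or retries within one step add up in the same field.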
Some task performance metrics beyond the holistic production score would be good too. I think logging energy production and per-material generation would be useful.
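As a rough sketch of the per-material logging idea, assuming we can take snapshots of cumulative production counts keyed by item name before and after a step (how those snapshots are obtained from the game is out of scope here, and `production_delta` and the example keys are hypothetical):

```python
from __future__ import annotations


def production_delta(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Per-item change in cumulative production between two snapshots.

    Items missing from a snapshot are treated as 0; unchanged items are omitted.
    """
    delta = {}
    for item in sorted(set(before) | set(after)):
        change = after.get(item, 0) - before.get(item, 0)
        if change != 0:
            delta[item] = change
    return delta


# Illustrative values only, not real game data.
before = {"iron-plate": 10, "coal": 50}
after = {"iron-plate": 25, "coal": 50, "copper-plate": 4}
step_production = production_delta(before, after)
```

Energy production could be tracked the same way by adding it to the snapshot under its own key.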
See #207