
Define and record non-task performance metrics

Open kantneel opened this issue 7 months ago • 1 comments

FLE has the potential to be an exceptional environment for evaluating agent architectures. Part of this is the demanding nature of processing the Factorio game state, planning complex programs, and then executing them. For now the environment is not truly real-time, since the state is reset prior to the next step's program execution. But my expectation is that there will be interest in getting agents to play the game with a real clock that doesn't stop. This is also important for testing out approaches to multi-agent coordination in truly concurrent settings.

To build toward this, it's essential we define and record non-task performance metrics. Examples include:

  • E2E latency for API calls (consider vision inputs as well)
  • Time spent reasoning
  • Time spent waiting on API retries
  • Token length of generated program
  • In-game ticks associated with program execution (pretty sure this is implemented, but we should record it as metadata)
  • Actions per unit time (derived from the above stats)
  • Attempts, time, and tokens spent on backtracking due to errors in generated programs
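
To make the list above concrete, here is a rough sketch of what a per-step metrics record could look like. Everything in it is an assumption for illustration: `StepMetrics`, `timed`, `summarize`, and all field names are hypothetical and not part of FLE's current API.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class StepMetrics:
    """Hypothetical per-step record; field names are illustrative only."""
    wall_clock_s: float = 0.0   # total wall-clock time for the step
    api_latency_s: float = 0.0  # end-to-end latency of model API calls
    reasoning_s: float = 0.0    # time the model spent reasoning
    retry_wait_s: float = 0.0   # time spent waiting on API retries
    program_tokens: int = 0     # token length of the generated program
    game_ticks: int = 0         # in-game ticks consumed by execution
    failed_attempts: int = 0    # programs that errored and needed a retry


@contextmanager
def timed(metrics, field_name):
    """Add the elapsed wall-clock time of the block to one float field."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        setattr(metrics, field_name, getattr(metrics, field_name) + elapsed)


def summarize(steps):
    """Aggregate per-step records into the derived stats from the list."""
    total_time = sum(s.wall_clock_s for s in steps)
    return {
        "steps": len(steps),
        "total_wall_clock_s": total_time,
        "total_game_ticks": sum(s.game_ticks for s in steps),
        "total_program_tokens": sum(s.program_tokens for s in steps),
        "total_failed_attempts": sum(s.failed_attempts for s in steps),
        # Actions per unit time, treating one executed program as one action.
        "actions_per_second": len(steps) / total_time if total_time > 0 else 0.0,
    }
```

A step would wrap each phase in `with timed(metrics, "api_latency_s"): ...` and the records could be dumped alongside the existing run metadata. Treating one program as one action is a simplification; counting individual tool calls inside the program may be more useful.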

Some task performance metrics beyond the holistic production score could be good too. I think logging energy production and all material generation could be useful.
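
As a sketch of the material-generation logging, assuming we can get cumulative per-item production counts as a plain dict before and after a step (the `production_delta` helper and the snapshot format are hypothetical, not an existing FLE interface):

```python
def production_delta(before, after):
    """Return per-item change in cumulative production between two snapshots.

    Both arguments are assumed to map item name -> cumulative count.
    Items whose count did not change are omitted.
    """
    items = set(before) | set(after)
    delta = {}
    for item in items:
        change = after.get(item, 0) - before.get(item, 0)
        if change != 0:
            delta[item] = change
    return delta
```

For example, `production_delta({"iron-plate": 10}, {"iron-plate": 25, "copper-plate": 5})` gives `{"iron-plate": 15, "copper-plate": 5}`. Energy production could be logged the same way if it is exposed as a cumulative counter.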

kantneel, May 15 '25 09:05

See #207

kantneel, May 25 '25 06:05