save and load from factorio
Changes to make cluster/local work with saves
Saves directories are created 1:1 with instances: for n instances, the repo root gets .fle/saves/0/ through .fle/saves/{n-1}/
The only issue right now is that the server start commands for loading a save and loading a scenario:
START_COMMAND="--start-server-load-latest"
START_COMMAND="--start-server-load-scenario ${SCENARIO}"
can't be used together for the headless server. When we load a scenario first and create a save over rcon with send_command('/save {name}'), the scenario gets embedded into the save automatically.
So we need to do a dry run with the scenario first, generate a save, and then switch to -l for subsequent restarts. This flow makes sense to me, though I’d prefer something cleaner, as I’m unsure how much implicit behavior we should assume without risking weird assumptions.
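The dry-run flow above can be sketched as a small helper that picks the start flag per saves directory. This is only an illustration: `choose_start_command` is a hypothetical name, and it assumes a save is any `.zip` file directly inside the instance's saves directory.

```python
from pathlib import Path


def choose_start_command(saves_dir: Path, scenario: str) -> str:
    """Return the server start flag for one instance's saves directory.

    Assumption: a save is any *.zip file directly inside saves_dir.
    """
    has_save = saves_dir.is_dir() and any(saves_dir.glob("*.zip"))
    if has_save:
        # A save already exists (with the scenario embedded), so resume from it.
        return "--start-server-load-latest"
    # Dry run: boot from the scenario so a save can be created over rcon.
    return f"--start-server-load-scenario {scenario}"
```

Making the choice depend only on the directory contents keeps the implicit behavior in one place: an empty directory always means "dry run with the scenario".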
@JackHopkins
I've refactored FactorioInstance.
my reasoning for the changes:
- Factorio Instance clearly owns namespace and the game state, that to me is the clear definition of its purpose.
- Moved out the implementation details of rcon via FactorioServer and the script loading via LuaScriptManager.
- Created FactorioServer to own server runtime and its restarts
- a clear distinction of responsibility lets me make a strong implementation for server restart.
All the changes other than the following are syntactical sugar, moving code around to places that seem to own the functionality better. Changes that might affect functionality:
- removing the screenshot logic (imo it's deprecated; I think it expects a client to exist, but I could be wrong)
- creating a transaction context manager (unlikely to be problematic; a context manager is safer than manually starting and ending a transaction)
- creating a single class implementation for pre and post hooks via ToolHookRegistry
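To illustrate why a context manager is safer than explicit start/end calls, here is a minimal sketch. It is not the PR's implementation: `transaction` and its `send` callback are hypothetical, and it assumes a transaction is a batch of rcon commands that should only go out if the whole block succeeds.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def transaction(send: Callable[[str], None]) -> Iterator[List[str]]:
    """Queue commands and send them only if the with-body exits cleanly."""
    queued: List[str] = []
    yield queued
    # Reached only when the with-body raised nothing. An exception is
    # re-raised at the yield above, so the queued commands are dropped.
    for command in queued:
        send(command)
```

With this shape there is no way to forget the closing call: leaving the `with` block either flushes the batch or discards it.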
right now, FLE couples agent sessions (factorio instance) to servers by loosely getting ips and tcp ports and mapping them 1:1 onto the sessions. this is an ordered mapping, so it's only loose in concept (while running experiments/episodes the agent always receives the same server before/after resets).
my factorio native save & load implementation needs a save directory bound 1:1 to a server. the save directory binds the server and couples it to the agent's currently running session in a stricter manner. if we want to use factorio native saves and loads we need to:
1.) spawn servers from saves. for fresh runs we can convert scenarios to saves as a preliminary setup step. this lets us make save based spawning the default (instead of the current scenario based spawning) rather than having two ways of spawning servers (save and scenario).
2.) attach save volumes to access different saves
this is a strict coupling between state/native-save & a server, ie .fle/saves/0 <-> factorio_0 <-> FLE:factorio instance, this means we would have to incorporate instance.py reset logic to also own the lifecycle of a saves directory as it owns the rest of the reset logic already.*
3.) couple server lifecycle with factorio instance
*letting instance reset saves creates a weird dependency between servers (factorio_0) and agent sessions (factorio instance):
- agent session -(expects)-> alive servers
- alive servers -(expects)-> saves to exist
- agent session -(mutates/resets)-> saves for the server
- saves -(dictates)-> server's restart entrypoint
we can mitigate it by letting instance own the server spawning logic and strictly coupling it with the server:
- agent session -(spawns)-> server
- agent session -(mutates/resets)-> saves
- agent session -(restarts)-> server using new save entrypoint
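The ownership chain above could look roughly like this. Everything here is a hypothetical stand-in: `SessionOwnedServer` only records what it would do instead of talking to docker, and it assumes a reset means replacing the saves directory contents with a baseline save before restarting.

```python
import shutil
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class SessionOwnedServer:
    """Hypothetical: one agent session owning one server and its saves."""

    index: int
    root: Path
    events: List[str] = field(default_factory=list)

    @property
    def container_name(self) -> str:
        return f"factorio_{self.index}"

    @property
    def saves_dir(self) -> Path:
        # Mirrors the .fle/saves/0 <-> factorio_0 binding.
        return self.root / ".fle" / "saves" / str(self.index)

    def spawn(self) -> None:
        self.saves_dir.mkdir(parents=True, exist_ok=True)
        self.events.append(f"spawn {self.container_name}")

    def reset(self, baseline_save: Path) -> None:
        # Assumption: reset = wipe saves, copy the baseline in, restart from it.
        shutil.rmtree(self.saves_dir, ignore_errors=True)
        self.saves_dir.mkdir(parents=True)
        shutil.copy2(baseline_save, self.saves_dir / baseline_save.name)
        self.events.append(
            f"restart {self.container_name} --start-server-load-latest"
        )
```

Because the session performs all three steps itself, nothing outside it has to guess whether the saves and the server are in sync.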
change-logs:
1. decoupled FactorioInstance into:
- FactorioInstance: keeps instance level logic
- AgentInstance: keeps the bits that the agent receives
This lets me decouple agent level resets and instance level resets, and expose downstream classes like GameSession and AgentSession to only the bits they would touch.
Instances own the complete, pure FLE coupled logic: the core bits of the FLE designed API and its interfaces with Factorio
2. Added Sessions: GameSession & AgentSession
Motivation:
- The factorio interface of FLE does not care about the infra & services, this session layer lets us couple db & docker sensibly with the game lifecycle.
- The RL loop is polling the gamestate, score & production flows from the namespace directly using its own implementation. This doesn't make sense: in fast mode the game keeps advancing between reads, so values polled separately can drift considerably from the ones that applied at eval time. A better approach is to take snapshots of the game state, production flows, score etc around an eval call and only supply the RL loop with the snapshotted values.
Sessions couple infra to FLE and provide a clean contract for the RL loop to use expected values from FLE
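A minimal sketch of snapshotting around an eval call, under assumed names: `Snapshot`, `eval_with_snapshots`, `read_state` and `run_eval` are all hypothetical and stand in for whatever the session layer actually exposes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Values captured at a single point in time (hypothetical fields)."""

    score: float
    production_flows: Dict[str, float]


def eval_with_snapshots(
    read_state: Callable[[], Snapshot],
    run_eval: Callable[[], Any],
) -> Tuple[Snapshot, Any, Snapshot]:
    """Capture state immediately before and after one eval call."""
    before = read_state()
    result = run_eval()
    after = read_state()
    # The RL loop gets only these captured values, never live reads.
    return before, result, after
```

The RL loop would then compute rewards from `before` and `after` rather than polling the namespace on its own schedule.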
3. Python based aiodocker for cluster management
- Needed for programmatic lifecycles of cluster in python with the save files.
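As a rough sketch of what a programmatic lifecycle with aiodocker might look like, the example below binds one saves directory into one container. The image name and the container-side saves path are parameters because the PR text does not state them, and both helper names are hypothetical. `aiodocker` is imported lazily so the config builder can be used without it installed.

```python
from pathlib import Path
from typing import Any, Dict


def server_config(image: str, saves_dir: Path, container_saves: str) -> Dict[str, Any]:
    """Build a Docker Engine API container config that mounts one saves dir."""
    return {
        "Image": image,
        # Bind mounts need an absolute host path.
        "HostConfig": {"Binds": [f"{saves_dir.absolute()}:{container_saves}"]},
    }


async def spawn_server(name: str, config: Dict[str, Any]) -> None:
    """Create (or replace) and start one container via aiodocker."""
    import aiodocker  # third-party dependency, imported only when spawning

    docker = aiodocker.Docker()
    try:
        container = await docker.containers.create_or_replace(name=name, config=config)
        await container.start()
    finally:
        await docker.close()
```

A session could call something like `await spawn_server("factorio_0", server_config(image, Path(".fle/saves/0"), container_saves))` with its own image and mount path.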
Minor changes:
- FactorioClient wrapper for rcon connection & transactions so that we can manage connection lifecycle outside instance.
- Transactions using context manager, low hanging fruit.
- updated pytest.fixtures to provide test specific instances (regular instance, unresearched instance, we can add other test specific fixtures here as needed) rather than creating instances inside tests.
- moved db, rcon and docker to fle/services rather than fle/commons, there is a chance we can remove commons entirely or keep it only for shared models
- moved all FLE game related environment modules into fle/env/game
- eg. it didn't make sense for GameState to be part of commons because it's completely interdependent with instance.py
- removed cluster directory completely, the only it.
- for the most part all sub-modules now only import from scripts in the same level or sub-modules at a lower level:
- eg. all fle/env/game scripts only import from fle/env/game or fle/env/services/rcon
- eg. all
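The import layering rule above could be checked mechanically. The sketch below is one possible way to do it, not something the PR includes: `disallowed_imports` is a hypothetical helper that parses a module's source and reports any `fle.*` import outside an allowed set of package prefixes.

```python
import ast
from typing import Iterable, List


def disallowed_imports(source: str, allowed_prefixes: Iterable[str]) -> List[str]:
    """Return imported fle.* module names that fall outside allowed_prefixes."""
    allowed = tuple(allowed_prefixes)
    found: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            # Not an import, or a relative import (same package by definition).
            continue
        for name in names:
            in_project = name == "fle" or name.startswith("fle.")
            is_allowed = any(name == p or name.startswith(p + ".") for p in allowed)
            if in_project and not is_allowed:
                found.append(name)
    return found
```

For a script under fle/env/game, the allowed prefixes would be `fle.env.game` and `fle.env.services.rcon`; third-party and stdlib imports are ignored.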
Would need to move script loading done over rcon to modular scripts embedded into the scenario's control.lua because whenever the server restarts we would need to load all the scripts again over rcon which is slow (5-10 seconds). This would add up with every restart for game reset.
I like the increased modularity and separation of concerns in the instance and session levels. I also like the additional structure with new classes like AbstractTrajectoryRunner, GameConfig, FactorioClient, ToolHookRegistry etc. These changes definitely improve readability and extensibility.
You mentioned that "there is a minor connection error when running evals, so that is still unstable." Can you clarify what the issue is, and do you think you know the fix for it? The unit tests should definitely pass before we merge this, but I also think we should have functional tests in the form of running eval trajectories to completion and getting expected performance levels from agents.