
refactor all information sources (including ipless rdma discovery)

Open Evanev7 opened this issue 5 days ago • 0 comments

Motivation

Information gathering is tightly coupled to MacMon. We should start generalizing our information sources so we can add more in the future.

Changes

Added a new system to gather any information. Currently, it is attached to the Worker - though this is mostly to keep the data processing logic simple. It could be made independent quite easily.

I also refactored topology to include different kinds of connections, since we can gather RDMA connections without having a pre-existing socket connection, and made the relevant placement updates. We should no longer need the network locations script in the app.

Other sources of information now include:

  • static node information like "model" and "chip" (macOS, "Unknown" fallback)
  • device friendly name (macOS, falls back to device hostname)
  • network interfaces + IPs (cross-platform)
  • Thunderbolt interfaces (macOS)
  • Thunderbolt connections (macOS)
  • RAM usage (cross-platform)
  • per-device configuration written to EXO_HOME/config.toml

Limitations

The current events added by the InfoGatherer are much too broad and don't follow proper Pydantic validation.

Model and Chip are not cross platform concepts.

We do not differentiate between unified and non-unified memory systems. This should be added to static information ASAP.

A lot of this data collection is based on simple timers. Watching the SC store on macOS is the correct way to gather some of this information, but requires a detour into Rust for macOS.
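
For readers unfamiliar with the timer-based approach, a rough sketch of what such a loop can look like follows. Everything here is an assumption for illustration: the `poll` helper, its parameters, and in particular the "emit only when the value changes" behavior are not taken from this change.

```python
import asyncio
from typing import Callable, TypeVar

T = TypeVar("T")


async def poll(
    gather: Callable[[], T],
    emit: Callable[[T], None],
    interval: float,
    rounds: int,
) -> None:
    """Call `gather` every `interval` seconds, `rounds` times in total.

    Hypothetical refinement: only emit when the reading differs from the
    previous one, which keeps a simple timer from flooding the event log.
    """
    last = None
    first = True
    for _ in range(rounds):
        value = gather()
        if first or value != last:
            emit(value)
            last, first = value, False
        await asyncio.sleep(interval)


# Usage: five readings, three distinct values in sequence.
readings = iter([1, 1, 2, 2, 3])
seen: list[int] = []
asyncio.run(poll(lambda: next(readings), seen.append, 0.0, 5))
```

A watcher on the SC store would replace the sleep with a change notification, which is why it is the better fit where it is available.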

Why It Works

The InfoGatherer is a generic subsystem which returns a union of metric datatypes. It writes them to an event, which is applied to state. It is currently re-spawned with the worker so each cluster receives the correct information.

As for topology, macOS identifies TB ports with a UUID in SPThunderboltDataType, and also stores remote UUIDs if it can find them. These changes read that data with system_profiler, hopefully not so often as to cause a noticeable performance impact (though this should be tuned), but frequently enough for moderate responsiveness. Because we can identify TB connections between devices without needing IPs attached to each interface, we can remove the network setup script (almost) completely.
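
A minimal sketch of extracting local/remote UUID pairs from `system_profiler SPThunderboltDataType -json` is shown below. The JSON key names (`domain_uuid_key` for a port's UUID, `_items` for attached remote devices) are assumptions about the report layout and should be checked against real output. The function names are hypothetical and this is not the parser from this change.

```python
import json
import subprocess


def read_thunderbolt_report() -> dict:
    """Run system_profiler (macOS only) and return its JSON report."""
    out = subprocess.run(
        ["system_profiler", "SPThunderboltDataType", "-json"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return json.loads(out)


def thunderbolt_links(report: dict) -> list[tuple[str, str]]:
    """Return (local_uuid, remote_uuid) pairs found in the report.

    Assumed layout: each bus entry has "domain_uuid_key", and connected
    peers appear under "_items" with their own "domain_uuid_key".
    """
    links: list[tuple[str, str]] = []
    for bus in report.get("SPThunderboltDataType", []):
        local = bus.get("domain_uuid_key")
        if local is None:
            continue
        for peer in bus.get("_items", []):
            remote = peer.get("domain_uuid_key")
            if remote is not None:
                links.append((local, remote))
    return links


# Hand-written sample in the assumed layout (not real system_profiler output).
sample = {
    "SPThunderboltDataType": [
        {"domain_uuid_key": "AAAA", "_items": [{"domain_uuid_key": "BBBB"}]},
        {"domain_uuid_key": "CCCC"},  # port with nothing attached
    ]
}
links = thunderbolt_links(sample)
```

Matching a local UUID on one node with the remote UUID reported by another is what lets a connection be identified without any IP on the interface.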

Test Plan

Manual Testing

TODO: Spawn RDMA instances without enabling DHCP on the RDMA interfaces.

Automated Testing

Updated the current master and shared tests to cover the topology refactor and new events.

Evanev7, Dec 19 '25 18:12