Split NodePerformanceProfile state storage into separate mappings

Open AlexCheema opened this issue 1 month ago • 0 comments

Motivation

The monolithic NodePerformanceProfile stored all node profile data together, but the data comes from different events with different update frequencies:

  • Identity (model_id, chip_id, friendly_name) - updated every 30s
  • Memory - updated every 0.5s
  • System (GPU, temp, power) - updated every 1s
  • Network interfaces - updated every 30s

Storing all this in a single mapping meant every update replaced the entire profile object. This refactor splits the storage to match the event structure, making updates more efficient and the code cleaner.

Dashboard responsiveness: The dashboard becomes much more responsive. It can show the device name, type, and memory as soon as those events arrive, while slower metrics such as temperature and GPU utilization follow shortly after. Previously, it had to wait for all metrics before displaying anything useful.

Prerequisite for memory bandwidth profiling: This split is also a prerequisite for adding memory bandwidth profiling, which is slow to measure and would block other metrics if bundled with them.

Changes

Python State:

  • Added NodeIdentity class to profiling.py with model_id, chip_id, friendly_name
  • Replaced node_profiles: Mapping[NodeId, NodePerformanceProfile] in state.py with four separate mappings:
    • node_identities: Mapping[NodeId, NodeIdentity]
    • node_memories: Mapping[NodeId, MemoryPerformanceProfile]
    • node_systems: Mapping[NodeId, SystemPerformanceProfile]
    • node_networks: Mapping[NodeId, list[NetworkInterfaceInfo]]
  • Rewrote apply functions in apply.py to write to their specific storage
  • Added _reconstruct_profile() helper to rebuild NodePerformanceProfile for topology updates
  • Updated api.py memory calculation to use state.node_memories directly
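As a rough illustration of the split storage, here is a minimal sketch of a state object with two of the four mappings and an apply function that touches only its own mapping. It is not the code from state.py or apply.py: the `State` shape, the `apply_node_memory` name, the `NodeId = str` alias, and the `ram_total` / `ram_available` fields are assumptions made for the example.

```python
from dataclasses import dataclass, field, replace
from typing import Mapping

NodeId = str  # assumption: the real NodeId type may differ


@dataclass(frozen=True)
class NodeIdentity:
    model_id: str
    chip_id: str
    friendly_name: str


@dataclass(frozen=True)
class MemoryPerformanceProfile:
    # Hypothetical fields, chosen only for illustration.
    ram_total: int
    ram_available: int


@dataclass(frozen=True)
class State:
    # Only two of the four mappings are shown to keep the sketch short.
    node_identities: Mapping[NodeId, NodeIdentity] = field(default_factory=dict)
    node_memories: Mapping[NodeId, MemoryPerformanceProfile] = field(default_factory=dict)


def apply_node_memory(
    state: State, node_id: NodeId, memory: MemoryPerformanceProfile
) -> State:
    """Return a new state where only node_memories has changed."""
    return replace(state, node_memories={**state.node_memories, node_id: memory})
```

With this shape, a memory event every 0.5s copies only the memory mapping; the identity mapping is carried over to the new state by reference.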

Worker Polling:

  • Changed emit_identity_metrics to start_polling_identity_metrics - now polls every 30s instead of emitting once
  • All metrics now follow the same pattern: emit immediately, then poll periodically
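The "emit immediately, then poll periodically" pattern could look roughly like the sketch below. It uses plain asyncio and is not the worker's actual code: `start_polling`, the `emit` callback, and the `stop` event are hypothetical names introduced for the example.

```python
import asyncio
from typing import Awaitable, Callable


async def start_polling(
    emit: Callable[[], Awaitable[None]],
    interval_s: float,
    stop: asyncio.Event,
) -> None:
    """Call emit() once right away, then again every interval_s seconds until stop is set."""
    while True:
        await emit()  # first iteration runs with no initial delay
        try:
            # Sleep for interval_s, but wake up early if stop is set.
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
            return
        except asyncio.TimeoutError:
            continue  # interval elapsed without a stop request; emit again
```

Under this sketch, identity metrics would run with `interval_s=30.0` and memory metrics with `interval_s=0.5`, each in its own task, so a slow metric does not delay a fast one.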

Dashboard:

  • Removed RawNodeProfile interface entirely
  • Added split state interfaces: RawNodeIdentity, RawNodeMemory, RawNodeSystem, RawNetworkInterface
  • Updated RawStateResponse to include split fields
  • Simplified transformTopology() to use split types directly with safe defaults for missing data

Why It Works

Each apply function now updates only its specific mapping, and the topology still gets a reconstructed full profile for placement logic compatibility. The dashboard gracefully handles partial data (e.g., if memory hasn't arrived yet, it defaults to 0). All metrics are emitted immediately on startup and then polled periodically.
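To make the reconstruction step concrete, the sketch below rebuilds a full profile from the split mappings and falls back to defaults when a piece has not arrived yet. The real `_reconstruct_profile()` may behave differently: the field names, the `"Unknown"` placeholder, and the zero-memory default on the Python side are assumptions that mirror the dashboard behaviour described above.

```python
from dataclasses import dataclass
from typing import Mapping

NodeId = str  # assumption: the real NodeId type may differ


@dataclass(frozen=True)
class NodeIdentity:
    model_id: str
    chip_id: str
    friendly_name: str


@dataclass(frozen=True)
class MemoryPerformanceProfile:
    # Hypothetical fields; zero stands in for "not reported yet".
    ram_total: int = 0
    ram_available: int = 0


@dataclass(frozen=True)
class NodePerformanceProfile:
    model_id: str
    chip_id: str
    friendly_name: str
    memory: MemoryPerformanceProfile


def reconstruct_profile(
    node_id: NodeId,
    identities: Mapping[NodeId, NodeIdentity],
    memories: Mapping[NodeId, MemoryPerformanceProfile],
) -> NodePerformanceProfile:
    """Combine split state into one profile, using defaults for missing parts."""
    identity = identities.get(node_id)
    memory = memories.get(node_id, MemoryPerformanceProfile())
    return NodePerformanceProfile(
        model_id=identity.model_id if identity else "Unknown",
        chip_id=identity.chip_id if identity else "Unknown",
        friendly_name=identity.friendly_name if identity else "Unknown",
        memory=memory,
    )
```

Placement logic can then keep consuming a single profile object while the underlying storage stays split.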

Test Plan

Manual Testing

Automated Testing

  • Type checker passes: uv run basedpyright - 0 errors
  • All tests pass: uv run pytest - 151 passed
  • Dashboard builds successfully: npm run build

AlexCheema • Jan 17 '26 20:01