Evaluation EVE: Automatic Partition Testing and Onboarding Control

Open rucoder opened this issue 1 month ago • 2 comments

Description

This PR implements Evaluation EVE, a system that automatically evaluates EVE-OS across multiple partitions (IMGA/IMGB/IMGC) on hardware under test. It provides core infrastructure for sequential partition testing, hardware inventory collection, and onboarding control.

Purpose

Automatically select the best kernel/firmware combination based on:

  • Primary criterion: Partition boots successfully
  • Secondary criterion: Hardware inventory completeness (devices detected)
  • Tiebreaker: Least advanced partition (IMGA < IMGB < IMGC)
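The three criteria above can be read as a simple ordering over evaluation results. The following is a minimal sketch, not the evalmgr implementation: the `PartitionResult` type, the `selectBest` function, and the use of a plain device count as the measure of inventory completeness are all assumptions made for illustration.

```go
package main

import "fmt"

// PartitionResult is a hypothetical summary of one evaluated slot.
type PartitionResult struct {
	Label       string // "IMGA", "IMGB" or "IMGC"
	Booted      bool   // primary criterion: the partition booted successfully
	DeviceCount int    // secondary criterion (assumed metric): devices detected
}

// rank encodes the tiebreaker order IMGA < IMGB < IMGC.
var rank = map[string]int{"IMGA": 0, "IMGB": 1, "IMGC": 2}

// selectBest returns the preferred partition, or false if none booted.
func selectBest(results []PartitionResult) (PartitionResult, bool) {
	var best PartitionResult
	found := false
	for _, r := range results {
		if !r.Booted {
			continue // a partition that failed to boot is never selected
		}
		better := r.DeviceCount > best.DeviceCount ||
			(r.DeviceCount == best.DeviceCount && rank[r.Label] < rank[best.Label])
		if !found || better {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	results := []PartitionResult{
		{Label: "IMGA", Booted: true, DeviceCount: 40},
		{Label: "IMGB", Booted: true, DeviceCount: 42},
		{Label: "IMGC", Booted: true, DeviceCount: 42},
	}
	if best, ok := selectBest(results); ok {
		fmt.Println("selected:", best.Label) // IMGB: ties with IMGC, but is less advanced
	}
}
```

In this sketch IMGB wins over IMGC because both report the same device count and IMGB is the less advanced slot.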

What This PR Includes

Automatic Partition Testing

  • Sequential testing of all partitions (IMGA → IMGB → IMGC)
  • Configurable stability validation (default: 5 minutes per slot)
  • Automatic detection and skipping of failed boots
  • Partition state reconciliation after crashes/watchdog reboots

Hardware Inventory Collection

  • Collects hardware data per partition: PCI devices, USB devices, kernel parameters, IOMMU groups
  • Persisted to /persist/eval/<partition>-YYYY-MM-DD-HH:MM/
  • Automatic cleanup (30-day retention)
  • Status tracking via PubSub
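As an illustration of the directory naming and the 30-day retention described above, here is a minimal sketch. The helper names (`inventoryDir`, `expired`), the use of UTC, and the choice to derive a directory's age from its name rather than from file metadata are assumptions, not necessarily what evalmgr does.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
	"time"
)

const (
	evalRoot  = "/persist/eval"
	dirLayout = "2006-01-02-15:04" // Go reference layout for YYYY-MM-DD-HH:MM
	retention = 30 * 24 * time.Hour
)

// sampleBoot is a fixed timestamp used only to make the example reproducible.
var sampleBoot = time.Date(2025, time.November, 5, 14, 11, 0, 0, time.UTC)

// inventoryDir builds the per-partition directory for one collection run.
func inventoryDir(partition string, t time.Time) string {
	return filepath.Join(evalRoot, partition+"-"+t.UTC().Format(dirLayout))
}

// expired reports whether the timestamp embedded in a directory name
// (for example "IMGA-2025-11-05-14:11") is older than the retention window.
func expired(dirName string, now time.Time) (bool, error) {
	_, stamp, ok := strings.Cut(dirName, "-")
	if !ok {
		return false, fmt.Errorf("unexpected directory name %q", dirName)
	}
	ts, err := time.Parse(dirLayout, stamp)
	if err != nil {
		return false, err
	}
	return now.Sub(ts) > retention, nil
}

func main() {
	dir := inventoryDir("IMGA", sampleBoot)
	fmt.Println(dir)

	// 31 days later the directory falls outside the retention window.
	old, err := expired(filepath.Base(dir), sampleBoot.AddDate(0, 0, 31))
	fmt.Println(old, err)
}
```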

Onboarding Control

  • Blocks device onboarding during evaluation
  • Only the final partition (after all partitions have been tested) connects to the controller
  • PubSub-based coordination between evalmgr and client agents
  • Real-time progress updates

Robust State Management

  • Persistent state survives reboots (/persist/eval/state.json)
  • Per-partition metadata (boot count, last boot time, failures)
  • Integration with zboot partition states
  • Scheduler state machine: Idle → StabilityWait → Scheduled → Finalized
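The scheduler states listed above can be sketched as a small forward-only state machine. This is an illustrative sketch only: the `SchedulerState` type, the `advance` function, and the assumption that the four listed steps are the only legal transitions are hypothetical, and the real scheduler may allow other paths.

```go
package main

import "fmt"

// SchedulerState mirrors the states named in the description.
type SchedulerState int

const (
	Idle SchedulerState = iota
	StabilityWait
	Scheduled
	Finalized
)

func (s SchedulerState) String() string {
	names := [...]string{"Idle", "StabilityWait", "Scheduled", "Finalized"}
	if int(s) < 0 || int(s) >= len(names) {
		return "Unknown"
	}
	return names[s]
}

// nextState encodes Idle → StabilityWait → Scheduled → Finalized.
// Assumption: no other transitions exist in this sketch.
var nextState = map[SchedulerState]SchedulerState{
	Idle:          StabilityWait,
	StabilityWait: Scheduled,
	Scheduled:     Finalized,
}

// advance moves to the next state, or returns an error from a terminal state.
func advance(s SchedulerState) (SchedulerState, error) {
	next, ok := nextState[s]
	if !ok {
		return s, fmt.Errorf("no transition from %s", s)
	}
	return next, nil
}

func main() {
	s := Idle
	for {
		fmt.Println(s)
		next, err := advance(s)
		if err != nil {
			break // Finalized is terminal
		}
		s = next
	}
}
```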

Architecture

New evalmgr Agent (pkg/pillar/cmd/evalmgr/, ~2,100 lines)

  • Platform detection (/etc/eve-platform contains "evaluation")
  • Partition state reconciliation on boot
  • Stability validation with configurable timers
  • Hardware inventory collection and status tracking
  • Automatic scheduling of next partition
  • Status publishing via PubSub
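The platform detection step can be sketched as follows. The function names are hypothetical, and treating a missing /etc/eve-platform file as "not an evaluation platform" is an assumption; the description only states that the file contains "evaluation" on evaluation platforms.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os"
)

const platformFile = "/etc/eve-platform"

// containsEvaluation reports whether the platform file content marks
// an evaluation platform.
func containsEvaluation(data []byte) bool {
	return bytes.Contains(data, []byte("evaluation"))
}

// isEvaluationPlatform reads the file at path and checks its content.
// Assumption: a missing file means a regular (non-evaluation) platform.
func isEvaluationPlatform(path string) (bool, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		if errors.Is(err, os.ErrNotExist) {
			return false, nil
		}
		return false, err
	}
	return containsEvaluation(data), nil
}

func main() {
	isEval, err := isEvaluationPlatform(platformFile)
	if err != nil {
		fmt.Println("platform detection failed:", err)
		return
	}
	fmt.Println("evaluation platform:", isEval)
}
```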

Integration Points

  • client agent: Gates onboarding until evaluation completes
  • diag tool: Displays evaluation status and inventory collection progress
  • zboot: IMGC partition support for evaluation platforms
  • device-steps.sh: Starts evalmgr before client agent
  • mkimage-raw-efi: Initializes evaluation partitions with correct priorities

Testing

Comprehensive test suite (1,670 lines):

  • Multi-boot evaluation flow simulation
  • Failure recovery scenarios
  • Inventory collection verification with event tracking
  • GRUB boot selection validation
  • All 13 tests passing

Commit Structure

  1. Types & Interfaces - Core data structures (298 lines)
  2. GPT Access Layer - Partition management abstraction
  3. System Reset - Reboot handling component
  4. Persistent State - State management across reboots
  5. Evaluation Agent - Main orchestration logic with platform detection
  6. Test Infrastructure - Complete test suite with inventory event verification
  7. Diagnostic Display - Status visibility
  8. Partition Initialization - EFI partition setup
  9. Hardware Inventory - Collection and persistence
  10. Inventory Status - PubSub integration

PR dependencies

https://github.com/lf-edge/eve/pull/5348 - MERGED

How to test and validate this PR

  1. Build the evaluation installer: make PLATFORM=evaluation installer-raw
  2. Install EVE and check that the diag output reports the evaluation status
  3. After evalmgr is done, check that /persist/eval contains status.json with a status entry for each of the IMGA, IMGB and IMGC partitions

Changelog notes

  • Automatic Partition Testing and Onboarding Control for Evaluation EVE

PR Backports

- 14.5-stable: No, as the feature is not available there.
- 13.4-stable: No, as the feature is not available there.

Also, please add the stable label to PRs that should be backported into any stable branch.

Checklist

  • [x] I've provided a proper description
  • [ ] I've added the proper documentation
  • [x] I've tested my PR on amd64 device
  • [ ] I've tested my PR on arm64 device
  • [x] I've written the test verification instructions
  • [x] I've set the proper labels to this PR

And last but not least:

  • [x] I've checked the boxes above, or I've provided a good reason why I didn't check them.

Please, check the boxes above after submitting the PR in interactive mode.

rucoder avatar Nov 05 '25 14:11 rucoder

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 20.39%. Comparing base (2281599) to head (81c7bfa).
:warning: Report is 61 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5351      +/-   ##
==========================================
+ Coverage   19.52%   20.39%   +0.86%     
==========================================
  Files          19       19              
  Lines        3021     2314     -707     
==========================================
- Hits          590      472     -118     
+ Misses       2310     1721     -589     
  Partials      121      121              

codecov[bot] avatar Nov 07 '25 10:11 codecov[bot]

Configurable stability validation (default: 5 minutes per slot)

The default watchdog timer for the touch files is 300 seconds, but because it takes a while for a watchdog to trigger, we actually wait twice that long before declaring an EVE update successful. The safe thing would be to wait longer here as well, unless we can quantify how long it actually takes for the watchdog to fire.

FWIW, you can manually run /opt/zededa/bin/faultinjection -H to trigger a touch file watchdog.

eriknordmark avatar Nov 08 '25 02:11 eriknordmark