Evaluation EVE: Automatic Partition Testing and Onboarding Control
Description
This PR implements Evaluation EVE, a system that automatically evaluates EVE-OS across multiple partitions (IMGA/IMGB/IMGC) on hardware under test. It provides core infrastructure for sequential partition testing, hardware inventory collection, and onboarding control.
Purpose
Automatically select the best kernel/firmware combination based on:
- Primary criterion: Partition boots successfully
- Secondary criterion: Hardware inventory completeness (devices detected)
- Tiebreaker: Least advanced partition (IMGA < IMGB < IMGC)
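The three criteria above can be read as a lexicographic comparison. The sketch below is illustrative only: `PartitionResult`, `SelectBest`, and the use of a plain device count as the "completeness" measure are assumptions made for this example, not the types or scoring used by evalmgr.

```go
package main

import "fmt"

// PartitionResult is a hypothetical summary of one tested partition.
type PartitionResult struct {
	Label       string // "IMGA", "IMGB" or "IMGC"
	Booted      bool   // primary criterion
	DeviceCount int    // secondary criterion (assumed proxy for inventory completeness)
}

// rank encodes the tiebreaker ordering IMGA < IMGB < IMGC.
var rank = map[string]int{"IMGA": 0, "IMGB": 1, "IMGC": 2}

// better reports whether a should be preferred over b.
func better(a, b PartitionResult) bool {
	if a.Booted != b.Booted {
		return a.Booted
	}
	if a.DeviceCount != b.DeviceCount {
		return a.DeviceCount > b.DeviceCount
	}
	return rank[a.Label] < rank[b.Label]
}

// SelectBest returns the preferred partition, or false if no partition booted.
func SelectBest(results []PartitionResult) (PartitionResult, bool) {
	var best PartitionResult
	found := false
	for _, r := range results {
		if !r.Booted {
			continue // a partition that failed to boot is never selected
		}
		if !found || better(r, best) {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	results := []PartitionResult{
		{Label: "IMGA", Booted: true, DeviceCount: 42},
		{Label: "IMGB", Booted: false, DeviceCount: 0},
		{Label: "IMGC", Booted: true, DeviceCount: 42},
	}
	if best, ok := SelectBest(results); ok {
		fmt.Println("selected:", best.Label)
	}
}
```

In this example IMGA and IMGC tie on device count, so the tiebreaker picks IMGA as the least advanced partition.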
What This PR Includes
Automatic Partition Testing
- Sequential testing of all partitions (IMGA → IMGB → IMGC)
- Configurable stability validation (default: 5 minutes per slot)
- Automatic detection and skipping of failed boots
- Partition state reconciliation after crashes/watchdog reboots
Hardware Inventory Collection
- Collects hardware data per partition: PCI devices, USB devices, kernel parameters, IOMMU groups
- Persisted to /persist/eval/<partition>-YYYY-MM-DD-HH:MM/
- Automatic cleanup (30-day retention)
- Status tracking via PubSub
Onboarding Control
- Blocks device onboarding during evaluation
- Only the final partition (selected after all partitions have been tested) connects to the controller
- PubSub-based coordination between evalmgr and client agents
- Real-time progress updates
Robust State Management
- Persistent state survives reboots (/persist/eval/state.json)
- Per-partition metadata (boot count, last boot time, failures)
- Integration with zboot partition states
- Scheduler state machine: Idle → StabilityWait → Scheduled → Finalized
Architecture
New evalmgr Agent (pkg/pillar/cmd/evalmgr/, ~2,100 lines)
- Platform detection (/etc/eve-platform contains "evaluation")
- Partition state reconciliation on boot
- Stability validation with configurable timers
- Hardware inventory collection and status tracking
- Automatic scheduling of next partition
- Status publishing via PubSub
Integration Points
- client agent: Gates onboarding until evaluation completes
- diag tool: Displays evaluation status and inventory collection progress
- zboot: IMGC partition support for evaluation platforms
- device-steps.sh: Starts evalmgr before client agent
- mkimage-raw-efi: Initializes evaluation partitions with correct priorities
Testing
Comprehensive test suite (1,670 lines):
- Multi-boot evaluation flow simulation
- Failure recovery scenarios
- Inventory collection verification with event tracking
- GRUB boot selection validation
- All 13 tests passing
Commit Structure
- Types & Interfaces - Core data structures (298 lines)
- GPT Access Layer - Partition management abstraction
- System Reset - Reboot handling component
- Persistent State - State management across reboots
- Evaluation Agent - Main orchestration logic with platform detection
- Test Infrastructure - Complete test suite with inventory event verification
- Diagnostic Display - Status visibility
- Partition Initialization - EFI partition setup
- Hardware Inventory - Collection and persistence
- Inventory Status - PubSub integration
PR dependencies
https://github.com/lf-edge/eve/pull/5348 - MERGED
How to test and validate this PR
- build the evaluation installer:
  make PLATFORM=evaluation installer-raw
- install EVE and observe that the diag output reports evaluation status
- check that, after evalmgr is done, /persist/eval has a status.json that contains status for all IMG[A,B,C] partitions
Changelog notes
- Automatic Partition Testing and Onboarding Control for Evaluation EVE
PR Backports
- 14.5-stable: No, as the feature is not available there.
- 13.4-stable: No, as the feature is not available there.
Checklist
- [x] I've provided a proper description
- [ ] I've added the proper documentation
- [x] I've tested my PR on amd64 device
- [ ] I've tested my PR on arm64 device
- [x] I've written the test verification instructions
- [x] I've set the proper labels to this PR
And last but not least:
- [x] I've checked the boxes above, or I've provided a good reason why I didn't check them.
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 20.39%. Comparing base (2281599) to head (81c7bfa).
:warning: Report is 61 commits behind head on master.
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5351      +/-   ##
==========================================
+ Coverage   19.52%   20.39%   +0.86%
==========================================
  Files          19       19
  Lines        3021     2314     -707
==========================================
- Hits          590      472     -118
+ Misses       2310     1721     -589
  Partials      121      121
> Configurable stability validation (default: 5 minutes per slot)
The default watchdog timer for the touch files is 300 seconds, but since it takes a while for a watchdog to trigger, we actually wait twice that long before we declare an EVE update successful. So the safe thing would be to wait longer here as well, unless we can quantify how long it actually takes for the watchdog to fire.
FWIW you can manually run /opt/zededa/bin/faultinjection -H to cause a touch file watchdog.