eve-k: Enable ext4 vault support

Open andrewd-zededa opened this issue 2 months ago • 4 comments

Description

  • New location for kube service container /var/lib/ items: a bind mount in /persist/vault/kube.
  • Allows running with no ZFS in the I/O path.
  • The installer default changes from zfs to ext4, which follows the default behavior in HV=kvm eve.
  • A single cluster can contain nodes with mixed persist types.
  • Installs with multiple eve_persist_disk disks will default to zfs persist.
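
The selection rules above can be sketched as a small shell helper. This is only an illustration, not the installer code from this PR: the function name `choose_persist_type` is hypothetical, and it assumes that `eve_persist_disk` takes a comma-separated disk list and that any `eve_install_zfs_with_raid_level` setting selects zfs.

```shell
#!/bin/sh
# Hypothetical helper: prints "zfs" or "ext4" for a kernel command line string.
# Assumptions (not confirmed by this PR): eve_persist_disk is a comma-separated
# list, and the presence of eve_install_zfs_with_raid_level selects zfs.
choose_persist_type() {
    cmdline="$1"
    disks=""
    for arg in $cmdline; do
        case "$arg" in
            eve_install_zfs_with_raid_level|eve_install_zfs_with_raid_level=*)
                # Explicit zfs request wins regardless of disk count.
                echo zfs
                return 0
                ;;
            eve_persist_disk=*)
                disks="${arg#eve_persist_disk=}"
                ;;
        esac
    done
    # More than one persist disk (a comma in the list) defaults to zfs.
    case "$disks" in
        *,*) echo zfs ;;
        *)   echo ext4 ;;
    esac
}

# Example calls (not executed here):
#   choose_persist_type "eve_install_disk=sda"
#   choose_persist_type "eve_persist_disk=sdb,sdc"
#   choose_persist_type "eve_install_zfs_with_raid_level=none"
```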

EXT4 Persist Testing Completed:

  • USB install and BaseOS updates on amd64 systems.
  • Cluster create
  • VM App Instances deployed
  • VM App Failover between nodes
  • make HV=k live run-live with ZARCH=amd64 and ZARCH=arm64

ZFS Persist Testing Completed:

  • Regression testing of USB install of existing ZFS persist type with eve_install_zfs_with_raid_level=none

Mixed Testing:

  • Cluster create of two ext4-persist nodes and one zfs-persist node, with three VMs deployed and distributed evenly across the nodes.

PR dependencies

None

How to test and validate this PR

  • Install HV=k eve without grub option "eve_install_zfs_with_raid_level"
  • eve enter kube
  • Verify /var/lib/all_components_initialized file is present.
  • Verify node is present in "kubectl get node"
  • Deploy a VM app instance to the edge node and verify the app instance reaches the running state.
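
The file and node checks in the steps above could be scripted roughly as follows, run inside the kube container after `eve enter kube`. This is a sketch under assumptions: the function names are hypothetical, and the node name to look for is left as a placeholder because the PR does not say how nodes are named.

```shell
#!/bin/sh
# Hypothetical helpers for the validation steps; not part of this PR.

# Succeeds if the initialization marker file exists.
# The path argument is optional and defaults to the file named in the steps.
components_initialized() {
    marker="${1:-/var/lib/all_components_initialized}"
    [ -f "$marker" ]
}

# Succeeds if the given node name appears in `kubectl get node -o name`
# output (one "node/<name>" entry per line), read from stdin.
node_listed() {
    grep -qxF "node/$1"
}

# Example usage on a device (not executed here; <node-name> is a placeholder):
#   components_initialized && echo "kube components initialized"
#   kubectl get node -o name | node_listed "<node-name>" && echo "node present"
```

The VM app instance step is left out because it is driven from the controller rather than from the node.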

Changelog notes

Enable eve-k on ext4 persist.

PR Backports

  • 16.0-stable: Yes, To be backported.
  • 14.5-stable: No, as the feature is not available there.
  • 13.4-stable: No, as the feature is not available there.

Checklist

  • [x] I've provided a proper description
  • [x] I've added the proper documentation
  • [x] I've tested my PR on amd64 device
  • [x] I've tested my PR on arm64 device
  • [x] I've written the test verification instructions
  • [ ] I've set the proper labels to this PR

And last but not least:

  • [ ] I've checked the boxes above, or I've provided a good reason why I didn't check them.

Please, check the boxes above after submitting the PR in interactive mode.

andrewd-zededa avatar Nov 10 '25 15:11 andrewd-zededa

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 28.08%. Comparing base (2281599) to head (252adfa). :warning: Report is 168 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5373      +/-   ##
==========================================
+ Coverage   19.52%   28.08%   +8.55%     
==========================================
  Files          19       19              
  Lines        3021     2314     -707     
==========================================
+ Hits          590      650      +60     
+ Misses       2310     1520     -790     
- Partials      121      144      +23     

codecov[bot] avatar Nov 10 '25 16:11 codecov[bot]

I'm not saying this needs to be supported, but can there be a kubernetes cluster where some nodes use longhorn on top of ZFS and others use longhorn on top of ext4? I'm hoping longhorn doesn't need to know what filesystem is underneath, so I'm asking to understand any architectural dependencies.

Yeah that should work without changes. I actually have that running locally where two nodes are ext4 and one is zfs, a VM app instance deployed to each node.

andrewd-zededa avatar Dec 12 '25 17:12 andrewd-zededa

Updated docs to follow multi-persist disk -> zfs handling on eve-k.

Looks like the go tests are failing in DPC; this should be unrelated:

time="2025-12-13T00:04:40Z" level=trace msg="dump(test/DeviceNetworkStatus) after Publish\n" pid=1234 source=test
time="2025-12-13T00:04:40Z" level=trace msg="\tkey global" pid=1234 source=test
time="2025-12-13T00:04:40Z" level=trace msg="\trestarted 0" pid=1234 source=test
    dpcmanager_test.go:967: 
        Expected
            <types.DPCState>: 0
        to equal
            <types.DPCState>: 3

DONE 469 tests, 12 skipped, 1 failure in 323.284s
make[1]: *** [Makefile:99: test] Error 1
make[1]: Leaving directory '/opt/actions-runner/_work/eve/eve/pkg/pillar'
make: *** [Makefile:534: test] Error 2

andrewd-zededa avatar Dec 15 '25 14:12 andrewd-zededa

If you rebase on master I think you'll get the fix for the go tests failure.

Did you test with a system deployed with current eve-k being updated to this PR? In that case I assume we don't change the filesystem from ZFS to ext4 but the management (and creation of future volumes) needs to take into account that we already have ZFS in place.

The alternative would be to declare the 16.0.0-lts-k-* as dead and create a 16.0.1-lts-k-* with this change, so that nobody installs 16.0.0-lts-k-* in production with an expectation to upgrade. Hmm - maybe this is not an issue since the user can select ZFS or ext4 at install time??

eriknordmark avatar Dec 15 '25 16:12 eriknordmark

Upgrade should be ok since this doesn't change any paths associated with the zfs persist option or user volume instance paths and this will still detect and support zfs persist. I will do an upgrade test to confirm.
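
As a rough illustration of checking which persist type an already-deployed node is using before and after an upgrade, the filesystem type of the /persist mount can be mapped to a persist type. This is a sketch only, not how eve detects the persist type: the function name is hypothetical, and it assumes /persist shows up with filesystem type `zfs` or `ext4` in `findmnt` output.

```shell
#!/bin/sh
# Hypothetical helper: map a filesystem type string to a persist type.
persist_type_from_fstype() {
    case "$1" in
        zfs)  echo zfs ;;
        ext4) echo ext4 ;;
        *)    echo unknown ;;
    esac
}

# Example usage on a device (not executed here):
#   persist_type_from_fstype "$(findmnt -n -o FSTYPE --target /persist)"
```

Running this before and after the upgrade should print the same value if the upgrade leaves the existing filesystem in place.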

andrewd-zededa avatar Dec 16 '25 22:12 andrewd-zededa

@andrewd-zededa let me know when you've tested this (with an already deployed app instance running) and if it keeps running I'll merge the PR.

eriknordmark avatar Dec 18 '25 19:12 eriknordmark

@eriknordmark Rebased off master, and completed upgrade tests of a zfs persist cluster back and forth between 16.0.0-rc6-k-amd64 and 0.0.0-eve-k-ext4-vault-f0966751-k-amd64

Looks like CodeQL failure is unrelated:

Uploading code scanning results
  Uploading results
  Warning: Connect Timeout Error
  Error: Connect Timeout Error
  Warning: An unexpected error occurred when sending a status report: Connect Timeout Error

andrewd-zededa avatar Dec 19 '25 22:12 andrewd-zededa