KVM memballooning requires free page reporting and autodeflate
Description
As per #11930, KVM's memory ballooning does not auto-inflate and auto-deflate without the freePageReporting and autodeflate attributes on the memballoon configuration.
This change is similar to OpenStack Nova's: https://github.com/openstack/nova/commit/cd401c5c1b5dcba739d69875795828e1be1d726b
In short, if memballooning is enabled for KVM, these features are now always enabled as well. Otherwise, having memory ballooning on in CloudStack does nothing, as there is no tooling to inflate or deflate the balloon.
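With memballooning enabled, the generated libvirt domain XML then carries both attributes on the memballoon device, along these lines (the stats period shown is illustrative and follows the configured stats interval):
<memballoon model='virtio' autodeflate='on' freePageReporting='on'>
  <stats period='10'/>
</memballoon>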
Types of changes
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] Enhancement (improves an existing feature and functionality)
- [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
- [ ] Build/CI
- [ ] Test (unit or integration test code)
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
- [ ] Major
- [x] Minor
Bug Severity
- [ ] BLOCKER
- [ ] Critical
- [x] Major
- [ ] Minor
- [ ] Trivial
Screenshots (if appropriate):
How Has This Been Tested?
Not yet tested; hoping CI/CD runs through some self-tests, after which I'll try a test deployment.
How did you try to break this feature and the system with this change?
N/A
@blueorangutan package
@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Codecov Report
:x: Patch coverage is 75.00000% with 1 line in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 17.56%. Comparing base (8089d32) to head (f43cdf9).
:warning: Report is 207 commits behind head on 4.22.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...om/cloud/hypervisor/kvm/resource/LibvirtVMDef.java | 75.00% | 0 Missing and 1 partial :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## 4.22 #11932 +/- ##
============================================
+ Coverage 17.36% 17.56% +0.20%
- Complexity 15245 15541 +296
============================================
Files 5888 5909 +21
Lines 525831 529061 +3230
Branches 64183 64618 +435
============================================
+ Hits 91298 92924 +1626
- Misses 424227 425683 +1456
- Partials 10306 10454 +148
| Flag | Coverage Δ | |
|---|---|---|
| uitests | 3.58% <ø> (-0.05%) | :arrow_down: |
| unittests | 18.63% <75.00%> (+0.22%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
Pretty sure the simulator failure is unrelated to my changes.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 15585
@blueorangutan test keepEnv
@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian test result (tid-14747) Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8 Total time taken: 65004 seconds Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr11932-t14747-kvm-ol8.zip Smoke tests completed. 148 look OK, 0 have errors, 1 did not run Only failed and skipped tests results shown below:
| Test | Result | Time (s) | Test File |
|---|---|---|---|
| all_test_human_readable_logs | Skipped | --- | test_human_readable_logs.py |
@weizhouapache looks like the smoke test passed
good @bradh352, now let's wait for someone to test it
@weizhouapache is there a package artifact generated by this I can download and use to test? I'm still getting familiar with cloudstack, I'd rather not figure out how to build .deb packages for it if I don't have to :)
We do not provide the packages publicly, as the cloudstack-management-xx.deb package is huge.
@bradh352, I personally use this DEB builder: https://github.com/scclouds/cloudstack-deb-builder.
By default, it does not populate the cloudstack-management DEB with the system VM template and, therefore, the resulting package is more lightweight. However, the ACS environment must have the correct system VM template configured before installing the PR's packages.
To build the packages, go to the Apache CloudStack source code root directory and execute:
docker run -v <code-path>:/mnt/build/cloudstack -v ~/.m2:/root/.m2 -e "USER_ID=$(id -u)" -e "USER_GID=$(id -g)" -e "ACS_BUILD_OPTS=-T <number-of-threads> -DskipTests" scclouds/cloudstack-deb-builder:ubuntu2004-jdk11-python3
Where:
- <code-path> is the path of the Apache CloudStack source code directory; and
- <number-of-threads> is the number of threads that will be used to build the artifacts.
For example:
docker run -v ~/code/cloudstack:/mnt/build/cloudstack -v ~/.m2:/root/.m2 -e "USER_ID=$(id -u)" -e "USER_GID=$(id -g)" -e "ACS_BUILD_OPTS=-T 4 -DskipTests" scclouds/cloudstack-deb-builder:ubuntu2004-jdk11-python3
The first build may take a little longer due to the download of some dependencies. After it finishes, the DEB packages will be located inside the Apache CloudStack source code directory, at the /dist/debbuild/DEBS path.
Hope it helps!
Thanks, I'm going to test this out tomorrow!
@bernardodemarco I did get .deb files generated using your method above, e.g. ./dist/debbuild/DEBS/cloudstack-agent_4.22.1.0-SNAPSHOT~focal_all.deb. But given that the image is Ubuntu 20.04, is there any problem trying to actually install this on Ubuntu 24.04? I didn't see a docker template for Ubuntu 24.04 available.
Ok, I have a custom built 4.22 with this PR installed in my test environment.
virsh dumpxml of a running vm shows:
<memballoon model='virtio' autodeflate='on' freePageReporting='on'>
<stats period='10'/>
<alias name='balloon0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
</memballoon>
And in turn the generated qemu command line has this:
-device {"driver":"virtio-balloon-pci","id":"balloon0","deflate-on-oom":true,"free-page-reporting":true,"bus":"pci.0","addr":"0x6"}
VMs start up just fine. I'm now going to try allocating a bunch of memory within a series of VMs, releasing it, and checking whether the host is able to reclaim the memory.
Well, honestly the first test was enough to convince me it is working well:
Fresh VM (within VM):
root@test1:~# free
total used free shared buff/cache available
Mem: 16369284 617328 15567232 11004 474740 15751956
Swap: 0 0 0
In Hypervisor:
root@node1:~# ps -o pid,rss,vsz,command -p 1417359
PID RSS VSZ COMMAND
1417359 1408680 19634724 /usr/bin/qemu-system-x86_64 -name guest=i-2-304-VM,...
In VM, allocate a tmpfs mount and a large file:
root@test1:~# mount -t tmpfs tmpfs /mnt
root@test1:~# dd if=/dev/zero of=/mnt/bigfile bs=1M count=10000
dd: error writing '/mnt/bigfile': No space left on device
7993+0 records in
7992+0 records out
8381071360 bytes (8.4 GB, 7.8 GiB) copied, 3.95783 s, 2.1 GB/s
root@test1:~# free
total used free shared buff/cache available
Mem: 16369284 8805920 7369168 8195644 8678156 7563364
Swap: 0 0 0
In Hypervisor:
root@node1:~# ps -o pid,rss,vsz,command -p 1417359
PID RSS VSZ COMMAND
1417359 9782676 19615192 /usr/bin/qemu-system-x86_64 -name guest=i-2-304-VM,...
Cleanup:
root@test1:~# rm -f /mnt/bigfile && umount /mnt
root@test1:~# free
total used free shared buff/cache available
Mem: 16369284 599236 15467600 11004 594464 15770048
Swap: 0 0 0
In Hypervisor:
root@node1:~# ps -o pid,rss,vsz,command -p 1417359
PID RSS VSZ COMMAND
1417359 1523988 19610052 /usr/bin/qemu-system-x86_64 -name guest=i-2-304-VM,...
@blueorangutan package
@rajujith a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 15913
@blueorangutan test
@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian Build Failed (tid-14911)
[SF] Trillian test result (tid-14914) Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8 Total time taken: 56092 seconds Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr11932-t14914-kvm-ol8.zip Smoke tests completed. 145 look OK, 4 have errors, 0 did not run Only failed and skipped tests results shown below:
| Test | Result | Time (s) | Test File |
|---|---|---|---|
| test_uservm_host_control_state | Failure | 16.96 | test_host_control_state.py |
| ContextSuite context=TestHostControlState>:teardown | Error | 30.42 | test_host_control_state.py |
| test_01_vpn_usage | Error | 1.09 | test_usage.py |
| test_01_migrate_VM_and_root_volume | Error | 86.57 | test_vm_life_cycle.py |
| test_02_migrate_VM_with_two_data_disks | Error | 28.54 | test_vm_life_cycle.py |
| test_01_secure_vm_migration | Error | 179.26 | test_vm_life_cycle.py |
| test_01_secure_vm_migration | Error | 179.27 | test_vm_life_cycle.py |
| test_hostha_enable_ha_when_host_disabled | Error | 1.66 | test_hostha_kvm.py |
| test_hostha_enable_ha_when_host_in_maintenance | Error | 303.85 | test_hostha_kvm.py |
Tested in a lab env by executing echo {1..10000000} in the VM and monitoring the host with watch free. I can see the host memory shoot up, and when the shell is aborted due to stack overflow, the host memory goes back to its original state. I am not sure whether my test is adequate, though. cc @bradh352 @weizhouapache @rajujith @bernardodemarco @NuxRo
I lean towards merging.
We are running this in production now on a heavily over-provisioned private cloud used for development VMs, where a developer may allocate 64G or more of RAM but only use it in "bursts" while compiling code.
We also set up an hourly cron job in the dev VM template that our developers use, which does:
echo 1 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
(as otherwise the buffers/cache in the VM will hold a lot of memory that can't be released).
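For anyone wanting to replicate that, a minimal sketch of the cron script (the path and file name are illustrative; any hourly cron hook works):
#!/bin/sh
# /etc/cron.hourly/balloon-reclaim (name illustrative)
# Drop the guest page cache and compact memory so that free page
# reporting can hand the freed pages back to the hypervisor.
echo 1 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory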
And the results have been nothing short of amazing. We went from being maxed out on memory usage on each host to only utilizing 25%.
This is an unusual use case, but has been phenomenal for us.
@bernardodemarco , any issues that would stop us from merging?
side-effect: these will be set by default, hope no regressions.
yes, but only if memballoon is enabled on the host, right?
Correct, if memballoon is enabled for the KVM host, this is enabled automatically. There is no impact if memballoon is disabled.
As far as I can tell, there was no actual KVM memballoon integration within CloudStack that would do anything useful; adding these flags makes it actually do something useful.
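For anyone checking their setup: to the best of my understanding, whether the memballoon device is defined at all is controlled from the KVM agent's agent.properties (the values shown here are illustrative; please verify against your version):
# agent.properties on the KVM host
# The memballoon device is only defined when this is false:
vm.memballoon.disable=false
# Interval (seconds) at which balloon memory stats are polled:
vm.memballoon.stats.period=10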
I remember some users get the real memory usage by enabling memory ballooning; maybe they can test this. cc @JoaoJandre @bernardodemarco