cloud-provider-openstack
tests: Bump DevStack to Dalmatian (2024.2)
What this PR does / why we need it:
Bump the version of DevStack used in CI from Bobcat (2023.2), which is now EOL, to Dalmatian (2024.2). A future change will bump this further to Epoxy (2025.1).
Which issue this PR fixes (if applicable):
(none)
Special notes for reviewers:
(none)
Release note:
NONE
/hold
This is the second attempt, after the first was reverted (#2730). I need to see how this performs; FWIW, I saw no performance issues locally.
@stephenfin see #2730
I wonder if https://github.com/kubernetes/cloud-provider-openstack/pull/2747 would help.
/retest
/test openstack-cloud-csi-manila-e2e-test
Previously, the manila tests took 49m29s, while the cinder tests took 1h50m18s and failed due to a timeout.
/test openstack-cloud-csi-manila-e2e-test
@EmilienM looks like #2747 doesn't help.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Error due to missing zpool module param:
+ lib/host:configure_zswap:45 : sudo tee /sys/module/zswap/parameters/zpool
z3fold
tee: /sys/module/zswap/parameters/zpool: No such file or directory
However, once again we appear to have ended up with a Jammy image despite requesting Noble :confused: Investigating.
@stephenfin thanks for picking this up. FWIW:
https://review.opendev.org/c/openstack/devstack/+/942755
also, expect to see some failures because of:
https://github.com/kubernetes/cloud-provider-openstack/issues/2884
Thanks. It might make sense to stick with 2024.2, fix that, and then bump to 2025.2. Will think on it :thinking:
Okay, we're finally seeing an ubuntu-24.04 image :pray: From the logs of one of the jobs:
23.48 - Thu, 08 May 2025 18:20:39 +0000 - v. 24.4.1-0ubuntu0~24.04.3
> also, expect to see some failures because of: #2884
>
> Thanks. It might make sense to stick with 2024.2, fix that, and then bump to 2025.2. Will think on it 🤔
I've done this.
Turns out we were never running against Ubuntu 24.04. While Boskos reaps networks, instances, disks, etc., it doesn't reap images. We've likely been using the same (Ubuntu 22.04) image for who knows how long at this point :sweat_smile:
https://github.com/kubernetes-sigs/boskos/blob/5993cef5a1c719c33c0936d416b7d935058e1204/cmd/janitor/gcp_janitor.py#L38
Investigating the performance degradation by comparing two recent builds: the last passing one and this failing one.
DevStack is about 40% slower to deploy, at 652 seconds (10m52s) versus 467 seconds (7m47s), but that difference is so small and so variable (judging by other failures in between) as to be irrelevant. It looks like it's the tests themselves that take longer. I'm going to rework things so we actually get a response back from ginkgo if the test run fails.
Looks like there are some very significant changes in runtime for tests across the board. Now to figure out why. I've been using the below script to compare results from JUnit files (specifically, the JUnit files from the last success and the most recent failure). The result can be seen in results.csv.
#!/usr/bin/env python3
import csv

from lxml import etree


def diff(before: str, after: str):
    with open(before) as fh:
        passing = etree.parse(fh)

    with open(after) as fh:
        failing = etree.parse(fh)

    passing_results = {}
    results_diff = {}

    # index the passing run by test name
    for testcase in passing.findall('.//testcase'):
        passing_results[testcase.get('name')] = (
            testcase.get('status'), testcase.get('time')
        )

    # collect every test from the failing run whose status changed or that
    # actually ran (i.e. ignore tests that were skipped in both runs)
    for testcase in failing.findall('.//testcase'):
        name = testcase.get('name')
        if name not in passing_results:
            raise Exception('tests missing from runs: this should not happen')

        if (
            testcase.get('status') != passing_results[name][0] or
            testcase.get('status') != 'skipped'
        ):
            results_diff[name] = {
                'before': passing_results[name],
                'after': (testcase.get('status'), testcase.get('time')),
            }

    with open('results.csv', 'w', newline='') as fh:
        writer = csv.writer(fh)
        for name, result in results_diff.items():
            if name in {
                '[ReportBeforeSuite]',
                '[SynchronizedBeforeSuite]',
                '[SynchronizedAfterSuite]',
                '[ReportAfterSuite] Kubernetes e2e suite report',
            }:
                continue

            if result['before'][0] != result['after'][0]:
                # we might want to look at this later
                continue

            before_sec = float(result['before'][1])
            after_sec = float(result['after'][1])
            diff_sec = ((after_sec - before_sec) / before_sec) * 100

            print(f'{name}')
            print(f'\tbefore: {before_sec:0.2f} seconds')
            print(f'\tafter: {after_sec:0.2f} seconds')
            print(f'\tchange: {diff_sec:0.2f}%')

            writer.writerow([name, before_sec, after_sec, diff_sec])


def main():
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        'before',
        help='Before result (passing)',
    )
    parser.add_argument(
        'after',
        help='After result (failing)',
    )
    args = parser.parse_args()

    diff(args.before, args.after)


if __name__ == '__main__':
    main()
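(Usage, assuming the script is saved as e.g. compare-junit.py, a name I've made up here: ./compare-junit.py junit_passing.xml junit_failing.xml. It prints the per-test change to stdout and writes results.csv alongside it.)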
Let's see if we get the same performance issues in Caracal, since that cuts our diff in half. Proposed in #2888.
Tangentially, since we have limited resources, I think in this repo we should only test with SLURP releases, i.e. 2024.1 and 2025.1 are more appropriate/relevant than the .2 releases due to their popularity. We could override this to a .2 release for individual test jobs if necessary.
> Tangentially, since we have limited resources, I think in this repo we should only test with SLURP releases, i.e. 2024.1 and 2025.1 are more appropriate/relevant than the .2 releases due to their popularity. We could override this to a .2 release for individual test jobs if necessary.
I agree.
Further notes to self from my debugging, dumped here in case they're useful to anyone else. I still haven't gotten to the bottom of this.
I've deployed two VMs, both running Ubuntu 22.04 with the standard ubuntu user. I've then run the following command from the tip of the current master to deploy Bobcat on the first:
❯ ansible-playbook -v --user ubuntu --inventory <IP address>, --ssh-common-args "-o StrictHostKeyChecking=no" tests/playbooks/test-csi-cinder-e2e.yaml
This is the same thing we do in CI. Because I'm hitting Docker rate limits, I only run the first 3 roles (install-golang, install-devstack, install-docker) and comment out the rest. I then log in as root and both (a) log in to docker.io (docker login) and (b) clone this repo to /root/src/k8s.io/cloud-provider-openstack. After this, I run the remaining roles, commenting out the "Run functional tests for csi-cinder-plugin" step from tests/playbooks/roles/install-csi-cinder/tasks/main.yaml, since I don't want to run these during "installation".
I then repeat this on the second node, but with the following diff:
diff --git tests/playbooks/roles/install-devstack/defaults/main.yaml tests/playbooks/roles/install-devstack/defaults/main.yaml
index 8a2839dd9..950a17573 100644
--- tests/playbooks/roles/install-devstack/defaults/main.yaml
+++ tests/playbooks/roles/install-devstack/defaults/main.yaml
@@ -1,7 +1,7 @@
---
user: "stack"
workdir: "/home/{{ user }}/devstack"
-branch: "2023.2-eol"
+branch: "stable/2024.1"
enable_services:
- nova
- glance
diff --git tests/playbooks/roles/install-devstack/templates/local.conf.j2 tests/playbooks/roles/install-devstack/templates/local.conf.j2
index 3ec7710a9..c896a73b2 100644
--- tests/playbooks/roles/install-devstack/templates/local.conf.j2
+++ tests/playbooks/roles/install-devstack/templates/local.conf.j2
@@ -39,7 +39,7 @@ ENABLE_SYSCTL_NET_TUNING=true
# increase in swap performance by reducing the amount of data
# written to disk. the overall speedup is porportional to the
# compression ratio and the speed of the swap device.
-ENABLE_ZSWAP=true
+ENABLE_ZSWAP=false
{% if "nova" in enable_services %}
# Nova
(I really should have done the latter change across both branches, but I forgot. I'll do it if I need to redeploy. I've proposed https://review.opendev.org/c/openstack/devstack/+/955670 through https://review.opendev.org/c/openstack/devstack/+/955672 to avoid the need to do this in the future).
Finally, I run a single test across both, since it's a reliable reproducer of the test performance issues, using the below script (run as root):
export GOPATH=/root
export PATH=/usr/local/go/bin:/root/bin:$PATH
export KUBECONFIG=/root/.kube/config
pushd /root/src/k8s.io/cloud-provider-openstack
/tmp/kubernetes/test/bin/e2e.test \
-storage.testdriver=tests/e2e/csi/cinder/test-driver.yaml \
--ginkgo.focus='External Storage \[Driver: cinder\.csi\.openstack\.org\] \[Testpattern: Dynamic PV \(block volmode\)\] volumeMode should fail to use a volume in a pod with mismatched mode \[Slow\]' \
--ginkgo.no-color \
--ginkgo.v \
--ginkgo.timeout=24h \
-test.timeout=0
This is currently giving me different results on bobcat and dalmatian:
# bobcat
Ran 1 of 6920 Specs in 27.589 seconds
# dalmatian
Ran 1 of 6920 Specs in 71.706 seconds
@stephenfin so what is the main reason for the test speed degradation? ZSWAP?
No, I still don't know; I'm still debugging it. So far, I have bumped all the services (for each service: switch to the stable/2024.1 branch, run the db sync commands, and restart) and compared all the configuration files (nothing out of the ordinary). I'm now bumping dependencies.
Still no luck. I've (a) bumped all OpenStack services, (b) compared configs for differences, and (c) bumped all other Python dependencies in the global venv. No dice: the performance is still good on the Bobcat VM and poor on the Dalmatian VM. I've shared my changes in the pastebin below in case they are useful to anyone. I'll keep investigating tomorrow.
https://paste.opendev.org/show/bcnPKJz1oFhgNlBVYvPo/
Still no luck, but I have managed to take k3s and CPO out of the loop and can reliably reproduce the issue using a simple script.
https://gist.github.com/stephenfin/0c0437dc6f74c4a2c0baef86bc591678
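For anyone who doesn't want to open the gist, the script boils down to something like the following sketch (a simplified illustration using openstacksdk's cloud layer and the pre-created test-server described later; the gist itself is the source of truth and may differ in detail):

#!/usr/bin/env python3
# Simplified illustration of create-delete-volume.py: time a single
# create -> attach -> detach -> delete cycle against a pre-created server.
# This is a sketch, not the gist itself.
import os
import time

import openstack


def main():
    conn = openstack.connect(cloud=os.environ['OS_CLOUD'])
    server = conn.get_server('test-server')  # pre-created instance

    start = time.monotonic()

    # create a small volume, attach it to the server, then detach and
    # delete it again, waiting for each operation to complete
    volume = conn.create_volume(size=1, wait=True)
    conn.attach_volume(server, volume, wait=True)
    conn.detach_volume(server, volume, wait=True)
    conn.delete_volume(volume.id, wait=True)

    print(f'Script executed in {time.monotonic() - start:.2f} seconds.')


if __name__ == '__main__':
    main()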
For some reason, nova is not receiving the event from libvirt and is timing out, as seen in the WARNING log below.
# bobcat
Jul 24 15:44:11 stephenfin-cpo-debug-old nova-compute[220178]: INFO nova.virt.block_device [None req-fe3eb4a7-dde9-410c-b345-421295aed1ae demo demo] [instance: 13587445-eab9-49fe-8b59-e48da2a005ee] Attempting to driver detach volume 8f936215-6a0b-4c99-ab3c-47926bde6d55 from mountpoint /dev/vdb
Jul 24 15:44:11 stephenfin-cpo-debug-old nova-compute[220178]: INFO nova.virt.libvirt.driver [None req-fe3eb4a7-dde9-410c-b345-421295aed1ae demo demo] Successfully detached device vdb from instance 13587445-eab9-49fe-8b59-e48da2a005ee from the live domain config.
# dalmatian
Jul 24 15:48:56 stephenfin-cpo-debug-new nova-compute[68061]: INFO nova.virt.block_device [None req-672548ab-5a3e-4698-b5cd-0127af0d4358 demo demo] [instance: e8d15440-1442-49c3-9356-4592cdb697d2] Attempting to driver detach volume df0017a1-b3c5-4f30-b48c-7c6da9c763a2 from mountpoint /dev/vdb
Jul 24 15:49:16 stephenfin-cpo-debug-new nova-compute[68061]: WARNING nova.virt.libvirt.driver [None req-672548ab-5a3e-4698-b5cd-0127af0d4358 demo demo] Waiting for libvirt event about the detach of device vdb with device alias ua-df0017a1-b3c5-4f30-b48c-7c6da9c763a2 from instance e8d15440-1442-49c3-9356-4592cdb697d2 is timed out.
Jul 24 15:49:16 stephenfin-cpo-debug-new nova-compute[68061]: INFO nova.virt.libvirt.driver [None req-672548ab-5a3e-4698-b5cd-0127af0d4358 demo demo] Successfully detached device vdb from instance e8d15440-1442-49c3-9356-4592cdb697d2 from the live domain config.
Now that I can reproduce this outside of k3s/CPO, I can start testing different combinations to see where it's broken and where it's not.
- Dalmatian + Ubuntu 24.04
- Epoxy + Ubuntu 22.04
- Epoxy + Ubuntu 24.04
- Flamingo + Ubuntu 24.04 (22.04 is not supported)
To be continued next week.
I continued my testing. It seems every release since Bobcat is broken, regardless of Ubuntu version.
| OpenStack | Ubuntu | Script runtime |
|---|---|---|
| 2023.2 (Bobcat) | 22.04 (Jammy) | 19.22 seconds |
| 2024.1 (Caracal) | 22.04 (Jammy) | 41.88 seconds |
| 2024.2 (Dalmatian) | 22.04 (Jammy) | 44.45 seconds |
| 2024.2 (Dalmatian) | 24.04 (Noble) | 50.58 seconds |
| 2025.1 (Epoxy) | 22.04 (Jammy) | 47.83 seconds |
| 2025.1 (Epoxy) | 24.04 (Noble) | 47.77 seconds |
| 2025.2 (Flamingo) | 24.04 (Noble) | 46.48 seconds |
[!NOTE] Caracal only supported Ubuntu 24.04 (Noble) as experimental, while Flamingo does not support Ubuntu 22.04 (Jammy), so both of these combinations are skipped
Notes below for completeness.
Tested runtimes
- Bobcat: https://governance.openstack.org/tc/reference/runtimes/2023.2.html
- Caracal: https://governance.openstack.org/tc/reference/runtimes/2024.1.html
- Dalmatian: https://governance.openstack.org/tc/reference/runtimes/2024.2.html
- Epoxy: https://governance.openstack.org/tc/reference/runtimes/2025.1.html
- Flamingo: https://governance.openstack.org/tc/reference/runtimes/2025.2.html
Libvirt versions
We update everything before starting deployment, so we'll get the latest version of Libvirt and QEMU. For Ubuntu 22.04 (Jammy), this is:
$ virsh version
Compiled against library: libvirt 8.0.0
Using library: libvirt 8.0.0
Using API: QEMU 8.0.0
Running hypervisor: QEMU 6.2.0
For Ubuntu 24.04 (Noble), this is:
$ virsh version
Compiled against library: libvirt 10.0.0
Using library: libvirt 10.0.0
Using API: QEMU 10.0.0
Running hypervisor: QEMU 8.2.2
Setup
The following were run on all VMs.
Update the VM and reboot, to try to eliminate distro issues.
sudo apt update && sudo apt upgrade -y && sudo reboot
Deploy DevStack with the correct branch, using the local.conf from https://gist.github.com/stephenfin/0c0437dc6f74c4a2c0baef86bc591678.
git clone https://github.com/openstack/devstack
cd devstack
git checkout $branch_or_tag
# save local.conf with correct branch set
./stack.sh
Testing
Once deployed, pre-create an instance on each cloud, since instance creation isn't what we're measuring and the available images differ between releases:
OS_CLOUD=bobcat-jammy openstack server create --flavor m1.tiny --image cirros-0.6.2-x86_64-disk --no-network --wait test-server
OS_CLOUD=caracal-jammy openstack server create --flavor m1.tiny --image cirros-0.6.2-x86_64-disk --no-network --wait test-server
OS_CLOUD=dalmatian-jammy openstack server create --flavor m1.tiny --image cirros-0.6.2-x86_64-disk --no-network --wait test-server
OS_CLOUD=dalmatian-noble openstack server create --flavor m1.tiny --image cirros-0.6.2-x86_64-disk --no-network --wait test-server
OS_CLOUD=epoxy-jammy openstack server create --flavor m1.tiny --image cirros-0.6.3-x86_64-disk --no-network --wait test-server
OS_CLOUD=epoxy-noble openstack server create --flavor m1.tiny --image cirros-0.6.3-x86_64-disk --no-network --wait test-server
OS_CLOUD=flamingo-noble openstack server create --flavor m1.tiny --image cirros-0.6.3-x86_64-disk --no-network --wait test-server
[!NOTE] The images used were:
- Bobcat: cirros-0.6.2-x86_64-disk
- Caracal: cirros-0.6.2-x86_64-disk
- Dalmatian: cirros-0.6.2-x86_64-disk
- Epoxy: cirros-0.6.3-x86_64-disk
- Flamingo: cirros-0.6.3-x86_64-disk

The m1.tiny flavor was used for all clouds.
Create a local clouds.yaml with 7 otherwise-identical entries, changing only the cloud name and the IP address for each. Finally, run the create-delete-volume.py script from https://gist.github.com/stephenfin/0c0437dc6f74c4a2c0baef86bc591678:
virtualenv venv
source venv/bin/activate
pip install openstacksdk
OS_CLOUD=bobcat-jammy ./create-delete-volume.py
OS_CLOUD=caracal-jammy ./create-delete-volume.py
OS_CLOUD=dalmatian-jammy ./create-delete-volume.py
OS_CLOUD=dalmatian-noble ./create-delete-volume.py
OS_CLOUD=epoxy-jammy ./create-delete-volume.py
OS_CLOUD=epoxy-noble ./create-delete-volume.py
OS_CLOUD=flamingo-noble ./create-delete-volume.py
Outputs
This gave me the following outputs:
stephenfin-cinder-perf-bobcat-jammy
DevStack Version: 2023.2
Change: daa3ed62d38daadecfecccc022655deb65e81141 Update glance image size limit 2025-02-13 11:37:41 +0000
OS Version: Ubuntu 22.04 jammy
Script executed in 19.22 seconds.
stephenfin-cinder-perf-caracal-jammy
DevStack Version: 2024.1
Change: ee3cba60bd4fdce274bd3124b3489a042805bb18 Switch ZSWAP_ZPOOL to zsmalloc 2025-07-23 10:29:27 +0100
OS Version: Ubuntu 22.04 jammy
Script executed in 41.88 seconds.
stephenfin-cinder-perf-dalmatian-jammy
DevStack Version: 2024.2
Change: bea1b15527006007ef95b7ff7e81a9f53e2ba3a6 Switch ZSWAP_ZPOOL to zsmalloc 2025-07-23 10:28:30 +0100
OS Version: Ubuntu 22.04 jammy
Script executed in 44.45 seconds.
stephenfin-cinder-perf-dalmatian-noble
DevStack Version: 2024.2
Change: bea1b15527006007ef95b7ff7e81a9f53e2ba3a6 Switch ZSWAP_ZPOOL to zsmalloc 2025-07-23 10:28:30 +0100
OS Version: Ubuntu 24.04 noble
Script executed in 50.58 seconds
stephenfin-cinder-perf-epoxy-jammy
DevStack Version: 2025.1
Change: 62537e6d3e47d46d415c669f51c432d7e8f1bf9e Switch ZSWAP_ZPOOL to zsmalloc 2025-07-23 10:27:47 +0100
OS Version: Ubuntu 22.04 jammy
Script executed in 47.83 seconds.
stephenfin-cinder-perf-epoxy-noble
DevStack Version: 2025.1
Change: 62537e6d3e47d46d415c669f51c432d7e8f1bf9e Switch ZSWAP_ZPOOL to zsmalloc 2025-07-23 10:27:47 +0100
OS Version: Ubuntu 24.04 noble
Script executed in 47.77 seconds.
stephenfin-cinder-perf-flamingo-noble
DevStack Version: 2025.2
Change: bfa9e547a901df5dd74926385010421157b6fca7 Avoid setting iso image in tempest config 2025-07-26 01:11:20 +0000
OS Version: Ubuntu 24.04 noble
Script executed in 46.48 seconds.
So we've nailed this down to a bug in Nova and/or libvirt: https://bugs.launchpad.net/nova/+bug/2119114. Fixing that has the potential to be a long, difficult process, so I'm attempting to work around it here for now. Hopefully the workaround won't be needed for long :crossed_fingers:
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: kayrus
The full list of commands accepted by this bot can be found here.
The pull request process is described here.
- ~~OWNERS~~ [kayrus]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment