
tests: Bump DevStack to Dalmatian (2024.2)

Open stephenfin opened this issue 11 months ago • 25 comments

What this PR does / why we need it:

Bump the version of DevStack used in CI from Bobcat (2023.2), which is now EOL, to Dalmatian (2024.2). A future change will bump this further to Epoxy (2025.1).

Which issue this PR fixes (if applicable):

(none)

Special notes for reviewers:

(none)

Release note:

NONE

stephenfin avatar Dec 09 '24 12:12 stephenfin

/hold

This is the second attempt, after the first was reverted (#2730). I need to see how this performs. FWIW, I saw no performance issues locally.

stephenfin avatar Dec 09 '24 12:12 stephenfin

@stephenfin see #2730

kayrus avatar Dec 09 '24 12:12 kayrus

@stephenfin see #2730

Yup, see my comment right above :smile:

stephenfin avatar Dec 09 '24 17:12 stephenfin

I wonder if https://github.com/kubernetes/cloud-provider-openstack/pull/2747 would help.

EmilienM avatar Dec 12 '24 18:12 EmilienM

/retest

kayrus avatar Dec 12 '24 22:12 kayrus

/test openstack-cloud-csi-manila-e2e-test

Previously, manila tests took 49m29s; cinder tests took 1h50m18s and failed due to timeout.

kayrus avatar Dec 12 '24 22:12 kayrus

/test openstack-cloud-csi-manila-e2e-test

kayrus avatar Dec 12 '24 22:12 kayrus

@EmilienM looks like #2747 doesn't help

kayrus avatar Dec 13 '24 11:12 kayrus

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 13 '25 12:03 k8s-triage-robot

/remove-lifecycle stale

kayrus avatar Mar 13 '25 12:03 kayrus

Error due to missing zpool module param:

+ lib/host:configure_zswap:45              :   sudo tee /sys/module/zswap/parameters/zpool
z3fold
tee: /sys/module/zswap/parameters/zpool: No such file or directory 
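The failure is just a missing sysfs node: the kernel on this image doesn't expose a z3fold zpool parameter for zswap, so DevStack's `tee` fails. A small sketch (the sysfs path comes from the log above; the helper itself is hypothetical) to probe what the kernel actually exposes before writing to it:

```python
# Probe zswap parameters instead of letting `tee` fail on a missing
# sysfs node. The path is taken from the DevStack log above; the
# helper and parameter list are illustrative.
from pathlib import Path


def read_zswap_param(name, root="/sys/module/zswap/parameters"):
    """Return the value of a zswap parameter, or None if the kernel
    doesn't expose it."""
    param = Path(root) / name
    if not param.exists():
        return None
    return param.read_text().strip()


if __name__ == "__main__":
    for name in ("enabled", "zpool", "compressor"):
        value = read_zswap_param(name)
        print(f"zswap {name}: {value if value is not None else '<not exposed>'}")
```

On a kernel without the relevant zswap support, every parameter reports as not exposed, which is exactly what the DevStack run tripped over.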

However, once again we appear to have ended up with a Jammy image despite requesting Noble :confused: Investigating.

stephenfin avatar May 08 '25 17:05 stephenfin

@stephenfin thanks for picking this up, fwiw..

https://review.opendev.org/c/openstack/devstack/+/942755

also, expect to see some failures because of:

https://github.com/kubernetes/cloud-provider-openstack/issues/2884

mnaser avatar May 08 '25 18:05 mnaser

also, expect to see some failures because of:

#2884

Thanks. It might make sense to stick with 2024.2, fix that, then bump to 2025.2 so. Will think on it :thinking:

stephenfin avatar May 08 '25 18:05 stephenfin

Okay, we're finally seeing an ubuntu-24.04 image :pray: From the logs of one of the jobs:

23.48 - Thu, 08 May 2025 18:20:39 +0000 - v. 24.4.1-0ubuntu0~24.04.3

stephenfin avatar May 08 '25 18:05 stephenfin

also, expect to see some failures because of: #2884

Thanks. It might make sense to stick with 2024.2, fix that, then bump to 2025.2 so. Will think on it 🤔

I've done this.

stephenfin avatar May 08 '25 18:05 stephenfin

Turns out we were never running against Ubuntu 24.04. While Boskos reaps networks, instances, disks etc., it doesn't reap images. We've likely been using the same (Ubuntu 22.04) image for who knows how long at this point :sweat_smile:

https://github.com/kubernetes-sigs/boskos/blob/5993cef5a1c719c33c0936d416b7d935058e1204/cmd/janitor/gcp_janitor.py#L38

stephenfin avatar May 08 '25 18:05 stephenfin

Investigating the performance degradation by comparing two recent builds: the last passing one and this failing one.

DevStack is about 40% slower to deploy, at 652 seconds (10m52s) versus 467 seconds (7m47s), but that difference is small and variable enough (judging by other failures in between) to be irrelevant. It looks like the tests themselves are taking longer. I'm going to rework things so we actually get a response back from ginkgo if the test run fails.
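For what it's worth, the deploy-time delta works out as follows (numbers taken from the two builds above):

```python
# Percentage slowdown in DevStack deploy time between the two builds.
before_sec = 467  # last passing build: 7m47s
after_sec = 652   # most recent failing build: 10m52s

slowdown_pct = (after_sec - before_sec) / before_sec * 100
print(f"deploy slowdown: {slowdown_pct:.1f}%")  # → deploy slowdown: 39.6%
```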

stephenfin avatar May 12 '25 12:05 stephenfin

Looks like there are some very significant changes in test runtimes across the board. Now to figure out why. I've been using the script below to compare results from JUnit files (specifically, the JUnit files from the last success and the most recent failure). The result can be seen in results.csv.

#!/usr/bin/env python3

import csv

from lxml import etree


def diff(before: str, after: str):
    with open(before) as fh:
        passing = etree.parse(fh)

    with open(after) as fh:
        failing = etree.parse(fh)

    passing_results = {}
    results_diff = {}

    for testcase in passing.findall('.//testcase'):
        passing_results[testcase.get('name')] = (
            testcase.get('status'), testcase.get('time')
        )

    for testcase in failing.findall('.//testcase'):
        name = testcase.get('name')
        if name not in passing_results:
            raise Exception('tests missing from runs: this should not happen')

        # include a test unless it was skipped in both runs
        if (
            testcase.get('status') != passing_results[name][0] or
            testcase.get('status') != 'skipped'
        ):
            results_diff[testcase.get('name')] = {
                'before': passing_results[name],
                'after': (testcase.get('status'), testcase.get('time')),
            }

    with open('results.csv', 'w', newline='') as fh:
        writer = csv.writer(fh)

        for name, result in results_diff.items():
            if name in {
                '[ReportBeforeSuite]',
                '[SynchronizedBeforeSuite]',
                '[SynchronizedAfterSuite]',
                '[ReportAfterSuite] Kubernetes e2e suite report',
            }:
                continue

            if result['before'][0] != result['after'][0]:
                # we might want to look at this later
                continue

            before_sec = float(result['before'][1])
            after_sec = float(result['after'][1])

            change_pct = ((after_sec - before_sec) / before_sec) * 100
            print(f'{name}')
            print(f'\tbefore: {before_sec:0.2f} seconds')
            print(f'\tafter:  {after_sec:0.2f} seconds')
            print(f'\tchange: {change_pct:0.2f}%')

            writer.writerow([name, before_sec, after_sec, change_pct])


def main():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument(
        'before',
        help='Before result (passing)',
    )
    parser.add_argument(
        'after',
        help='After result (failing)',
    )
    args = parser.parse_args()
    diff(args.before, args.after)


if __name__ == '__main__':
    main()
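For anyone wanting to reuse the script, this is the JUnit shape it consumes, shown with stdlib ElementTree rather than lxml and inline XML rather than files (the test name and timings are illustrative):

```python
# Demonstrate the JUnit attributes the diff script reads: each
# <testcase> carries a name, a status, and a time in seconds.
import xml.etree.ElementTree as ET

BEFORE = """<testsuite>
  <testcase name="volumeMode should work" status="passed" time="27.589"/>
</testsuite>"""
AFTER = """<testsuite>
  <testcase name="volumeMode should work" status="passed" time="71.706"/>
</testsuite>"""

before = {
    tc.get("name"): float(tc.get("time"))
    for tc in ET.fromstring(BEFORE).iter("testcase")
}
for tc in ET.fromstring(AFTER).iter("testcase"):
    name = tc.get("name")
    change = (float(tc.get("time")) - before[name]) / before[name] * 100
    print(f"{name}: {change:+.1f}%")  # → volumeMode should work: +159.9%
```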

stephenfin avatar May 12 '25 16:05 stephenfin

Let's see if we get the same performance issues in Caracal, since that cuts our diff in half. Proposed in https://github.com/kubernetes/cloud-provider-openstack/pull/2888.

stephenfin avatar May 12 '25 17:05 stephenfin

Let's see if we get the same performance issues in Caracal, since that cuts our diff in half. Proposed https://github.com/kubernetes/cloud-provider-openstack/pull/2888.

Tangentially, since we have limited resources, I think in this repo we should only test with SLURP releases, i.e. 2024.1 and 2025.1 are more appropriate/relevant than the .2 releases due to their popularity. We could override this for individual test jobs to a .2 release if necessary.

gouthampacha avatar May 12 '25 17:05 gouthampacha

Let's see if we get the same performance issues in Caracal, since that cuts our diff in half. Proposed #2888.

Tangentially, since we have limited resources, I think in this repo we should only test with SLURP releases, i.e. 2024.1 and 2025.1 are more appropriate/relevant than the .2 releases due to their popularity. We could override this for individual test jobs to a .2 release if necessary.

I agree.

stephenfin avatar May 13 '25 12:05 stephenfin

Further notes to self from my debugging. Dumping them here in case they're useful to anyone else. I still haven't gotten to the bottom of this.


I've deployed two VMs, both running Ubuntu 22.04 and with the standard ubuntu user. I've then run the following command from the tip of the current master to deploy Bobcat on the first:

❯ ansible-playbook -v --user ubuntu --inventory <IP address>, --ssh-common-args "-o StrictHostKeyChecking=no" tests/playbooks/test-csi-cinder-e2e.yaml

This is the same thing we do in CI. Because I'm hitting Docker rate limits, I only run the first 3 roles (install-golang, install-devstack, install-docker) and comment out the rest. I then log in as root and both (a) log in to docker.io (docker login) and (b) clone this repo to /root/src/k8s.io/cloud-provider-openstack. After this, I run the remaining roles, commenting out the "Run functional tests for csi-cinder-plugin" step from tests/playbooks/roles/install-csi-cinder/tasks/main.yaml, since I don't want to run these during "installation".

I then repeat this on the second node, but with the following diff:

diff --git tests/playbooks/roles/install-devstack/defaults/main.yaml tests/playbooks/roles/install-devstack/defaults/main.yaml
index 8a2839dd9..950a17573 100644
--- tests/playbooks/roles/install-devstack/defaults/main.yaml
+++ tests/playbooks/roles/install-devstack/defaults/main.yaml
@@ -1,7 +1,7 @@
 ---
 user: "stack"
 workdir: "/home/{{ user }}/devstack"
-branch: "2023.2-eol"
+branch: "stable/2024.1"
 enable_services:
   - nova
   - glance
diff --git tests/playbooks/roles/install-devstack/templates/local.conf.j2 tests/playbooks/roles/install-devstack/templates/local.conf.j2
index 3ec7710a9..c896a73b2 100644
--- tests/playbooks/roles/install-devstack/templates/local.conf.j2
+++ tests/playbooks/roles/install-devstack/templates/local.conf.j2
@@ -39,7 +39,7 @@ ENABLE_SYSCTL_NET_TUNING=true
 # increase in swap performance by reducing the amount of data
 # written to disk. the overall speedup is porportional to the
 # compression ratio and the speed of the swap device.
-ENABLE_ZSWAP=true
+ENABLE_ZSWAP=false

 {% if "nova" in enable_services %}
 # Nova

(I really should have done the latter change across both branches, but I forgot. I'll do it if I need to redeploy. I've proposed https://review.opendev.org/c/openstack/devstack/+/955670 through https://review.opendev.org/c/openstack/devstack/+/955672 to avoid the need to do this in the future).

Finally, I run a single test across both, since it's a reliable reproducer of the test performance issues, using the below script (run as root):

export GOPATH=/root
export PATH=/usr/local/go/bin:/root/bin:$PATH
export KUBECONFIG=/root/.kube/config

pushd /root/src/k8s.io/cloud-provider-openstack
/tmp/kubernetes/test/bin/e2e.test \
  -storage.testdriver=tests/e2e/csi/cinder/test-driver.yaml \
  --ginkgo.focus='External Storage \[Driver: cinder\.csi\.openstack\.org\] \[Testpattern: Dynamic PV \(block volmode\)\] volumeMode should fail to use a volume in a pod with mismatched mode \[Slow\]' \
  --ginkgo.no-color \
  --ginkgo.v \
  --ginkgo.timeout=24h \
  -test.timeout=0

This is currently giving me different results on Bobcat and Dalmatian:

# bobcat
Ran 1 of 6920 Specs in 27.589 seconds

# dalmatian
Ran 1 of 6920 Specs in 71.706 seconds

stephenfin avatar Jul 23 '25 10:07 stephenfin

@stephenfin so what is the main reason for the test speed degradation? ZSWAP?

kayrus avatar Jul 23 '25 12:07 kayrus

No, I still don't know; I'm still debugging it. So far, I have bumped all the services (for each service: switch to the stable/2024.1 branch, run the db sync commands, and restart) and compared all the configuration files (nothing out of the ordinary). I'm now bumping dependencies.

stephenfin avatar Jul 23 '25 14:07 stephenfin

Still no luck. I've now (a) bumped all OpenStack services, (b) compared configs for differences, and (c) bumped all other Python dependencies in the global venv. No dice: performance is still good on the Bobcat VM and poor on the Dalmatian VM. I've shared my changes in the pastebin below in case they're useful to anyone. I'll keep investigating tomorrow.

https://paste.opendev.org/show/bcnPKJz1oFhgNlBVYvPo/

stephenfin avatar Jul 23 '25 17:07 stephenfin

Still no luck, but I have managed to take k3s and CPO out of the loop and can reliably reproduce the issue using a simple script.

https://gist.github.com/stephenfin/0c0437dc6f74c4a2c0baef86bc591678
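For reference, the reproducer in the gist boils down to timing a volume create/attach/detach/delete cycle. The sketch below illustrates the approach only; it is not a copy of the gist, and the openstacksdk calls and resource names are assumptions:

```python
#!/usr/bin/env python3
# Sketch of a volume create/attach/detach/delete timing loop. This is
# an illustration of the approach, not the actual gist script: the SDK
# calls and resource names below are assumptions.
import os
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the wrapped block of work took."""
    start = time.monotonic()
    yield
    print(f"{label}: {time.monotonic() - start:.2f} seconds")


def main():
    import openstack  # pip install openstacksdk

    conn = openstack.connect()  # cloud selected via OS_CLOUD
    server = conn.get_server("test-server")  # pre-created instance

    with timed("create-attach-detach-delete cycle"):
        volume = conn.create_volume(size=1, wait=True)
        conn.attach_volume(server, volume, wait=True)
        conn.detach_volume(server, volume, wait=True)
        conn.delete_volume(volume.id, wait=True)


if __name__ == "__main__" and os.environ.get("OS_CLOUD"):
    main()
```

Run with OS_CLOUD pointing at a clouds.yaml entry; any per-operation stall (such as the detach delay below) shows up directly in the total.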

For some reason, nova is not receiving the event from libvirt and is timing out, as seen in the WARNING log below.

# bobcat

Jul 24 15:44:11 stephenfin-cpo-debug-old nova-compute[220178]: INFO nova.virt.block_device [None req-fe3eb4a7-dde9-410c-b345-421295aed1ae demo demo] [instance: 13587445-eab9-49fe-8b59-e48da2a005ee] Attempting to driver detach volume 8f936215-6a0b-4c99-ab3c-47926bde6d55 from mountpoint /dev/vdb
Jul 24 15:44:11 stephenfin-cpo-debug-old nova-compute[220178]: INFO nova.virt.libvirt.driver [None req-fe3eb4a7-dde9-410c-b345-421295aed1ae demo demo] Successfully detached device vdb from instance 13587445-eab9-49fe-8b59-e48da2a005ee from the live domain config.

# dalmatian

Jul 24 15:48:56 stephenfin-cpo-debug-new nova-compute[68061]: INFO nova.virt.block_device [None req-672548ab-5a3e-4698-b5cd-0127af0d4358 demo demo] [instance: e8d15440-1442-49c3-9356-4592cdb697d2] Attempting to driver detach volume df0017a1-b3c5-4f30-b48c-7c6da9c763a2 from mountpoint /dev/vdb
Jul 24 15:49:16 stephenfin-cpo-debug-new nova-compute[68061]: WARNING nova.virt.libvirt.driver [None req-672548ab-5a3e-4698-b5cd-0127af0d4358 demo demo] Waiting for libvirt event about the detach of device vdb with device alias ua-df0017a1-b3c5-4f30-b48c-7c6da9c763a2 from instance e8d15440-1442-49c3-9356-4592cdb697d2 is timed out.
Jul 24 15:49:16 stephenfin-cpo-debug-new nova-compute[68061]: INFO nova.virt.libvirt.driver [None req-672548ab-5a3e-4698-b5cd-0127af0d4358 demo demo] Successfully detached device vdb from instance e8d15440-1442-49c3-9356-4592cdb697d2 from the live domain config.
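The Dalmatian log shows a 20-second stall between the detach attempt and the warning, which matches Nova's default libvirt device-detach event timeout. Parsing the journal timestamps quoted above makes the gap explicit:

```python
# Measure the detach stall from the journal timestamps quoted above.
from datetime import datetime

FMT = "%b %d %H:%M:%S"

# Dalmatian: detach attempt vs. the "Waiting ... is timed out" warning.
attempt = datetime.strptime("Jul 24 15:48:56", FMT)
warning = datetime.strptime("Jul 24 15:49:16", FMT)

print(f"stall before timeout: {(warning - attempt).total_seconds():.0f} seconds")
# → stall before timeout: 20 seconds
```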

Now that I can reproduce this outside of k3s/CPO, I can start testing different combinations to see where it's broken and where it's not.

  • Dalmatian + Ubuntu 24.04
  • Epoxy + Ubuntu 22.04
  • Epoxy + Ubuntu 24.04
  • Flamingo + Ubuntu 24.04 (22.04 is not supported)

To be continued next week.

stephenfin avatar Jul 24 '25 16:07 stephenfin

I continued my testing. It seems every release since Bobcat is broken, regardless of Ubuntu version.

| OpenStack | Ubuntu | Result |
| --- | --- | --- |
| 2023.2 (Bobcat) | 22.04 (Jammy) | 19.22 seconds |
| 2024.1 (Caracal) | 22.04 (Jammy) | 41.88 seconds |
| 2024.2 (Dalmatian) | 22.04 (Jammy) | 44.45 seconds |
| 2024.2 (Dalmatian) | 24.04 (Noble) | 50.58 seconds |
| 2025.1 (Epoxy) | 22.04 (Jammy) | 47.83 seconds |
| 2025.1 (Epoxy) | 24.04 (Noble) | 47.77 seconds |
| 2025.2 (Flamingo) | 24.04 (Noble) | 46.48 seconds |

[!NOTE] Caracal only supported Ubuntu 24.04 (Noble) as experimental, while Flamingo does not support Ubuntu 22.04 (Jammy), so both of these cases are skipped
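Normalising those results against the Bobcat baseline shows the regression is roughly constant (a 2.2-2.6x slowdown) rather than creeping up release by release:

```python
# Slowdown factors relative to the Bobcat baseline, computed from the
# results table above.
baseline = 19.22  # 2023.2 (Bobcat) on Jammy

results = {
    "caracal-jammy": 41.88,
    "dalmatian-jammy": 44.45,
    "dalmatian-noble": 50.58,
    "epoxy-jammy": 47.83,
    "epoxy-noble": 47.77,
    "flamingo-noble": 46.48,
}

for cloud, seconds in results.items():
    print(f"{cloud}: {seconds / baseline:.2f}x slower than Bobcat")
```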

Notes below for completeness.


Test runtimes

  • Bobcat: https://governance.openstack.org/tc/reference/runtimes/2023.2.html
  • Caracal: https://governance.openstack.org/tc/reference/runtimes/2024.1.html
  • Dalmatian: https://governance.openstack.org/tc/reference/runtimes/2024.2.html
  • Epoxy: https://governance.openstack.org/tc/reference/runtimes/2025.1.html
  • Flamingo: https://governance.openstack.org/tc/reference/runtimes/2025.2.html

Libvirt versions

We update everything before starting deployment, so we'll get the latest version of Libvirt and QEMU. For Ubuntu 22.04 (Jammy), this is:

$ virsh version
Compiled against library: libvirt 8.0.0
Using library: libvirt 8.0.0
Using API: QEMU 8.0.0
Running hypervisor: QEMU 6.2.0

For Ubuntu 24.04 (Noble), this is:

$ virsh version
Compiled against library: libvirt 10.0.0
Using library: libvirt 10.0.0
Using API: QEMU 10.0.0
Running hypervisor: QEMU 8.2.2

Setup

The following were run on all VMs.

Update the VM and reboot to try to eliminate distro issues.

sudo apt update && sudo apt upgrade -y && sudo reboot

Deploy DevStack with correct branch, using the local.conf from https://gist.github.com/stephenfin/0c0437dc6f74c4a2c0baef86bc591678.

git clone https://github.com/openstack/devstack
cd devstack
git checkout $branch_or_tag
# save local.conf with correct branch set
./stack.sh

Testing

Once deployed, pre-create an instance, since instance creation is mostly irrelevant to the measurement and the images differ between releases:

OS_CLOUD=bobcat-jammy openstack server create --flavor m1.tiny --image cirros-0.6.2-x86_64-disk --no-network --wait test-server
OS_CLOUD=caracal-jammy openstack server create --flavor m1.tiny --image cirros-0.6.2-x86_64-disk --no-network --wait test-server
OS_CLOUD=dalmatian-jammy openstack server create --flavor m1.tiny --image cirros-0.6.2-x86_64-disk --no-network --wait test-server
OS_CLOUD=dalmatian-noble openstack server create --flavor m1.tiny --image cirros-0.6.2-x86_64-disk --no-network --wait test-server
OS_CLOUD=epoxy-jammy openstack server create --flavor m1.tiny --image cirros-0.6.3-x86_64-disk --no-network --wait test-server
OS_CLOUD=epoxy-noble openstack server create --flavor m1.tiny --image cirros-0.6.3-x86_64-disk --no-network --wait test-server
OS_CLOUD=flamingo-noble openstack server create --flavor m1.tiny --image cirros-0.6.3-x86_64-disk --no-network --wait test-server

[!NOTE] The images used were:

  • Bobcat: cirros-0.6.2-x86_64-disk
  • Caracal: cirros-0.6.2-x86_64-disk
  • Dalmatian: cirros-0.6.2-x86_64-disk
  • Epoxy: cirros-0.6.3-x86_64-disk
  • Flamingo: cirros-0.6.3-x86_64-disk

The m1.tiny flavor was used for all clouds.

Create a local clouds.yaml with 7 identical entries, changing only the cloud name and the IP address for each. Finally, run the create-delete-volume.py script from https://gist.github.com/stephenfin/0c0437dc6f74c4a2c0baef86bc591678:

virtualenv venv
source venv/bin/activate
pip install openstacksdk
OS_CLOUD=bobcat-jammy ./create-delete-volume.py
OS_CLOUD=caracal-jammy ./create-delete-volume.py
OS_CLOUD=dalmatian-jammy ./create-delete-volume.py
OS_CLOUD=dalmatian-noble ./create-delete-volume.py
OS_CLOUD=epoxy-jammy ./create-delete-volume.py
OS_CLOUD=epoxy-noble ./create-delete-volume.py
OS_CLOUD=flamingo-noble ./create-delete-volume.py
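The clouds.yaml referenced above might look like the following (two of the seven entries shown; the auth_url IPs are placeholders and the credentials assume DevStack's default demo project):

```yaml
clouds:
  bobcat-jammy:
    auth:
      auth_url: http://203.0.113.10/identity   # placeholder IP
      username: demo
      password: secret                          # whatever ADMIN_PASSWORD was set to
      project_name: demo
      project_domain_name: Default
      user_domain_name: Default
    region_name: RegionOne
  dalmatian-noble:
    auth:
      auth_url: http://203.0.113.11/identity   # placeholder IP
      username: demo
      password: secret
      project_name: demo
      project_domain_name: Default
      user_domain_name: Default
    region_name: RegionOne
```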

Outputs

This gave me the following outputs:

stephenfin-cinder-perf-bobcat-jammy

DevStack Version: 2023.2
Change: daa3ed62d38daadecfecccc022655deb65e81141 Update glance image size limit 2025-02-13 11:37:41 +0000
OS Version: Ubuntu 22.04 jammy
Script executed in 19.22 seconds.

stephenfin-cinder-perf-caracal-jammy

DevStack Version: 2024.1
Change: ee3cba60bd4fdce274bd3124b3489a042805bb18 Switch ZSWAP_ZPOOL to zsmalloc 2025-07-23 10:29:27 +0100
OS Version: Ubuntu 22.04 jammy
Script executed in 41.88 seconds.

stephenfin-cinder-perf-dalmatian-jammy

DevStack Version: 2024.2
Change: bea1b15527006007ef95b7ff7e81a9f53e2ba3a6 Switch ZSWAP_ZPOOL to zsmalloc 2025-07-23 10:28:30 +0100
OS Version: Ubuntu 22.04 jammy
Script executed in 44.45 seconds.

stephenfin-cinder-perf-dalmatian-noble

DevStack Version: 2024.2
Change: bea1b15527006007ef95b7ff7e81a9f53e2ba3a6 Switch ZSWAP_ZPOOL to zsmalloc 2025-07-23 10:28:30 +0100
OS Version: Ubuntu 24.04 noble
Script executed in 50.58 seconds.

stephenfin-cinder-perf-epoxy-jammy

DevStack Version: 2025.1
Change: 62537e6d3e47d46d415c669f51c432d7e8f1bf9e Switch ZSWAP_ZPOOL to zsmalloc 2025-07-23 10:27:47 +0100
OS Version: Ubuntu 22.04 jammy
Script executed in 47.83 seconds.

stephenfin-cinder-perf-epoxy-noble

DevStack Version: 2025.1
Change: 62537e6d3e47d46d415c669f51c432d7e8f1bf9e Switch ZSWAP_ZPOOL to zsmalloc 2025-07-23 10:27:47 +0100
OS Version: Ubuntu 24.04 noble
Script executed in 47.77 seconds.

stephenfin-cinder-perf-flamingo-noble

DevStack Version: 2025.2
Change: bfa9e547a901df5dd74926385010421157b6fca7 Avoid setting iso image in tempest config 2025-07-26 01:11:20 +0000
OS Version: Ubuntu 24.04 noble
Script executed in 46.48 seconds.

stephenfin avatar Jul 30 '25 11:07 stephenfin

So we've nailed this down to a bug in Nova and/or libvirt: https://bugs.launchpad.net/nova/+bug/2119114. Fixing that has the potential to be a long, difficult process, so I'm attempting to work around it here for now. Hopefully the workaround won't be needed for long :crossed_fingers:

stephenfin avatar Jul 31 '25 13:07 stephenfin

/lgtm
/approve

kayrus avatar Jul 31 '25 17:07 kayrus

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kayrus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Jul 31 '25 17:07 k8s-ci-robot