`proxmox_kvm` - race condition when creating multiple VMs at the same time

Open eliasp opened this issue 1 month ago • 3 comments

Summary

When using community.general.proxmox_kvm to create multiple VMs at once, only one VM reliably succeeds; the others may randomly fail to be created because the same vmid is used more than once: creation of qemu VM ansible-vm-test4 with vmid 108 failed with exception=500 Internal Server Error: unable to create VM 108 - VM 108 already exists on node '[REDACTED]'

Issue Type

Bug Report

Component Name

proxmox_kvm

Ansible Version

$ ansible --version
ansible [core 2.15.11]
  config file = None
  configured module search path = ['/tmp/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.9/site-packages/ansible
  ansible collection location = /tmp/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.9.18 (main, Jan  4 2024, 00:00:00) [GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] (/usr/bin/python3)
  jinja version = 3.1.3
  libyaml = True

Community.general Version

$ ansible-galaxy collection list community.general
# /usr/share/ansible/collections/ansible_collections
Collection        Version
----------------- -------
community.general 8.6.0

Configuration

$ ansible-config dump --only-changed

OS / Environment

  • Red Hat UBI 9.3 container as execution environment
  • Proxmox 8.1.4 as target

Steps to Reproduce

When trying to create multiple Proxmox KVM VMs at the same time, only the creation of one VM reliably succeeds. All subsequent VMs fail in most cases; sometimes a second or third one can be created.

hosts.yml

systems:
  vars:
    systems_proxmox_host: some-proxmox-host:8006
    systems_proxmox_node: some-cluster-node
    systems_proxmox_username: some-local-user@pam
    systems_proxmox_password: Hunter2
  hosts:
    ansible-vm-test1:
    ansible-vm-test2:
    ansible-vm-test3:
    ansible-vm-test4:

playbook.yml

---
- name: Manage Workstation VMs
  hosts: systems
  # will be done explicitly once the VM exists
  gather_facts: false

  tasks:
    - name: Get information about existing VM
      community.general.proxmox_vm_info:
        api_host: "{{ systems_proxmox_host }}"
        # FIXME: use a proper/trusted cert
        validate_certs: false
        api_user: "{{ systems_proxmox_username }}"
        api_password: "{{ systems_proxmox_password }}"
        name: "{{ inventory_hostname_short }}"
      register: vm_info
      delegate_to: localhost

    - name: Show VM information
      ansible.builtin.debug:
        var: vm_info.proxmox_vms
      delegate_to: localhost

    - name: Stop processing hosts, where there is a non-unique match of VMs (same name exists multiple times)
      ansible.builtin.assert:
        that:
          - vm_info.proxmox_vms | length < 2
        fail_msg: Not processing VM '{{ inventory_hostname_short }}', since one or more duplicates were found and there's no way to distinguish them
        success_msg: Continuing to process VM '{{ inventory_hostname_short }}', no duplicates were found that could cause uniqueness issues
      delegate_to: localhost

    - name: Create a VM
      community.general.proxmox_kvm:
        api_host: "{{ systems_proxmox_host }}"
        # FIXME: use a proper/trusted cert
        validate_certs: false
        api_user: "{{ systems_proxmox_username }}"
        api_password: "{{ systems_proxmox_password }}"
        name: "{{ inventory_hostname_short }}"
        machine: q35
        memory: 10000
        ostype: l26
        # TODO: make this dynamic
        node: "{{ systems_proxmox_node }}"
      delegate_to: localhost

Execution: ansible-playbook -vv -i hosts.yml playbook.yml

Expected Results

I expected community.general.proxmox_kvm to be able to create multiple VMs at once without failing.

Actual Results

This is caused by:

  • community.general.proxmox_kvm having to provide the vmid to Proxmox VE, because vmid is a required parameter and Proxmox VE does not assign one on its own during VM creation
  • each instance of community.general.proxmox_kvm determining the next available vmid at the same time as the other instances of community.general.proxmox_kvm processing this task

TASK [Create a VM] ****************************************************************************************************************************************
task path: /runner/playbook_systems.yml:41
fatal: [ansible-vm-test4 -> localhost]: FAILED! => {"changed": false, "msg": "creation of qemu VM ansible-vm-test4 with vmid 108 failed with exception=500 Internal Server Error: unable to create VM 108 - VM 108 already exists on node '[REDACTED]'", "vmid": "108"}
fatal: [ansible-vm-test1 -> localhost]: FAILED! => {"changed": false, "msg": "creation of qemu VM ansible-vm-test1 with vmid 108 failed with exception=500 Internal Server Error: unable to create VM 108 - VM 108 already exists on node '[REDACTED]'", "vmid": "108"}
changed: [ansible-vm-test2 -> localhost] => {"changed": true, "devices": {}, "mac": {}, "msg": "VM ansible-vm-test2 with vmid 108 deployed", "vmid": 108}
changed: [ansible-vm-test3 -> localhost] => {"changed": true, "devices": {}, "mac": {}, "msg": "VM ansible-vm-test3 with vmid 109 deployed", "vmid": 109}

Since vmid is a required parameter of the PVE API, it has to be determined by the API client (in contrast to, e.g., the vSphere VMOMI API, where the vmoid is generated by the server upon VM creation). This introduces a race condition in our scenario: multiple processes creating VMs at the same time compete in the window between "determine the vmid by querying the PVE API for the next available one" and "create the VM using that vmid", and only the first process to create the VM wins.
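
The window described above can be reproduced deterministically without a Proxmox host. The following is a minimal sketch, not the module's code: FakePVE is a hypothetical in-memory stand-in for the API, and its next_id and create methods only mimic the "query next free vmid" and "create VM with vmid" steps. Starting the search at vmid 100 is an assumption of the sketch.

```python
class FakePVE:
    """Hypothetical in-memory stand-in for the PVE API (not the real client)."""

    def __init__(self, existing):
        # Map vmid -> VM name for every VM that already exists.
        self.vms = dict.fromkeys(existing, "existing")

    def next_id(self):
        # Return the lowest unused vmid (assumed to start at 100).
        vmid = 100
        while vmid in self.vms:
            vmid += 1
        return vmid

    def create(self, vmid, name):
        # Reject a vmid that is already taken, like the 500 error in the log.
        if vmid in self.vms:
            raise RuntimeError(
                f"unable to create VM {vmid} - VM {vmid} already exists"
            )
        self.vms[vmid] = name


pve = FakePVE(existing=range(100, 108))

# Step 1: two workers ask for the next free vmid before either creates a VM.
vmid_a = pve.next_id()
vmid_b = pve.next_id()

# Step 2: both try to create a VM with the vmid they were handed.
pve.create(vmid_a, "ansible-vm-test2")
try:
    pve.create(vmid_b, "ansible-vm-test4")
    outcome = "created"
except RuntimeError as exc:
    outcome = f"failed: {exc}"

print(vmid_a, vmid_b, outcome)
```

Both workers receive the same vmid because nothing reserves it between the query and the create call, so the second create is rejected.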

The only reasonable fix is to catch the API error and retry with another vmid until the VM creation succeeds.
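
A rough sketch of what such a retry loop could look like, under stated assumptions: FakePVE, VmidConflict and create_with_retry are hypothetical names made up for this illustration, and the in-memory fake replaces the real API. In the actual module the conflict would have to be recognised from the API error, which this sketch does not attempt.

```python
import threading


class VmidConflict(Exception):
    """Raised by the fake API when the requested vmid is already taken."""


class FakePVE:
    """Hypothetical in-memory stand-in for the PVE API (not the real client)."""

    def __init__(self):
        self.vms = {}
        self._lock = threading.Lock()

    def next_id(self):
        # Lowest unused vmid (assumed to start at 100); does not reserve it.
        with self._lock:
            vmid = 100
            while vmid in self.vms:
                vmid += 1
            return vmid

    def create(self, vmid, name):
        with self._lock:
            if vmid in self.vms:
                raise VmidConflict(f"VM {vmid} already exists")
            self.vms[vmid] = name


def create_with_retry(api, name, max_attempts=10):
    """Ask for a fresh vmid and retry whenever another client took it first."""
    for _ in range(max_attempts):
        vmid = api.next_id()
        try:
            api.create(vmid, name)
            return vmid
        except VmidConflict:
            continue  # lost the race for this vmid, query a new one
    raise RuntimeError(f"giving up on {name} after {max_attempts} attempts")


api = FakePVE()
results = {}


def worker(name):
    results[name] = create_with_retry(api, name)


names = [f"ansible-vm-test{i}" for i in range(1, 5)]
threads = [threading.Thread(target=worker, args=(n,)) for n in names]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results.values()))
```

Each failed attempt means some other worker created a VM in the meantime, so with four workers no single worker can lose more than three times; the attempt limit only guards against unrelated, persistent errors.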

For now, users can work around this by setting throttle: 1 on the corresponding task, which prevents multiple processes from running it at the same time:

    - name: Create a VM
      community.general.proxmox_kvm:
        api_host: "{{ systems_proxmox_host }}"
        # FIXME: use a proper/trusted cert
        validate_certs: false
        api_user: "{{ systems_proxmox_username }}"
        api_password: "{{ systems_proxmox_password }}"
        name: "{{ inventory_hostname_short }}"
        machine: q35
        memory: 10000
        ostype: l26
        # TODO: make this dynamic
        node: "{{ systems_proxmox_node }}"
      delegate_to: localhost
      throttle: 1

Code of Conduct

  • [X] I agree to follow the Ansible Code of Conduct

eliasp avatar May 08 '24 15:05 eliasp

Files identified in the description:

If these files are incorrect, please update the component name section of the description or use the !component bot command.


ansibullbot avatar May 08 '24 16:05 ansibullbot

cc @Ajpantuso @Thulium-Drake @UnderGreen @helldorado @joshainglis @karmab @krauthosting

ansibullbot avatar May 08 '24 16:05 ansibullbot

(we had a discussion on this before @eliasp created this issue in #devel:ansible.com on Matrix; if someone is interested in the discussion, the logs for that room are public IIRC)

felixfontein avatar May 08 '24 19:05 felixfontein