terraform-provider-libvirt

"wait_for_lease = true" does not take effect

Open SJFCS opened this issue 1 year ago • 8 comments

System Information

Linux distribution

Archlinux

Terraform version

terraform -v
Terraform v1.9.4
on linux_amd64

Provider and libvirt versions

terraform-provider-libvirt -version
0.7.6

Description of Issue/Question

I use the simple configuration below, which installs qemu-ga through cloud-init. With terraform-provider-libvirt 0.7.6 and "qemu_agent = true", "wait_for_lease = true" does not wait for qemu-ga to report the IP address; the apply fails with "Error: couldn't retrieve IP address". If I change the provider version to 0.7.1, run terraform init -upgrade, and apply again, it waits until qemu-ga reports the IP address.

Please excuse my poor English.

Setup

this is my main.tf

terraform {
  required_version = ">= 0.13"
  required_providers {
    libvirt = {
      source  = "dmacvicar/libvirt"
      version = "0.7.6"
    }
  }
}

provider "libvirt" {
  uri = "qemu:///system"
}

data "template_file" "user_data" {
  template = file("${path.module}/cloud_init/cloud_init.yml")
}

data "template_file" "network_config" {
  template = file("${path.module}/cloud_init/network_config.yml")
}

resource "libvirt_cloudinit_disk" "cloudinit" {
  name           = "cloudinit.iso"
  user_data      = data.template_file.user_data.rendered
  network_config = data.template_file.network_config.rendered
  pool           = "default"
}

resource "libvirt_volume" "debian9-qcow2" {
  name   = "debian9-qcow2"
  pool   = "default"
  source = "./ubuntu-24.04-server-cloudimg-amd64.img"
}

// set boot order hd, network
resource "libvirt_domain" "domain-debian9-qcow2" {
  name       = "debian9"
  memory     = "1024"
  vcpu       = 1
  qemu_agent = true
  cloudinit  = libvirt_cloudinit_disk.cloudinit.id

  network_interface {
    bridge         = "br0"
    wait_for_lease = true
  }

  boot_device {
    dev = ["hd", "network"]
  }

  disk {
    volume_id = libvirt_volume.debian9-qcow2.id
  }

  graphics {
    type        = "spice"
    listen_type = "address"
    autoport    = true
  }
  provisioner "remote-exec" {
    inline = [
      <<-EOF
        sudo apt-get update 
        sudo apt-get install nginx -y
        EOF
    ]
  }
  connection {
    type = "ssh"
    user = "ubuntu"
    host = self.network_interface[0].addresses[0] 
    private_key = file("~/.ssh/id_ed25519")
    timeout = "2m"
  }
}

this is cloud_init.yml

#cloud-config

bootcmd:
  - echo "This is a boot command"
runcmd:
  - [sh, -xc, "echo $(date) ': hello world!'"]
  - sudo apt-get update 
  - sudo apt-get install qemu-guest-agent -y
  - sudo systemctl enable --now qemu-guest-agent.service
ssh_pwauth: true
disable_root: false
users:
  - name: root
    plain_text_passwd: 'password'
    lock_passwd: false
  - name: ubuntu
    sudo: ALL=(ALL) NOPASSWD:ALL
    groups: users, admin
    home: /home/ubuntu
    shell: /bin/bash
    lock_passwd: false
    ssh-authorized-keys:
      - ssh-ed25519 Axxx5 [email protected]

network_config.yml

version: 2
ethernets:
  ens3:
    dhcp4: true

Steps to Reproduce Issue

Steps with 0.7.6 (does not work):

  1. terraform init
  2. TF_LOG=DEBUG terraform apply -auto-approve

Debug output with 0.7.6:

          </graphics>
          <rng model="virtio">
              <backend model="random">/dev/urandom</backend>
          </rng>
      </devices>
  </domain>: timestamp="2024-08-14T23:28:10.086+0800"
2024-08-14T23:28:10.435+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.6: 2024/08/14 23:28:10 [INFO] Domain ID: 1e643687-5914-469d-b5c8-356c5dc65790: timestamp="2024-08-14T23:28:10.435+0800"
2024-08-14T23:28:10.435+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.6: 2024/08/14 23:28:10 [DEBUG] Waiting for state to become: [all-addresses-obtained]: timestamp="2024-08-14T23:28:10.435+0800"
2024-08-14T23:28:15.441+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.6: 2024/08/14 23:28:15 [DEBUG] waiting for network address for iface=52:54:00:16:93:28: timestamp="2024-08-14T23:28:15.440+0800"
2024-08-14T23:28:15.441+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.6: 2024/08/14 23:28:15 [DEBUG] qemu-agent used to query interface info: timestamp="2024-08-14T23:28:15.441+0800"
2024-08-14T23:28:15.443+0800 [ERROR] provider.terraform-provider-libvirt_v0.7.6: Response contains error diagnostic: diagnostic_severity=ERROR tf_proto_version=5.3 tf_provider_addr=provider @caller=github.com/hashicorp/[email protected]/tfprotov5/internal/diag/diagnostics.go:55 @module=sdk.proto tf_req_id=3443beee-8402-aa9f-8e77-364a3bd03a5e tf_resource_type=libvirt_domain tf_rpc=ApplyResourceChange diagnostic_detail=""
  diagnostic_summary=
  | couldn't retrieve IP address of domain id: 1e643687-5914-469d-b5c8-356c5dc65790. Please check following: 
  | 1) is the domain running properly? 
  | 2) has the network interface an IP address? 
  | 3) Networking issues on your libvirt setup? 
  |  4) is DHCP enabled on this Domain's network? 
  | 5) if you use bridge network, the domain should have the pkg qemu-agent installed 
  | IMPORTANT: This error is not a terraform libvirt-provider error, but an error caused by your KVM/libvirt infrastructure configuration/setup 

Steps with 0.7.1 (works):

  1. Change only the provider version:
    libvirt = {
      source  = "dmacvicar/libvirt"
      version = "0.7.1"
    }
  2. terraform init -upgrade
  3. TF_LOG=DEBUG terraform apply -auto-approve

Debug output with 0.7.1:

2024-08-14T23:26:07.000+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:07 [DEBUG] waiting for network address for iface=52:54:00:7E:A5:63: timestamp="2024-08-14T23:26:07.000+0800"
2024-08-14T23:26:07.000+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:07 [DEBUG] qemu-agent used to query interface info: timestamp="2024-08-14T23:26:07.000+0800"
2024-08-14T23:26:07.001+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:07 [DEBUG] Interfaces info obtained with libvirt API:
([]libvirt.DomainInterface) <nil>: timestamp="2024-08-14T23:26:07.001+0800"
2024-08-14T23:26:07.001+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:07 [DEBUG] ifaces with addresses: []: timestamp="2024-08-14T23:26:07.001+0800"
2024-08-14T23:26:07.001+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:07 [DEBUG] 52:54:00:7E:A5:63 doesn't have IP address(es) yet...: timestamp="2024-08-14T23:26:07.001+0800"
2024-08-14T23:26:07.001+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:07 [DEBUG] IP address not found for iface=52:54:00:7E:A5:63: will try in a while: timestamp="2024-08-14T23:26:07.001+0800"
2024-08-14T23:26:07.001+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:07 [TRACE] Waiting 10s before next try: timestamp="2024-08-14T23:26:07.001+0800"
libvirt_domain.domain-ubuntu: Still creating... [40s elapsed]
2024-08-14T23:26:17.010+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:17 [DEBUG] waiting for network address for iface=52:54:00:7E:A5:63: timestamp="2024-08-14T23:26:17.010+0800"
2024-08-14T23:26:17.010+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:17 [DEBUG] qemu-agent used to query interface info: timestamp="2024-08-14T23:26:17.010+0800"
2024-08-14T23:26:17.013+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:17 [DEBUG] Interfaces info obtained with libvirt API:
([]libvirt.DomainInterface) (len=2 cap=2) {

(Include debug logs if possible and relevant).


Additional information:

Do you have SELinux or Apparmor/Firewall enabled? Some special configuration? NO

SJFCS avatar Aug 14 '24 15:08 SJFCS

Hello, could you try specifying wait_for_lease with an image that already has qemu-guest-agent installed? I successfully got the IP address from the VM when doing so.

scabala avatar Sep 02 '24 19:09 scabala

Hello, could you try specifying wait_for_lease with an image that already has qemu-guest-agent installed? I successfully got the IP address from the VM when doing so.

Thank you for the suggestion. I haven't tried an image with qemu-guest-agent already installed, because I want qemu-guest-agent to be installed automatically during the cloud-init phase. That was possible in previous versions but does not work in the new version.

SJFCS avatar Sep 03 '24 01:09 SJFCS

I'll try to take a look and see if I can find anything changed that might cause it between those two versions.

scabala avatar Sep 03 '24 06:09 scabala

I couldn't find anything particular between those versions. Also, I don't have bridged network in my setup and it's hard for me to create it so I used NAT-ed one and I couldn't reproduce it.

@SJFCS could you check if you can reproduce it in different network types? NAT-ed and routed for example?

EDIT: forget what I wrote, I can reproduce it, just used wrong image before 🤦

I'll try to bisect and see where problem lies

scabala avatar Sep 18 '24 20:09 scabala

Okay, after more debugging: I cannot reproduce it. Previously I had problems with cloud-init, so I think this might be related to cloud-init itself rather than to the provider.

Either way, I see consistent behavior between 0.7.6 and 0.7.1: it fails if qemu-guest-agent is not installed and started, and runs fine otherwise.

scabala avatar Sep 20 '24 13:09 scabala

I couldn't find anything particular between those versions. Also, I don't have bridged network in my setup and it's hard for me to create it so I used NAT-ed one and I couldn't reproduce it.

@SJFCS could you check if you can reproduce it in different network types? NAT-ed and routed for example?

EDIT: forget what I wrote, I can reproduce it, just used wrong image before 🤦

I'll try to bisect and see where problem lies

The network configuration is the same, I think it has nothing to do with this

SJFCS avatar Sep 21 '24 08:09 SJFCS

Okay, after more debugging: I cannot reproduce it. Previously I had problems with cloud-init, so I think this might be related to cloud-init itself rather than to the provider.

Either way, I see consistent behavior between 0.7.6 and 0.7.1: it fails if qemu-guest-agent is not installed and started, and runs fine otherwise.

Okay, thanks for the troubleshooting, but I did only change the provider version number while keeping the configuration unchanged.

SJFCS avatar Sep 21 '24 08:09 SJFCS

Do you have cloud-init logs for both scenarios?

scabala avatar Sep 21 '24 19:09 scabala

Do you have cloud-init logs for both scenarios?

I have seen the logs in both cases, and they are normal and no errors are reported.

SJFCS avatar Oct 28 '24 13:10 SJFCS

This issue can be reproduced in versions greater than 0.7.1

│ Error: couldn't retrieve IP address of domain id: 3ac397de-13cd-485d-9772-872f7652de0d. Please check following: 
│ 1) is the domain running proplerly? 
│ 2) has the network interface an IP address? 
│ 3) Networking issues on your libvirt setup? 
│  4) is DHCP enabled on this Domain's network? 
│ 5) if you use bridge network, the domain should have the pkg qemu-agent installed 
│ IMPORTANT: This error is not a terraform libvirt-provider error, but an error caused by your KVM/libvirt infrastructure configuration/setup 
│  error retrieving interface addresses: error retrieving interface addresses: Virtual machine agent not responding: QEMU host agent not connected

I found that this is not related to whether the network mode is bridge or nat. To simplify the reproduction process and avoid cloudinit interference, I used the Talos ISO boot image below, which includes qemu-guest-agent and can be booted directly as a boot disk.

The metal-amd64.iso (MD5: ebd98e402606991700d8cb5545e72673) can be downloaded from: https://factory.talos.dev/image/ce4c980550dd2ab1b17bbf2b08801c7eb59418eafe8f279833297925d67c7515/v1.8.2/metal-amd64.iso

You can also build it yourself here: https://factory.talos.dev -> Bare-metal Machine -> choose version -> amd64 -> choose System Extensions qemu-guest-agent

#=====================================================================================
# Providers
#=====================================================================================
terraform {
  required_version = ">= 1.6.0"
  required_providers {
    libvirt = {
      source  = "dmacvicar/libvirt"
      version = "0.7.4"
    }
    template = {
      source  = "hashicorp/template"
      version = "2.2.0"
    }
  }
}

provider "libvirt" {
  uri = "qemu:///system"
}

#=====================================================================================
# Libvirt Pool
#=====================================================================================
resource "libvirt_pool" "kubernetes" {
  name = "talos"
  type = "dir"
  path = "/opt/libvirt-pool/talos"
}

#=====================================================================================
# Network
#=====================================================================================
# resource "libvirt_network" "talos" {
#   name      = "talos"
#   mode      = "bridge"
#   bridge    = "br0" # Use the created bridge network card
#   autostart = true
# }
resource "libvirt_network" "talos" {
  name      = "talos"
  mode      = "nat"
  addresses = ["192.168.123.0/24"]
  autostart = true
}
#=====================================================================================
# Domain
#=====================================================================================
resource "libvirt_domain" "domain-talos" {
  name   = "talos"
  memory = "2048"
  vcpu   = 4
  cpu {
    mode = "host-passthrough"
  }

  qemu_agent = true

  boot_device {
    dev = ["cdrom", "hd", "network"]
  }
  network_interface {
    network_id     = libvirt_network.talos.id
    wait_for_lease = true
  }

  # cdrom
  disk {
    file = "/home/admin/Downloads/images/metal-amd64.iso"
  }
  #=====================================================================================
  # Console
  #=====================================================================================
  console {
    type        = "pty"
    target_port = "0"
    target_type = "serial"
  }

  console {
    type        = "pty"
    target_type = "virtio"
    target_port = "1"
  }

  graphics {
    type        = "spice"
    listen_type = "address"
    autoport    = true
  }
  video {
    type = "virtio"
  }
}

# Output the IP addresses
output "ips" {
  value = {
    ip = libvirt_domain.domain-talos.network_interface[0].addresses
  }
}

Reproduction steps

# set version = "0.7.1"
terraform init
terraform apply -auto-approve
terraform destroy -auto-approve
# it works!
# set version = "0.7.4"
terraform init -upgrade
terraform apply -auto-approve
# it fails!

SJFCS avatar Nov 02 '24 05:11 SJFCS

I wanted to have a look at this issue, but it seems I can reproduce it only with version 0.7.4

Versions 0.7.1, 0.7.6 and 0.8.1 are working fine for me. I pretty much copy-pasted your tf file in the previous comment, minus the template provider.

NamelessOne91 avatar Nov 15 '24 15:11 NamelessOne91

I wanted to have a look at this issue, but it seems I can reproduce it only with version 0.7.4

Versions 0.7.1, 0.7.6 and 0.8.1 are working fine for me. I pretty much copy-pasted your tf file in the previous comment, minus the template provider.

I tried it; 0.7.6 and 0.8.1 do not work for me either.

SJFCS avatar Nov 28 '24 01:11 SJFCS

After analysis, I discovered the key differences:

  1. In version 0.7.1, the domainGetIfacesInfo function has special logic for error handling:

switch virErr := err.(type) {
case libvirt.Error:
    // Agent can be unresponsive if being installed/setup
    if addrsrc == uint32(libvirt.DomainInterfaceAddressesSrcLease) && virErr.Code != uint32(libvirt.ErrOperationInvalid) ||
        addrsrc == uint32(libvirt.DomainInterfaceAddressesSrcAgent) && virErr.Code != uint32(libvirt.ErrAgentUnresponsive) {
        return interfaces, fmt.Errorf("Error retrieving interface addresses: %w", err)
    }
}

  2. In later versions (all versions after 0.7.1), the error handling is simpler:

if err != nil {
    return interfaces, fmt.Errorf("error retrieving interface addresses: %w", err)
}

This is the key to the problem:

  1. In version 0.7.1, if an ErrAgentUnresponsive error is encountered when using qemu-agent to obtain an IP address, the code will ignore the error and continue trying, which gives qemu-agent time to start and respond.

  2. In the newer versions, any error is returned immediately, including ErrAgentUnresponsive, so the apply fails before qemu-agent has had time to start and respond.
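
The difference between the two behaviours can be sketched in isolation. The following is a minimal, self-contained Go illustration and not the provider's code: agentError, codeAgentUnresponsive, queryFunc, strictQuery, tolerantQuery, waitForAddresses and bootingGuest are all hypothetical names, the numeric code is a placeholder rather than libvirt's real value, and the address is made up.

```go
package main

import (
	"errors"
	"fmt"
)

// Placeholder code for illustration only; this is NOT libvirt's real value.
const codeAgentUnresponsive = 1

// agentError is a hypothetical stand-in for libvirt.Error.
type agentError struct {
	Code    int
	Message string
}

func (e agentError) Error() string { return e.Message }

// queryFunc is a hypothetical stand-in for one address query via the guest agent.
type queryFunc func() ([]string, error)

// strictQuery mirrors the newer behaviour described above: every error is fatal.
func strictQuery(q queryFunc) ([]string, error) {
	addrs, err := q()
	if err != nil {
		return nil, fmt.Errorf("error retrieving interface addresses: %w", err)
	}
	return addrs, nil
}

// tolerantQuery mirrors the 0.7.1 behaviour described above: an unresponsive
// agent is reported as "no addresses yet", so the caller keeps polling.
func tolerantQuery(q queryFunc) ([]string, error) {
	addrs, err := q()
	if err != nil {
		var ae agentError
		if errors.As(err, &ae) && ae.Code == codeAgentUnresponsive {
			return nil, nil
		}
		return nil, fmt.Errorf("error retrieving interface addresses: %w", err)
	}
	return addrs, nil
}

// waitForAddresses polls until addresses appear, an error occurs, or the
// attempts run out.
func waitForAddresses(query func(queryFunc) ([]string, error), q queryFunc, attempts int) ([]string, error) {
	for i := 0; i < attempts; i++ {
		addrs, err := query(q)
		if err != nil {
			return nil, err
		}
		if len(addrs) > 0 {
			return addrs, nil
		}
		// A real loop would sleep between tries; omitted to keep the sketch fast.
	}
	return nil, errors.New("timed out waiting for addresses")
}

// bootingGuest simulates an agent that only answers from the third query on.
func bootingGuest() queryFunc {
	calls := 0
	return func() ([]string, error) {
		calls++
		if calls < 3 {
			return nil, agentError{Code: codeAgentUnresponsive, Message: "guest agent is not connected"}
		}
		return []string{"192.168.123.10"}, nil
	}
}

func main() {
	_, err := waitForAddresses(strictQuery, bootingGuest(), 5)
	fmt.Println("strict:", err)

	addrs, err := waitForAddresses(tolerantQuery, bootingGuest(), 5)
	fmt.Println("tolerant:", addrs, err)
}
```

With the strict variant the first "agent not connected" answer aborts the wait, while the tolerant variant keeps polling until the simulated agent responds. This matches the observation that an image still installing qemu-guest-agent through cloud-init only works when that one error is treated as retryable.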

@NamelessOne91 @scabala @dmacvicar I submitted a PR 1144

SJFCS avatar Jan 11 '25 11:01 SJFCS