
Deployment changes not reported when workload description or metadata is changed

Open scottyeager opened this issue 1 year ago • 5 comments

I noticed some unexpected behavior while testing workload updates. The flow is like this:

  1. Create a deployment
  2. Update the deployment with a change to metadata or description on some workload
  3. Zos accepts the update without error and executes the change
  4. But when calling zos.deployment.changes, the new version containing the changes is not included in the reply

Context

This issue came up while attempting to update deployments created in the Playground using tfgrid-sdk-go code. Since metadata isn't supported in the SDK, any existing metadata is stripped when deployments are converted between Zos and SDK workload abstractions. The Playground, on the other hand, attaches metadata to the workloads it creates, so an update that removes that metadata is an unavoidable side effect of the current implementations.

Steps to reproduce

Reproducing the metadata case requires some custom code, since we can't update deployments in the Playground and can't set metadata using Terraform. The description case, however, can be reproduced in a simple way with Terraform.

For brevity, I'll just show the relevant deployment blocks (assume the network is provided). First start with a VM and disk:

resource "grid_deployment" "d1" {
  node = 2
  network_name = grid_network.net.name

  disks {
    name = "mydisk1"
    size = 15
    description = ""
  }

  vms {
    name = "vm1"
    flist = "https://hub.grid.tf/tf-official-vms/ubuntu-22.04.flist"
    cpu = 1
    memory = 1024
    entrypoint = "/sbin/zinit init"
    planetary = true
    mounts {
      disk_name = "mydisk1"
      mount_point = "/"
    }
    env_vars = {
      SSH_KEY = file("~/.ssh/id_rsa.pub")
    }
  }
}
terraform apply

Next we're going to put the disk into a detached state by removing the VM from the deployment. To confirm that Zos is actually destroying the VM, we can check the Zos logs or watch some ongoing network communication with the VM, like a running ping. Since the deployment will be reverted to its original state, it's necessary to check while the next apply is happening.

Note that we add a description to the disk:

resource "grid_deployment" "d1" {
  node = 2
  network_name = grid_network.net.name

  disks {
    name = "mydisk1"
    size = 15
    description = "test" 
  }
}
terraform apply

It looks like it's working, but eventually times out:

...
grid_deployment.d1: Still modifying... [id=26989, 4m10s elapsed]
grid_deployment.d1: Still modifying... [id=26989, 4m20s elapsed]
grid_deployment.d1: Still modifying... [id=26989, 4m30s elapsed]
╷
│ Error: couldn't update deployment with error: error waiting deployment: waiting for deployment 26989 timed out
│ 
│   with grid_deployment.d1,
│   on main.tf line 22, in resource "grid_deployment" "d1":
│   22: resource "grid_deployment" "d1" {

Analysis

In the background, the Go SDK is being called like this:

DeploymentDeployer.Deploy > Deployer.Deploy > Deployer.deploy

The deploy function sends the zos.deployment.update message to the node, and no error is returned.

It then enters the Wait function loop, where zos.deployment.changes is called repeatedly. Since Zos never returns the new version number, this check never passes. Eventually the timeout is hit and the deployment is reverted to its original state.

Meanwhile, we can clearly see that Zos is applying the changes by destroying the VM.

Fix

I see two possibilities:

  1. Updates to metadata and description are not supported, in which case Zos should return an error when such an update is attempted
  2. Such updates are supported, but Zos is not reporting the changes properly

scottyeager avatar Nov 30 '23 18:11 scottyeager

interesting, i will look into it

muhamadazmy avatar Dec 04 '23 09:12 muhamadazmy

After looking into how updates work in zos: disk workloads don't consider any changes other than disk size, so a disk is not considered changed unless its size changed. After discussions with @muhamadazmy we decided that any change in the workload's data will be considered a valid change and will be set as the current workload, but only if the version changed.

AbdelrahmanElawady avatar Dec 05 '23 15:12 AbdelrahmanElawady

Thanks for the details @AbdelrahmanElawady.

What you wrote can explain why this case fails to be reported as an update by Zos:

Disk -> Disk (with description)

But I'm not understanding how it explains the failure in this case (deployment should still be updated, because the VM has been removed, right?):

VM + Disk -> Disk (with description)

scottyeager avatar Dec 06 '23 16:12 scottyeager

After looking into it with @sameh-farouk, both issues are caused by zos ignoring the disk changes. The reason for timing out in the deletion case has nothing to do with the VM: just changing the disk description with no VM will also time out. This is mainly because the Go client considers the disk changed and keeps waiting until it sees the new version with an OK state. However, since ZOS doesn't consider it a change, it never increases the version or applies the change. So the client keeps waiting until it reaches the time limit, because the version in ZOS will never equal the one the Go client is waiting for. Here is the condition that assumes the disk version should change in the Go client. Also, we tried creating and updating a similar deployment to the one in the issue with RMB calls, and deleting the VM works fine with the changes reported as expected. The only issue is the disk not updating.

AbdelrahmanElawady avatar Dec 10 '23 14:12 AbdelrahmanElawady

Ah yes, makes sense now @AbdelrahmanElawady. I had forgotten that the disk and VM are separate workloads, and that the client is looking to confirm the version change on both. Thanks :+1:

scottyeager avatar Dec 19 '23 19:12 scottyeager