
Invalid Pod spec leads to config volume never being deleted

Open christoph-hamm opened this issue 1 year ago • 1 comment

Current Behavior

If one of the resources in the manifest part of a podman-kube workload is invalid, both podman kube play and podman kube down can fail. During the creation of the workload, the config volume is written first, and this volume is not deleted if the deletion of the workload fails, so the volume is never removed. Each time the Ankaios agent starts, it sees the volume and tries to delete this workload again. After a restart of the Ankaios agent, it also confuses the workload instance names of the incorrect and the correct workload and deletes the correct workload.
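For context, the workload execution instance names visible in the logs follow the pattern `<workload>.<hash>.<agent>`. The hash appears to be derived from the workload configuration, so fixing the manifest produces a new instance name while the leftover config volume still carries the old one. A minimal sketch of such a naming scheme (the exact hash input used by Ankaios is an assumption here):

```python
import hashlib


def instance_name(workload: str, agent: str, runtime_config: str) -> str:
    """Build an instance name like 'nginx.<64-hex-hash>.agent_A'.

    Assumption: the hash is a SHA-256 over the runtime configuration;
    the real input used by Ankaios may differ.
    """
    digest = hashlib.sha256(runtime_config.encode()).hexdigest()
    return f"{workload}.{digest}.{agent}"


# Changing one character in the manifest ("1" -> "v1") yields a new
# instance name, so the old ".config" volume no longer matches.
old = instance_name("nginx", "agent_A", "apiVersion: 1\n...")
new = instance_name("nginx", "agent_A", "apiVersion: v1\n...")
assert old != new
```

This illustrates why the agent ends up with two instances of the same workload name after the fix.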

Expected Behavior

After a podman-kube workload has been deleted and no Podman resources remain for it, the Ankaios agent shall delete the config volume of this workload, even if podman kube down fails.

The Ankaios agent shall also be able to handle multiple existing workload instances with the same workload name and delete only the workload instances that are no longer needed.
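The expected handling of duplicates could be sketched as a pure selection step: among all existing instances of a workload name, keep only the one matching the desired configuration and mark every other one for deletion, regardless of whether podman kube down succeeds. A hypothetical sketch (function name and truncated hashes are illustrative, not Ankaios APIs):

```python
def instances_to_delete(existing: list[str], desired: str) -> list[str]:
    """Return the instance names that should be removed.

    `existing` holds instance names like 'nginx.<hash>.agent_A'.
    Only the entry equal to `desired` is kept; every other instance
    with the same workload name is deleted, including stale ones
    left behind by a failed `podman kube play`.
    """
    workload = desired.split(".")[0]
    return [
        name for name in existing
        if name.split(".")[0] == workload and name != desired
    ]


existing = ["nginx.986b8d2f.agent_A", "nginx.ca7b4375.agent_A"]
assert instances_to_delete(existing, "nginx.ca7b4375.agent_A") == [
    "nginx.986b8d2f.agent_A"
]
```

With this behavior, the third start below would delete only the stale `986b8d2f...` instance instead of also tearing down the running one.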

Steps to Reproduce

  1. Start ank-server with the startup state given below
  2. Start ank-agent
  3. Stop ank-agent and ank-server
  4. Fix the error in the startup state by setting the apiVersion to v1
  5. Start ank-server
  6. Start ank-agent
  7. Stop and start ank-agent
  • Use podman volume ls to see that the config volume for the nginx workload is not deleted.
  • Look at the log output of ank-agent during the third start. The ank-agent sees two reusable workloads, although one should already have been deleted. It also fails to delete the first workload again.

state.yml:

workloads:
   nginx:
     runtime: podman-kube
     agent: agent_A
     restart: true
     updateStrategy: AT_MOST_ONCE
     accessRights:
       allow: []
       deny: []
     tags: []
     runtimeConfig: |
       manifest: |
         apiVersion: 1
         kind: Pod
         metadata:
           name: nginx
         spec:
           restartPolicy: Never
           containers:
           - name: server
             image: docker.io/nginx:latest
             ports:
             - containerPort: 80
               hostPort: 8080
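The error in this manifest is that `apiVersion: 1` is parsed as a YAML number, while Podman's Go decoder expects a string, hence the "cannot unmarshal number into Go struct field Pod.apiVersion" failure in the logs. A minimal check for this class of mistake (a sketch, not part of Ankaios or Podman):

```python
def check_api_version(manifest: dict) -> list[str]:
    """Flag an apiVersion that was parsed as a number instead of a string."""
    errors = []
    value = manifest.get("apiVersion")
    if not isinstance(value, str):
        errors.append(
            f"apiVersion must be a string like 'v1', got {value!r} "
            f"({type(value).__name__}); write v1 or quote the value"
        )
    return errors


# YAML resolves `apiVersion: 1` to the integer 1, `apiVersion: v1` to "v1".
assert check_api_version({"apiVersion": 1, "kind": "Pod"}) != []
assert check_api_version({"apiVersion": "v1", "kind": "Pod"}) == []
```

Validating the manifest before writing the config volume would avoid creating the volume for a workload that can never start.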

Context (Environment)

Tested inside LXD containers running Arch Linux and Ubuntu 23.

This also interacts with a Podman error, as a failing podman kube play can leave behind resources that were already created (Podman #17434).

Logs

ank-agent first start:

[2023-11-29T13:04:05Z DEBUG ank_agent] Starting the Ankaios agent with
        name: 'agent_A',
        server url: 'http://127.0.0.1:25551/',
        run directory: '/tmp/ankaios/'
[2023-11-29T13:04:05Z TRACE ank_agent::control_interface::directory] Reusing existing directory '"/tmp/ankaios/agent_A_io"'
[2023-11-29T13:04:05Z INFO  ank_agent::agent_manager] Starting ...
[2023-11-29T13:04:05Z DEBUG ank_agent::agent_manager] Start listening to server.
[2023-11-29T13:04:05Z DEBUG grpc::client] gRPC Communication Client starts.
[2023-11-29T13:04:05Z TRACE grpc::execution_command_proxy] RESPONSE=ExecutionRequest { execution_request_enum: Some(UpdateWorkload(UpdateWorkload { added_workloads: [AddedWorkload { name: "nginx", runtime: "podman-kube", dependencies: {},
restart: true, update_strategy: AtMostOnce, access_rights: None, tags: [], runtime_config: "manifest: |\n  apiVersion: 1\n  kind: Pod\n  metadata:\n    name: nginx\n  spec:\n    restartPolicy: Never\n    containers:\n    - name: server\n
    image: docker.io/nginx:latest\n      ports:\n      - containerPort: 80\n        hostPort: 8080\n" }], deleted_workloads: [] })) }
[2023-11-29T13:04:05Z DEBUG ank_agent::agent_manager] Agent 'agent_A' received UpdateWorkload:
        Added workloads: [WorkloadSpec { agent: "agent_A", name: "nginx", tags: [], dependencies: {}, update_strategy: AtMostOnce, restart: true, access_rights: AccessRights { allow: [], deny: [] }, runtime: "podman-kube", runtime_config:
"manifest: |\n  apiVersion: 1\n  kind: Pod\n  metadata:\n    name: nginx\n  spec:\n    restartPolicy: Never\n    containers:\n    - name: server\n      image: docker.io/nginx:latest\n      ports:\n      - containerPort: 80\n        hostPort: 8080\n" }]
        Deleted workloads: []
[2023-11-29T13:04:05Z INFO  ank_agent::runtime_manager] Received a new desired state with '1' added and '0' deleted workloads.
[2023-11-29T13:04:05Z DEBUG ank_agent::runtime_manager] Handling initial workload list.
[2023-11-29T13:04:05Z DEBUG ank_agent::runtime_connectors::runtime_facade] Searching for reusable 'podman-kube' workloads on agent 'agent_A'.
[2023-11-29T13:04:06Z INFO  ank_agent::runtime_manager] Found '0' reusable 'podman-kube' workload(s).
[2023-11-29T13:04:06Z DEBUG ank_agent::runtime_connectors::runtime_facade] Searching for reusable 'podman' workloads on agent 'agent_A'.
[2023-11-29T13:04:06Z TRACE ank_agent::runtime_connectors::podman_cli] Listing workload names for: 'agent'='agent_A'
[2023-11-29T13:04:06Z DEBUG ank_agent::runtime_connectors::podman::podman_runtime] Found 0 reusable workload(s): '[]'
[2023-11-29T13:04:06Z INFO  ank_agent::runtime_manager] Found '0' reusable 'podman' workload(s).
[2023-11-29T13:04:06Z DEBUG ank_agent::runtime_manager] Creating control interface pipes for 'WorkloadSpec { agent: "agent_A", name: "nginx", tags: [], dependencies: {}, update_strategy: AtMostOnce, restart: true, access_rights: AccessRights { allow: [], deny: [] }, runtime: "podman-kube", runtime_config: "manifest: |\n  apiVersion: 1\n  kind: Pod\n  metadata:\n    name: nginx\n  spec:\n    restartPolicy: Never\n    containers:\n    - name: server\n      image: docker.io/nginx:latest\n      ports:\n      - containerPort: 80\n        hostPort: 8080\n" }'
[2023-11-29T13:04:06Z TRACE ank_agent::control_interface::directory] Reusing existing directory '"/tmp/ankaios/agent_A_io/nginx.986b8d2fac1174412d106c512cd7d27aeb237af2b8e96642405606f92918e589"'
[2023-11-29T13:04:06Z TRACE ank_agent::control_interface::fifo] Reusing existing fifo file '"/tmp/ankaios/agent_A_io/nginx.986b8d2fac1174412d106c512cd7d27aeb237af2b8e96642405606f92918e589/input"'
[2023-11-29T13:04:06Z TRACE ank_agent::control_interface::fifo] Reusing existing fifo file '"/tmp/ankaios/agent_A_io/nginx.986b8d2fac1174412d106c512cd7d27aeb237af2b8e96642405606f92918e589/output"'
[2023-11-29T13:04:06Z INFO  ank_agent::runtime_connectors::runtime_facade] Creating 'podman-kube' workload 'nginx' on agent 'agent_A'
[2023-11-29T13:04:06Z WARN  ank_agent::runtime_connectors::runtime_facade] Failed to create workload: 'nginx': 'Could not create workload: 'Execution of command failed: Error: unable to read YAML as Kube Pod: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal number into Go struct field Pod.apiVersion of type string
    ''

podman volume ls after first start:

DRIVER      VOLUME NAME
local       nginx.986b8d2fac1174412d106c512cd7d27aeb237af2b8e96642405606f92918e589.agent_A.config

ank-agent second start:

[2023-11-29T13:04:21Z DEBUG ank_agent] Starting the Ankaios agent with
        name: 'agent_A',
        server url: 'http://127.0.0.1:25551/',
        run directory: '/tmp/ankaios/'
[2023-11-29T13:04:21Z TRACE ank_agent::control_interface::directory] Reusing existing directory '"/tmp/ankaios/agent_A_io"'
[2023-11-29T13:04:21Z DEBUG grpc::client] gRPC Communication Client starts.
[2023-11-29T13:04:21Z INFO  ank_agent::agent_manager] Starting ...
[2023-11-29T13:04:21Z DEBUG ank_agent::agent_manager] Start listening to server.
[2023-11-29T13:04:21Z TRACE grpc::execution_command_proxy] RESPONSE=ExecutionRequest { execution_request_enum: Some(UpdateWorkload(UpdateWorkload { added_workloads: [AddedWorkload { name: "nginx", runtime: "podman-kube", dependencies: {},
restart: true, update_strategy: AtMostOnce, access_rights: None, tags: [], runtime_config: "manifest: |\n  apiVersion: v1\n  kind: Pod\n  metadata:\n    name: nginx\n  spec:\n    restartPolicy: Never\n    containers:\n    - name: server\n
     image: docker.io/nginx:latest\n      ports:\n      - containerPort: 80\n        hostPort: 8080\n" }], deleted_workloads: [] })) }
[2023-11-29T13:04:21Z DEBUG ank_agent::agent_manager] Agent 'agent_A' received UpdateWorkload:
        Added workloads: [WorkloadSpec { agent: "agent_A", name: "nginx", tags: [], dependencies: {}, update_strategy: AtMostOnce, restart: true, access_rights: AccessRights { allow: [], deny: [] }, runtime: "podman-kube", runtime_config:
"manifest: |\n  apiVersion: v1\n  kind: Pod\n  metadata:\n    name: nginx\n  spec:\n    restartPolicy: Never\n    containers:\n    - name: server\n      image: docker.io/nginx:latest\n      ports:\n      - containerPort: 80\n        hostPort: 8080\n" }]
        Deleted workloads: []
[2023-11-29T13:04:21Z INFO  ank_agent::runtime_manager] Received a new desired state with '1' added and '0' deleted workloads.
[2023-11-29T13:04:21Z DEBUG ank_agent::runtime_manager] Handling initial workload list.
[2023-11-29T13:04:21Z DEBUG ank_agent::runtime_connectors::runtime_facade] Searching for reusable 'podman' workloads on agent 'agent_A'.
[2023-11-29T13:04:21Z TRACE ank_agent::runtime_connectors::podman_cli] Listing workload names for: 'agent'='agent_A'
[2023-11-29T13:04:21Z DEBUG ank_agent::runtime_connectors::podman::podman_runtime] Found 0 reusable workload(s): '[]'
[2023-11-29T13:04:21Z INFO  ank_agent::runtime_manager] Found '0' reusable 'podman' workload(s).
[2023-11-29T13:04:21Z DEBUG ank_agent::runtime_connectors::runtime_facade] Searching for reusable 'podman-kube' workloads on agent 'agent_A'.
[2023-11-29T13:04:21Z INFO  ank_agent::runtime_manager] Found '1' reusable 'podman-kube' workload(s).
[2023-11-29T13:04:21Z DEBUG ank_agent::runtime_manager] Creating control interface pipes for 'WorkloadSpec { agent: "agent_A", name: "nginx", tags: [], dependencies: {}, update_strategy: AtMostOnce, restart: true, access_rights: AccessRights { allow: [], deny: [] }, runtime: "podman-kube", runtime_config: "manifest: |\n  apiVersion: v1\n  kind: Pod\n  metadata:\n    name: nginx\n  spec:\n    restartPolicy: Never\n    containers:\n    - name: server\n      image: docker.io/nginx:latest\n      ports:\n      - containerPort: 80\n        hostPort: 8080\n" }'
[2023-11-29T13:04:21Z TRACE ank_agent::control_interface::directory] Reusing existing directory '"/tmp/ankaios/agent_A_io/nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58"'
[2023-11-29T13:04:21Z TRACE ank_agent::control_interface::fifo] Reusing existing fifo file '"/tmp/ankaios/agent_A_io/nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58/input"'
[2023-11-29T13:04:21Z TRACE ank_agent::control_interface::fifo] Reusing existing fifo file '"/tmp/ankaios/agent_A_io/nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58/output"'
[2023-11-29T13:04:21Z INFO  ank_agent::runtime_connectors::runtime_facade] Replacing 'podman-kube' workload 'nginx' on agent 'agent_A'
[2023-11-29T13:04:21Z WARN  ank_agent::runtime_connectors::podman_kube::podman_kube_runtime] Could not read pods from volume: "Execution of command failed: Error: no such volume nginx.986b8d2fac1174412d106c512cd7d27aeb237af2b8e96642405606f92918e589.agent_A.pods\n"
[2023-11-29T13:04:21Z DEBUG ank_agent::runtime_connectors::podman_kube::podman_kube_runtime] Deleting workload with workload execution instance name 'nginx.986b8d2fac1174412d106c512cd7d27aeb237af2b8e96642405606f92918e589.agent_A'
[2023-11-29T13:04:21Z WARN  ank_agent::runtime_connectors::runtime_facade] Failed to delete workload when replacing workload 'nginx': 'Could not delete workload 'Execution of command failed: Error: unable to read YAML as Kube Pod: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal number into Go struct field Pod.apiVersion of type string
    ''
[2023-11-29T13:04:21Z DEBUG ank_agent::runtime_connectors::podman_kube::podman_kube_runtime] The workload 'nginx' has been created with workload execution instance name 'WorkloadExecutionInstanceName { agent_name: "agent_A", workload_name: "nginx", hash: "ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58" }'
[2023-11-29T13:04:21Z DEBUG ank_agent::runtime_connectors::podman_kube::podman_kube_runtime] Starting the checker for the workload 'nginx' with workload execution instance name 'nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58.agent_A'
[2023-11-29T13:04:21Z TRACE ank_agent::runtime_connectors::podman_kube::podman_kube_runtime] Getting the state for the workload 'nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58.agent_A'
[2023-11-29T13:04:21Z TRACE ank_agent::runtime_connectors::podman_kube::podman_kube_runtime] Received following states for workload 'nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58.agent_A': '[Running, Running]'
[2023-11-29T13:04:21Z DEBUG ank_agent::generic_polling_state_checker] The workload nginx has changed its state to ExecRunning
[2023-11-29T13:04:21Z TRACE grpc::state_change_proxy] Received UpdateWorkloadState from agent
[2023-11-29T13:04:22Z TRACE ank_agent::runtime_connectors::podman_kube::podman_kube_runtime] Getting the state for the workload 'nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58.agent_A'
[2023-11-29T13:04:22Z TRACE ank_agent::runtime_connectors::podman_kube::podman_kube_runtime] Received following states for workload 'nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58.agent_A': '[Running, Running]'
[2023-11-29T13:04:23Z TRACE ank_agent::runtime_connectors::podman_kube::podman_kube_runtime] Getting the state for the workload 'nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58.agent_A'
[2023-11-29T13:04:23Z TRACE ank_agent::runtime_connectors::podman_kube::podman_kube_runtime] Received following states for workload 'nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58.agent_A': '[Running, Running]'
[2023-11-29T13:04:24Z TRACE ank_agent::runtime_connectors::podman_kube::podman_kube_runtime] Getting the state for the workload 'nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58.agent_A'
[2023-11-29T13:04:24Z TRACE ank_agent::runtime_connectors::podman_kube::podman_kube_runtime] Received following states for workload 'nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58.agent_A': '[Running, Running]'

podman volume ls after second start:

DRIVER      VOLUME NAME
local       nginx.986b8d2fac1174412d106c512cd7d27aeb237af2b8e96642405606f92918e589.agent_A.config
local       nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58.agent_A.config
local       nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58.agent_A.pods

ank-agent third start:

[2023-11-29T13:04:26Z DEBUG ank_agent] Starting the Ankaios agent with
        name: 'agent_A',
        server url: 'http://127.0.0.1:25551/',
        run directory: '/tmp/ankaios/'
[2023-11-29T13:04:26Z TRACE ank_agent::control_interface::directory] Reusing existing directory '"/tmp/ankaios/agent_A_io"'
[2023-11-29T13:04:26Z INFO  ank_agent::agent_manager] Starting ...
[2023-11-29T13:04:26Z DEBUG ank_agent::agent_manager] Start listening to server.
[2023-11-29T13:04:26Z DEBUG grpc::client] gRPC Communication Client starts.
[2023-11-29T13:04:26Z TRACE grpc::execution_command_proxy] RESPONSE=ExecutionRequest { execution_request_enum: Some(UpdateWorkload(UpdateWorkload { added_workloads: [AddedWorkload { name: "nginx", runtime: "podman-kube", dependencies: {},
restart: true, update_strategy: AtMostOnce, access_rights: None, tags: [], runtime_config: "manifest: |\n  apiVersion: v1\n  kind: Pod\n  metadata:\n    name: nginx\n  spec:\n    restartPolicy: Never\n    containers:\n    - name: server\n
     image: docker.io/nginx:latest\n      ports:\n      - containerPort: 80\n        hostPort: 8080\n" }], deleted_workloads: [] })) }
[2023-11-29T13:04:26Z DEBUG ank_agent::agent_manager] Agent 'agent_A' received UpdateWorkload:
        Added workloads: [WorkloadSpec { agent: "agent_A", name: "nginx", tags: [], dependencies: {}, update_strategy: AtMostOnce, restart: true, access_rights: AccessRights { allow: [], deny: [] }, runtime: "podman-kube", runtime_config:
"manifest: |\n  apiVersion: v1\n  kind: Pod\n  metadata:\n    name: nginx\n  spec:\n    restartPolicy: Never\n    containers:\n    - name: server\n      image: docker.io/nginx:latest\n      ports:\n      - containerPort: 80\n        hostPort: 8080\n" }]
        Deleted workloads: []
[2023-11-29T13:04:26Z INFO  ank_agent::runtime_manager] Received a new desired state with '1' added and '0' deleted workloads.
[2023-11-29T13:04:26Z DEBUG ank_agent::runtime_manager] Handling initial workload list.
[2023-11-29T13:04:26Z DEBUG ank_agent::runtime_connectors::runtime_facade] Searching for reusable 'podman' workloads on agent 'agent_A'.
[2023-11-29T13:04:26Z TRACE ank_agent::runtime_connectors::podman_cli] Listing workload names for: 'agent'='agent_A'
[2023-11-29T13:04:26Z DEBUG ank_agent::runtime_connectors::podman::podman_runtime] Found 0 reusable workload(s): '[]'
[2023-11-29T13:04:26Z INFO  ank_agent::runtime_manager] Found '0' reusable 'podman' workload(s).
[2023-11-29T13:04:26Z DEBUG ank_agent::runtime_connectors::runtime_facade] Searching for reusable 'podman-kube' workloads on agent 'agent_A'.
[2023-11-29T13:04:26Z INFO  ank_agent::runtime_manager] Found '2' reusable 'podman-kube' workload(s).
[2023-11-29T13:04:26Z DEBUG ank_agent::runtime_manager] Creating control interface pipes for 'WorkloadSpec { agent: "agent_A", name: "nginx", tags: [], dependencies: {}, update_strategy: AtMostOnce, restart: true, access_rights: AccessRights { allow: [], deny: [] }, runtime: "podman-kube", runtime_config: "manifest: |\n  apiVersion: v1\n  kind: Pod\n  metadata:\n    name: nginx\n  spec:\n    restartPolicy: Never\n    containers:\n    - name: server\n      image: docker.io/nginx:latest\n      ports:\n      - containerPort: 80\n        hostPort: 8080\n" }'
[2023-11-29T13:04:26Z TRACE ank_agent::control_interface::directory] Reusing existing directory '"/tmp/ankaios/agent_A_io/nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58"'
[2023-11-29T13:04:26Z TRACE ank_agent::control_interface::fifo] Reusing existing fifo file '"/tmp/ankaios/agent_A_io/nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58/input"'
[2023-11-29T13:04:26Z TRACE ank_agent::control_interface::fifo] Reusing existing fifo file '"/tmp/ankaios/agent_A_io/nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58/output"'
[2023-11-29T13:04:26Z INFO  ank_agent::runtime_connectors::runtime_facade] Replacing 'podman-kube' workload 'nginx' on agent 'agent_A'
[2023-11-29T13:04:26Z INFO  ank_agent::runtime_connectors::runtime_facade] Deleting 'podman-kube' workload 'nginx' on agent 'agent_A'
[2023-11-29T13:04:26Z DEBUG ank_agent::runtime_connectors::podman_kube::podman_kube_runtime] Deleting workload with workload execution instance name 'nginx.ca7b437551978d6c73fc2c629fabe4d9a5e59190af669c009ad5659e6b43ef58.agent_A'
[2023-11-29T13:04:26Z WARN  ank_agent::runtime_connectors::podman_kube::podman_kube_runtime] Could not read pods from volume: "Execution of command failed: Error: no such volume nginx.986b8d2fac1174412d106c512cd7d27aeb237af2b8e96642405606f92918e589.agent_A.pods\n"
[2023-11-29T13:04:26Z DEBUG ank_agent::runtime_connectors::podman_kube::podman_kube_runtime] Deleting workload with workload execution instance name 'nginx.986b8d2fac1174412d106c512cd7d27aeb237af2b8e96642405606f92918e589.agent_A'
[2023-11-29T13:04:26Z WARN  ank_agent::runtime_connectors::runtime_facade] Failed to delete workload when replacing workload 'nginx': 'Could not delete workload 'Execution of command failed: Error: unable to read YAML as Kube Pod: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal number into Go struct field Pod.apiVersion of type string
    ''
[2023-11-29T13:04:27Z WARN  ank_agent::runtime_connectors::runtime_facade] Failed to create workload when replacing workload 'nginx': 'Could not create workload: 'Execution of command failed: Error: adding pod to state: name "nginx" is in
use: pod already exists
    ''

podman volume ls after third start:

DRIVER      VOLUME NAME
local       nginx.986b8d2fac1174412d106c512cd7d27aeb237af2b8e96642405606f92918e589.agent_A.config

Final result

To be filled by the one closing the issue.

christoph-hamm avatar Nov 29 '23 13:11 christoph-hamm