Pebble cannot start `argo-controller` service: localhost:9090 address already in use
Bug Description
This issue was encountered in the deploy-cfk-to-eks (1.9) action in this repository. The full logs can be found here.
This is the status of the kubeflow model:
Model Controller Cloud/Region Version SLA Timestamp
kubeflow kubeflow-controller eks/eu-central-1 3.4.4 unsupported 10:49:00Z
App Version Status Scale Charm Channel Rev Address Exposed Message
admission-webhook active 1 admission-webhook 1.9/stable 344 10.100.141.48 no
argo-controller waiting 1 argo-controller 3.4/stable 545 10.100.65.26 no installing agent
dex-auth active 1 dex-auth 2.39/stable 548 10.100.159.91 no
envoy active 1 envoy 2.2/stable 263 10.100.179.113 no
istio-ingressgateway active 1 istio-gateway 1.22/stable 1127 10.100.37.243 no
istio-pilot active 1 istio-pilot 1.22/stable 1079 10.100.213.221 no
jupyter-controller active 1 jupyter-controller 1.9/stable 1038 10.100.141.89 no
jupyter-ui active 1 jupyter-ui 1.9/stable 961 10.100.101.218 no
katib-controller active 1 katib-controller 0.17/stable 750 10.100.83.209 no
katib-db 8.0.36-0ubuntu0.22.04.1 active 1 mysql-k8s 8.0/stable 153 10.100.140.12 no
katib-db-manager active 1 katib-db-manager 0.17/stable 713 10.100.169.0 no
katib-ui active 1 katib-ui 0.17/stable 713 10.100.229.177 no
kfp-api active 1 kfp-api 2.2/stable 1552 10.100.61.32 no
kfp-db 8.0.36-0ubuntu0.22.04.1 active 1 mysql-k8s 8.0/stable 153 10.100.157.254 no
kfp-metadata-writer active 1 kfp-metadata-writer 2.2/stable 617 10.100.132.118 no
kfp-persistence active 1 kfp-persistence 2.2/stable 1560 10.100.75.21 no
kfp-profile-controller active 1 kfp-profile-controller 2.2/stable 1518 10.100.167.176 no
kfp-schedwf waiting 1 kfp-schedwf 2.2/stable 1571 10.100.12.101 no installing agent
kfp-ui active 1 kfp-ui 2.2/stable 1555 10.100.247.192 no
kfp-viewer active 1 kfp-viewer 2.2/stable 1586 10.100.90.122 no
kfp-viz active 1 kfp-viz 2.2/stable 1504 10.100.233.24 no
knative-eventing active 1 knative-eventing 1.12/stable 459 10.100.254.114 no
knative-operator active 1 knative-operator 1.12/stable 433 10.100.125.79 no
knative-serving active 1 knative-serving 1.12/stable 460 10.100.111.156 no
kserve-controller active 1 kserve-controller 0.13/stable 597 10.100.90.248 no
kubeflow-dashboard active 1 kubeflow-dashboard 1.9/stable 659 10.100.70.77 no
kubeflow-profiles active 1 kubeflow-profiles 1.9/stable 419 10.100.80.72 no
kubeflow-roles active 1 kubeflow-roles 1.9/stable 240 10.100.207.159 no
kubeflow-volumes active 1 kubeflow-volumes 1.9/stable 348 10.100.237.212 no
metacontroller-operator active 1 metacontroller-operator 3.0/stable 311 10.100.106.16 no
minio res:oci-image@5102166 active 1 minio ckf-1.9/stable 347 10.100.95.94 no
mlmd active 1 mlmd ckf-1.9/stable 213 10.100.23.120 no
oidc-gatekeeper active 1 oidc-gatekeeper ckf-1.9/stable 423 10.100.195.15 no
pvcviewer-operator active 1 pvcviewer-operator 1.9/stable 157 10.100.89.110 no
tensorboard-controller active 1 tensorboard-controller 1.9/stable 333 10.100.41.128 no
tensorboards-web-app active 1 tensorboards-web-app 1.9/stable 321 10.100.132.117 no
training-operator active 1 training-operator 1.8/stable 503 10.100.196.94 no
Unit Workload Agent Address Ports Message
admission-webhook/0* active idle 192.168.35.115
argo-controller/0* waiting idle 192.168.48.34 [container:argo-controller] Waiting for Pebble services (argo-controller). If this persists, it could be a blocking ...
dex-auth/0* active idle 192.168.16.108
envoy/0* active idle 192.168.32.159
istio-ingressgateway/0* active idle 192.168.53.251
istio-pilot/0* active idle 192.168.55.177
jupyter-controller/0* active idle 192.168.7.169
jupyter-ui/0* active idle 192.168.52.237
katib-controller/0* active idle 192.168.5.97
katib-db-manager/0* active idle 192.168.63.216
katib-db/0* active idle 192.168.29.5 Primary
katib-ui/0* active idle 192.168.40.53
kfp-api/0* active idle 192.168.47.139
kfp-db/0* active idle 192.168.44.231 Primary
kfp-metadata-writer/0* active idle 192.168.5.186
kfp-persistence/0* active idle 192.168.31.174
kfp-profile-controller/0* active idle 192.168.4.69
kfp-schedwf/0* active idle 192.168.26.199
kfp-ui/0* active idle 192.168.63.63
kfp-viewer/0* active idle 192.168.10.61
kfp-viz/0* active idle 192.168.6.70
knative-eventing/0* active idle 192.168.11.160
knative-operator/0* active idle 192.168.11.169
knative-serving/0* active idle 192.168.4.247
kserve-controller/0* active idle 192.168.29.105
kubeflow-dashboard/0* active idle 192.168.18.73
kubeflow-profiles/0* active idle 192.168.37.127
kubeflow-roles/0* active idle 192.168.37.187
kubeflow-volumes/0* active idle 192.168.50.26
metacontroller-operator/0* active idle 192.168.63.162
minio/0* active idle 192.168.61.122 9000-9001/TCP
mlmd/0* active idle 192.168.41.252
oidc-gatekeeper/0* active idle 192.168.62.78
pvcviewer-operator/0* active idle 192.168.36.84
tensorboard-controller/0* active idle 192.168.24.115
tensorboards-web-app/0* active idle 192.168.10.10
training-operator/0* active idle 192.168.51.141
Note that all charms are active except argo-controller and kfp-schedwf. Since the kfp-schedwf/0 unit itself is active, the kfp-schedwf application's waiting status is most likely transient.
The issue seems to be that Pebble cannot start the argo-controller service because two processes are trying to bind the same port (9090) inside the workload container. Here are the key lines from the logs (the full traceback and service output are reproduced in the Relevant Log Output section below):
ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "argo-controller" (cannot start service: exited quickly with code 2)
(...)
time="2024-08-08T10:24:49.481Z" level=info msg="Starting dummy metrics server at localhost:9090/metrics"
(...)
time="2024-08-08T10:24:49.502Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"
panic: listen tcp :9090: bind: address already in use
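For context on how this surfaces on the charm side: in the full traceback (see Relevant Log Output below), container.replan() raises ops.pebble.ChangeError because the workload exits within a few seconds of being started. The following is a minimal sketch of that code path for a generic ops sidecar charm; the class name, event handler, and service command are illustrative assumptions and this is not the actual argo-controller charm code (which builds its layer via charmed_kubeflow_chisme components).

```python
# Illustrative sketch only -- NOT the actual argo-controller charm code.
# It shows how a Pebble replan turns a fast-exiting workload (here, the
# process panicking because :9090 is already bound) into the ChangeError
# quoted in this issue.
import ops


class ArgoControllerCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        framework.observe(
            self.on.argo_controller_pebble_ready, self._on_pebble_ready
        )

    def _on_pebble_ready(self, event: ops.PebbleReadyEvent):
        container = event.workload
        container.add_layer(
            "argo-controller",
            {
                "services": {
                    "argo-controller": {
                        "override": "replace",
                        "summary": "Argo workflow controller",
                        # Placeholder command; the real layer is assembled by
                        # charmed_kubeflow_chisme's Pebble components.
                        "command": "workflow-controller",
                        "startup": "enabled",
                    }
                }
            },
            combine=True,
        )
        try:
            # Pebble starts (or restarts) the changed service. If the process
            # dies within a few seconds -- e.g. the bind panic on :9090 --
            # Pebble reports "cannot start service: exited quickly with code 2"
            # and replan() raises ChangeError (in the real charm this is what
            # produces the "execute_components caught unhandled exception" log).
            container.replan()
        except ops.pebble.ChangeError as err:
            self.unit.status = ops.WaitingStatus(
                f"Waiting for Pebble services (argo-controller): {err.err}"
            )


if __name__ == "__main__":
    ops.main(ArgoControllerCharm)
```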
Because the error was encountered during a GitHub Actions run, I couldn't access the deployment to investigate further.
Note that this issue was not encountered during a previous run of the same GitHub action, which can be found here. It is not clear whether the issue is consistently reproducible or only intermittent.
To Reproduce
From the main page of the repository, go to Actions, select the "Create EKS cluster, deploy CKF and run bundle test" action, and run it with the following options (a scripted equivalent is sketched after the list):
- Comma-separated list of bundle versions e.g. "1.7","1.8": 1.9
- Kubernetes version to be used for the AKS cluster: Leave empty
- Branch to run the UATs from e.g. main or track/1.8: Leave empty
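For reference, the same run can in principle be triggered through the GitHub REST API (workflow_dispatch). This is only a rough sketch: the repository slug, workflow file name, and input IDs below are placeholders/assumptions; the real names are defined in the workflow YAML in the repository.

```python
# Rough sketch of triggering the workflow via the GitHub REST API.
# REPO, WORKFLOW_FILE and the input IDs are placeholders (assumptions);
# check the actual workflow definition in the repository for the real names.
import requests

TOKEN = "ghp_..."                         # token authorized to dispatch workflows
REPO = "OWNER/REPO"                       # placeholder for this repository
WORKFLOW_FILE = "deploy-to-eks.yaml"      # placeholder workflow file name

resp = requests.post(
    f"https://api.github.com/repos/{REPO}/actions/workflows/{WORKFLOW_FILE}/dispatches",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/vnd.github+json",
    },
    json={
        "ref": "main",
        "inputs": {
            # Placeholder input ID for "Comma-separated list of bundle versions";
            # the other two inputs are left at their defaults ("Leave empty").
            "bundle_versions": "1.9",
        },
    },
    timeout=30,
)
resp.raise_for_status()  # GitHub returns 204 No Content on success
```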
Environment
This job tries to deploy the UATs, using the following configuration from the dependencies.yaml file found here:
- K8S_VERSION: "1.29"
- JUJU_VERSION: "3.4"
- JUJU_VERSION_WITH_PATCH: "3.4.4"
- UATS_BRANCH: "main"
Relevant Log Output
unit-argo-controller-0: 10:24:49 ERROR unit.argo-controller/0.juju-log object-storage:8: execute_components caught unhandled exception when executing configure_charm for container:argo-controller
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-argo-controller-0/charm/venv/charmed_kubeflow_chisme/components/charm_reconciler.py", line 92, in reconcile
component_item.component.configure_charm(event)
File "/var/lib/juju/agents/unit-argo-controller-0/charm/venv/charmed_kubeflow_chisme/components/component.py", line 50, in configure_charm
self._configure_unit(event)
File "/var/lib/juju/agents/unit-argo-controller-0/charm/venv/charmed_kubeflow_chisme/components/pebble_component.py", line 131, in _configure_unit
self._update_layer()
File "/var/lib/juju/agents/unit-argo-controller-0/charm/venv/charmed_kubeflow_chisme/components/pebble_component.py", line 142, in _update_layer
container.replan()
File "/var/lib/juju/agents/unit-argo-controller-0/charm/venv/ops/model.py", line 2211, in replan
self._pebble.replan_services()
File "/var/lib/juju/agents/unit-argo-controller-0/charm/venv/ops/pebble.py", line 1993, in replan_services
return self._services_action('replan', [], timeout, delay)
File "/var/lib/juju/agents/unit-argo-controller-0/charm/venv/ops/pebble.py", line 2090, in _services_action
raise ChangeError(change.err, change)
ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "argo-controller" (cannot start service: exited quickly with code 2)
----- Logs from task 0 -----
2024-08-08T10:24:49Z INFO Most recent service output:
(...)
time="2024-08-08T10:24:49.455Z" level=info msg="not enabling pprof debug endpoints"
time="2024-08-08T10:24:49.480Z" level=info msg="Configuration:\nartifactRepository:\n archiveLogs: true\n s3:\n accessKeySecret:\n key: accesskey\n name: mlpipeline-minio-artifact\n bucket: mlpipeline\n endpoint: minio.kubeflow:9000\n insecure: true\n secretKeySecret:\n key: secretkey\n name: mlpipeline-minio-artifact\nexecutor:\n imagePullPolicy: IfNotPresent\n name: \"\"\n resources: {}\ninitialDelay: 0s\nmetricsConfig: {}\nnodeEvents: {}\npodSpecLogStrategy: {}\nsso:\n clientId:\n key: \"\"\n clientSecret:\n key: \"\"\n issuer: \"\"\n redirectUrl: \"\"\n sessionExpiry: 0s\ntelemetryConfig: {}\n"
time="2024-08-08T10:24:49.480Z" level=info msg="Persistence configuration disabled"
time="2024-08-08T10:24:49.480Z" level=info executorImage="charmedkubeflow/argoexec:3.4.16-ffcffa9" executorImagePullPolicy=IfNotPresent managedNamespace=
I0808 10:24:49.481417 14 leaderelection.go:248] attempting to acquire leader lease kubeflow/workflow-controller...
time="2024-08-08T10:24:49.481Z" level=info msg="Starting dummy metrics server at localhost:9090/metrics"
I0808 10:24:49.495949 14 leaderelection.go:258] successfully acquired lease kubeflow/workflow-controller
time="2024-08-08T10:24:49.496Z" level=info msg="new leader" leader=argo-controller
time="2024-08-08T10:24:49.499Z" level=info msg="Starting Workflow Controller" defaultRequeueTime=10s version=v3.4.16
time="2024-08-08T10:24:49.499Z" level=info msg="Current Worker Numbers" podCleanup=4 workflowTtlWorkers=4 workflowWorkers=32
time="2024-08-08T10:24:49.499Z" level=info msg="Watching task results" labelSelector="!workflows.argoproj.io/controller-instanceid,workflows.argoproj.io/workflow"
time="2024-08-08T10:24:49.499Z" level=info msg=Plugins executorPlugins=false
time="2024-08-08T10:24:49.502Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"
panic: listen tcp :9090: bind: address already in use
goroutine 100 [running]:
github.com/argoproj/argo-workflows/v3/workflow/metrics.runServer.func3()
/go/src/github.com/argoproj/argo-workflows/workflow/metrics/server.go:90 +0x12c
created by github.com/argoproj/argo-workflows/v3/workflow/metrics.runServer
/go/src/github.com/argoproj/argo-workflows/workflow/metrics/server.go:87 +0x3b7
2024-08-08T10:24:49Z ERROR cannot start service: exited quickly with code 2
-----
Additional Context
@DnPlas noted that this may be related to this issue, which was previously encountered in the argo-controller charm.
Thank you for reporting your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6111.
This message was autogenerated.