Pebble cannot start `argo-controller` service: localhost:9090 address already in use
Bug Description
This issue was encountered in the deploy-cfk-to-eks (1.9) action in this repository. The full logs can be found here.
This is the status of the kubeflow model:
Model Controller Cloud/Region Version SLA Timestamp
kubeflow kubeflow-controller eks/eu-central-1 3.4.4 unsupported 10:49:00Z
App Version Status Scale Charm Channel Rev Address Exposed Message
admission-webhook active 1 admission-webhook 1.9/stable 344 10.100.141.48 no
argo-controller waiting 1 argo-controller 3.4/stable 545 10.100.65.26 no installing agent
dex-auth active 1 dex-auth 2.39/stable 548 10.100.159.91 no
envoy active 1 envoy 2.2/stable 263 10.100.179.113 no
istio-ingressgateway active 1 istio-gateway 1.22/stable 1127 10.100.37.243 no
istio-pilot active 1 istio-pilot 1.22/stable 1079 10.100.213.221 no
jupyter-controller active 1 jupyter-controller 1.9/stable 1038 10.100.141.89 no
jupyter-ui active 1 jupyter-ui 1.9/stable 961 10.100.101.218 no
katib-controller active 1 katib-controller 0.17/stable 750 10.100.83.209 no
katib-db 8.0.36-0ubuntu0.22.04.1 active 1 mysql-k8s 8.0/stable 153 10.100.140.12 no
katib-db-manager active 1 katib-db-manager 0.17/stable 713 10.100.169.0 no
katib-ui active 1 katib-ui 0.17/stable 713 10.100.229.177 no
kfp-api active 1 kfp-api 2.2/stable 1552 10.100.61.32 no
kfp-db 8.0.36-0ubuntu0.22.04.1 active 1 mysql-k8s 8.0/stable 153 10.100.157.254 no
kfp-metadata-writer active 1 kfp-metadata-writer 2.2/stable 617 10.100.132.118 no
kfp-persistence active 1 kfp-persistence 2.2/stable 1560 10.100.75.21 no
kfp-profile-controller active 1 kfp-profile-controller 2.2/stable 1518 10.100.167.176 no
kfp-schedwf waiting 1 kfp-schedwf 2.2/stable 1571 10.100.12.101 no installing agent
kfp-ui active 1 kfp-ui 2.2/stable 1555 10.100.247.192 no
kfp-viewer active 1 kfp-viewer 2.2/stable 1586 10.100.90.122 no
kfp-viz active 1 kfp-viz 2.2/stable 1504 10.100.233.24 no
knative-eventing active 1 knative-eventing 1.12/stable 459 10.100.254.114 no
knative-operator active 1 knative-operator 1.12/stable 433 10.100.125.79 no
knative-serving active 1 knative-serving 1.12/stable 460 10.100.111.156 no
kserve-controller active 1 kserve-controller 0.13/stable 597 10.100.90.248 no
kubeflow-dashboard active 1 kubeflow-dashboard 1.9/stable 659 10.100.70.77 no
kubeflow-profiles active 1 kubeflow-profiles 1.9/stable 419 10.100.80.72 no
kubeflow-roles active 1 kubeflow-roles 1.9/stable 240 10.100.207.159 no
kubeflow-volumes active 1 kubeflow-volumes 1.9/stable 348 10.100.237.212 no
metacontroller-operator active 1 metacontroller-operator 3.0/stable 311 10.100.106.16 no
minio res:oci-image@5102166 active 1 minio ckf-1.9/stable 347 10.100.95.94 no
mlmd active 1 mlmd ckf-1.9/stable 213 10.100.23.120 no
oidc-gatekeeper active 1 oidc-gatekeeper ckf-1.9/stable 423 10.100.195.15 no
pvcviewer-operator active 1 pvcviewer-operator 1.9/stable 157 10.100.89.110 no
tensorboard-controller active 1 tensorboard-controller 1.9/stable 333 10.100.41.128 no
tensorboards-web-app active 1 tensorboards-web-app 1.9/stable 321 10.100.132.117 no
training-operator active 1 training-operator 1.8/stable 503 10.100.196.94 no
Unit Workload Agent Address Ports Message
admission-webhook/0* active idle 192.168.35.115
argo-controller/0* waiting idle 192.168.48.34 [container:argo-controller] Waiting for Pebble services (argo-controller). If this persists, it could be a blocking ...
dex-auth/0* active idle 192.168.16.108
envoy/0* active idle 192.168.32.159
istio-ingressgateway/0* active idle 192.168.53.251
istio-pilot/0* active idle 192.168.55.177
jupyter-controller/0* active idle 192.168.7.169
jupyter-ui/0* active idle 192.168.52.237
katib-controller/0* active idle 192.168.5.97
katib-db-manager/0* active idle 192.168.63.216
katib-db/0* active idle 192.168.29.5 Primary
katib-ui/0* active idle 192.168.40.53
kfp-api/0* active idle 192.168.47.139
kfp-db/0* active idle 192.168.44.231 Primary
kfp-metadata-writer/0* active idle 192.168.5.186
kfp-persistence/0* active idle 192.168.31.174
kfp-profile-controller/0* active idle 192.168.4.69
kfp-schedwf/0* active idle 192.168.26.199
kfp-ui/0* active idle 192.168.63.63
kfp-viewer/0* active idle 192.168.10.61
kfp-viz/0* active idle 192.168.6.70
knative-eventing/0* active idle 192.168.11.160
knative-operator/0* active idle 192.168.11.169
knative-serving/0* active idle 192.168.4.247
kserve-controller/0* active idle 192.168.29.105
kubeflow-dashboard/0* active idle 192.168.18.73
kubeflow-profiles/0* active idle 192.168.37.127
kubeflow-roles/0* active idle 192.168.37.187
kubeflow-volumes/0* active idle 192.168.50.26
metacontroller-operator/0* active idle 192.168.63.162
minio/0* active idle 192.168.61.122 9000-9001/TCP
mlmd/0* active idle 192.168.41.252
oidc-gatekeeper/0* active idle 192.168.62.78
pvcviewer-operator/0* active idle 192.168.36.84
tensorboard-controller/0* active idle 192.168.24.115
tensorboards-web-app/0* active idle 192.168.10.10
training-operator/0* active idle 192.168.51.141
Note that all charms are active except argo-controller and kfp-schedwf. Since the kfp-schedwf/0 unit itself is active, the kfp-schedwf application's waiting status is most likely transient.
The issue seems to be that Pebble cannot start the argo-controller service because two processes are trying to bind the same port (9090) inside the workload container. Here are the key lines from the logs (the full traceback and service output are reproduced in the Relevant Log Output section below):
ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "argo-controller" (cannot start service: exited quickly with code 2)
(...)
time="2024-08-08T10:24:49.481Z" level=info msg="Starting dummy metrics server at localhost:9090/metrics"
(...)
time="2024-08-08T10:24:49.502Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"
panic: listen tcp :9090: bind: address already in use
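For context on how this surfaces on the charm side: in the full traceback (see Relevant Log Output below), container.replan() raises ops.pebble.ChangeError because the workload exits within a few seconds of being started. The following is a minimal sketch of that code path for a generic ops sidecar charm; the class name, event handler, and service command are illustrative assumptions and this is not the actual argo-controller charm code (which builds its layer via charmed_kubeflow_chisme components).

```python
# Illustrative sketch only -- NOT the actual argo-controller charm code.
# It shows how a Pebble replan turns a fast-exiting workload (here, the
# process panicking because :9090 is already bound) into the ChangeError
# quoted in this issue.
import ops


class ArgoControllerCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        framework.observe(
            self.on.argo_controller_pebble_ready, self._on_pebble_ready
        )

    def _on_pebble_ready(self, event: ops.PebbleReadyEvent):
        container = event.workload
        container.add_layer(
            "argo-controller",
            {
                "services": {
                    "argo-controller": {
                        "override": "replace",
                        "summary": "Argo workflow controller",
                        # Placeholder command; the real layer is assembled by
                        # charmed_kubeflow_chisme's Pebble components.
                        "command": "workflow-controller",
                        "startup": "enabled",
                    }
                }
            },
            combine=True,
        )
        try:
            # Pebble starts (or restarts) the changed service. If the process
            # dies within a few seconds -- e.g. the bind panic on :9090 --
            # Pebble reports "cannot start service: exited quickly with code 2"
            # and replan() raises ChangeError (in the real charm this is what
            # produces the "execute_components caught unhandled exception" log).
            container.replan()
        except ops.pebble.ChangeError as err:
            self.unit.status = ops.WaitingStatus(
                f"Waiting for Pebble services (argo-controller): {err.err}"
            )


if __name__ == "__main__":
    ops.main(ArgoControllerCharm)
```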
Because the error was encountered during a GitHub Actions run, I couldn't access the deployment to investigate further.
Note that this issue was not encountered during a previous run of the same GitHub action, which can be found here. It is not clear whether the issue is consistently reproducible or only intermittent.
To Reproduce
From the main page of the repository, go to Actions, select the "Create EKS cluster, deploy CKF and run bundle test" action, and run it with the following options (a scripted equivalent is sketched after the list):
- Comma-separated list of bundle versions e.g. "1.7","1.8": 1.9
- Kubernetes version to be used for the AKS cluster: Leave empty
- Branch to run the UATs from e.g. main or track/1.8: Leave empty
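For reference, the same run can in principle be triggered through the GitHub REST API (workflow_dispatch). This is only a rough sketch: the repository slug, workflow file name, and input IDs below are placeholders/assumptions; the real names are defined in the workflow YAML in the repository.

```python
# Rough sketch of triggering the workflow via the GitHub REST API.
# REPO, WORKFLOW_FILE and the input IDs are placeholders (assumptions);
# check the actual workflow definition in the repository for the real names.
import requests

TOKEN = "ghp_..."                         # token authorized to dispatch workflows
REPO = "OWNER/REPO"                       # placeholder for this repository
WORKFLOW_FILE = "deploy-to-eks.yaml"      # placeholder workflow file name

resp = requests.post(
    f"https://api.github.com/repos/{REPO}/actions/workflows/{WORKFLOW_FILE}/dispatches",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/vnd.github+json",
    },
    json={
        "ref": "main",
        "inputs": {
            # Placeholder input ID for "Comma-separated list of bundle versions";
            # the other two inputs are left at their defaults ("Leave empty").
            "bundle_versions": "1.9",
        },
    },
    timeout=30,
)
resp.raise_for_status()  # GitHub returns 204 No Content on success
```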
Environment
This job tries to deploy the UATs, using the following configuration from the dependencies.yaml file found here:
- K8S_VERSION: "1.29"
- JUJU_VERSION: "3.4"
- JUJU_VERSION_WITH_PATCH: "3.4.4"
- UATS_BRANCH: "main"
Relevant Log Output
unit-argo-controller-0: 10:24:49 ERROR unit.argo-controller/0.juju-log object-storage:8: execute_components caught unhandled exception when executing configure_charm for container:argo-controller
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-argo-controller-0/charm/venv/charmed_kubeflow_chisme/components/charm_reconciler.py", line 92, in reconcile
component_item.component.configure_charm(event)
File "/var/lib/juju/agents/unit-argo-controller-0/charm/venv/charmed_kubeflow_chisme/components/component.py", line 50, in configure_charm
self._configure_unit(event)
File "/var/lib/juju/agents/unit-argo-controller-0/charm/venv/charmed_kubeflow_chisme/components/pebble_component.py", line 131, in _configure_unit
self._update_layer()
File "/var/lib/juju/agents/unit-argo-controller-0/charm/venv/charmed_kubeflow_chisme/components/pebble_component.py", line 142, in _update_layer
container.replan()
File "/var/lib/juju/agents/unit-argo-controller-0/charm/venv/ops/model.py", line 2211, in replan
self._pebble.replan_services()
File "/var/lib/juju/agents/unit-argo-controller-0/charm/venv/ops/pebble.py", line 1993, in replan_services
return self._services_action('replan', [], timeout, delay)
File "/var/lib/juju/agents/unit-argo-controller-0/charm/venv/ops/pebble.py", line 2090, in _services_action
raise ChangeError(change.err, change)
ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "argo-controller" (cannot start service: exited quickly with code 2)
----- Logs from task 0 -----
2024-08-08T10:24:49Z INFO Most recent service output:
(...)
time="2024-08-08T10:24:49.455Z" level=info msg="not enabling pprof debug endpoints"
time="2024-08-08T10:24:49.480Z" level=info msg="Configuration:\nartifactRepository:\n archiveLogs: true\n s3:\n accessKeySecret:\n key: accesskey\n name: mlpipeline-minio-artifact\n bucket: mlpipeline\n endpoint: minio.kubeflow:9000\n insecure: true\n secretKeySecret:\n key: secretkey\n name: mlpipeline-minio-artifact\nexecutor:\n imagePullPolicy: IfNotPresent\n name: \"\"\n resources: {}\ninitialDelay: 0s\nmetricsConfig: {}\nnodeEvents: {}\npodSpecLogStrategy: {}\nsso:\n clientId:\n key: \"\"\n clientSecret:\n key: \"\"\n issuer: \"\"\n redirectUrl: \"\"\n sessionExpiry: 0s\ntelemetryConfig: {}\n"
time="2024-08-08T10:24:49.480Z" level=info msg="Persistence configuration disabled"
time="2024-08-08T10:24:49.480Z" level=info executorImage="charmedkubeflow/argoexec:3.4.16-ffcffa9" executorImagePullPolicy=IfNotPresent managedNamespace=
I0808 10:24:49.481417 14 leaderelection.go:248] attempting to acquire leader lease kubeflow/workflow-controller...
time="2024-08-08T10:24:49.481Z" level=info msg="Starting dummy metrics server at localhost:9090/metrics"
I0808 10:24:49.495949 14 leaderelection.go:258] successfully acquired lease kubeflow/workflow-controller
time="2024-08-08T10:24:49.496Z" level=info msg="new leader" leader=argo-controller
time="2024-08-08T10:24:49.499Z" level=info msg="Starting Workflow Controller" defaultRequeueTime=10s version=v3.4.16
time="2024-08-08T10:24:49.499Z" level=info msg="Current Worker Numbers" podCleanup=4 workflowTtlWorkers=4 workflowWorkers=32
time="2024-08-08T10:24:49.499Z" level=info msg="Watching task results" labelSelector="!workflows.argoproj.io/controller-instanceid,workflows.argoproj.io/workflow"
time="2024-08-08T10:24:49.499Z" level=info msg=Plugins executorPlugins=false
time="2024-08-08T10:24:49.502Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"
panic: listen tcp :9090: bind: address already in use
goroutine 100 [running]:
github.com/argoproj/argo-workflows/v3/workflow/metrics.runServer.func3()
/go/src/github.com/argoproj/argo-workflows/workflow/metrics/server.go:90 +0x12c
created by github.com/argoproj/argo-workflows/v3/workflow/metrics.runServer
/go/src/github.com/argoproj/argo-workflows/workflow/metrics/server.go:87 +0x3b7
2024-08-08T10:24:49Z ERROR cannot start service: exited quickly with code 2
-----
Additional Context
@DnPlas noted that this may be related to this issue, which was previously encountered in the argo-controller charm.
Thank you for reporting your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6111.
This message was autogenerated.