Unable to Access Monitoring Port (Prometheus Metrics) on Kubeflow Trainer Controller Manager
What happened?
I'm unable to access the monitoring port (Prometheus metrics) of the Kubeflow Trainer Controller Manager running from the master branch. The instructions that worked in v1.9.0 no longer work in v2.0; port-forwarding fails with the error below.
Error:
$ kubectl port-forward -n kubeflow-system deployment/kubeflow-trainer-controller-manager 8080:8080
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
Handling connection for 8080
E0319 15:13:06.174141 3648479 portforward.go:424] "Unhandled Error" err="an error occurred forwarding 8080 -> 8080: error forwarding port 8080 to pod 71a61495b14b7ae8b610e860acd7bd7a0bd4beb7feb4bd66064ce245150ff339, uid : failed to execute portforward in network namespace \"/var/run/netns/cni-6a5c29c4-f90d-5175-3aeb-cdbca524d613\": failed to connect to localhost:8080 inside namespace \"71a61495b14b7ae8b610e860acd7bd7a0bd4beb7feb4bd66064ce245150ff339\", IPv4: dial tcp4 127.0.0.1:8080: connect: connection refused IPv6 dial tcp6 [::1]:8080: connect: connection refused "
error: lost connection to pod
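The "connection refused" part of this error indicates that the port-forward reached the pod, but nothing was listening on port 8080 inside it. One way to narrow this down is to look at the arguments the manager container was started with. The following is a minimal sketch, assuming the deployment name and namespace from the command above; `has_metrics_flag` and `check_manager_args` are hypothetical helper names, and the flag name is the one mentioned later in this thread. An absent flag is only a hint, since the setting could also come from somewhere other than the container args.

```shell
# Hypothetical helper: succeeds if the text on stdin contains a
# --metrics-bind-address flag (flag name taken from this thread).
has_metrics_flag() {
  grep -q -- '--metrics-bind-address'
}

# Print the manager container args and report whether the flag is present.
# Assumes the deployment name and namespace used in this issue.
check_manager_args() {
  args="$(kubectl get deployment kubeflow-trainer-controller-manager \
    -n kubeflow-system \
    -o jsonpath='{.spec.template.spec.containers[*].args}')"
  echo "container args: ${args}"
  if printf '%s\n' "$args" | has_metrics_flag; then
    echo "metrics bind address flag is set"
  else
    echo "no --metrics-bind-address flag found"
  fi
}
```

Calling `check_manager_args` against the kind cluster prints the current args and a one-line summary.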
Setup instructions followed:
Cluster setup:
kind create cluster
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=master"
sleep 120
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=master"
env setup:
conda create --name issue python=3.11
conda activate issue
pip install git+https://github.com/kubeflow/trainer.git@master#subdirectory=sdk
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
What did you expect to happen?
Prometheus metrics should be accessible at localhost:8080/metrics.
Environment
Kubernetes version:
$ kubectl version
Client Version: v1.32.1
Kustomize Version: v5.5.0
Server Version: v1.32.1
Kubeflow Trainer version:
$ kubectl get pods -n kubeflow-system -l app.kubernetes.io/name=trainer -o jsonpath="{.items[*].spec.containers[*].image}"
ghcr.io/kubeflow/trainer/trainer-controller-manager:latest
Kubeflow Python SDK version:
$ pip show kubeflow
Name: kubeflow
Version: 0.1.0
Summary: Kubeflow Python SDK to manage ML workloads and to interact with Kubeflow APIs.
Home-page: https://github.com/kubeflow/trainer
Author:
Author-email: The Kubeflow Authors <[email protected]>
License: Apache License
Location: /home/izuku/miniconda3/envs/issue2/lib/python3.11/site-packages
Requires: kubernetes, pydantic
Required-by:
Impacted by this bug?
Give it a 👍 We prioritize the issues with most 👍
@tenzen-y, regarding the issue above: as @milinddethe15 mentioned, metrics are disabled by default. Should I change that?
Adding the following args to the deployment YAML gave me access to the monitoring port via port-forwarding.
args:
- "--metrics-bind-address=:8080"
- "--metrics-secure=false"
port forwarding:
$ kubectl port-forward -n kubeflow-system deployment/kubeflow-trainer-controller-manager 8080:8080
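With the port-forward running, the endpoint can be checked from a second terminal. Below is a rough sketch, assuming `curl` is installed and that metrics are served over plain HTTP on port 8080, as with the args above. `metric_names` and `list_forwarded_metrics` are hypothetical helper names; the first lists metric names from the `# TYPE` lines of the Prometheus text format.

```shell
# Hypothetical helper: read Prometheus text-format output on stdin and
# print the metric name from each "# TYPE <name> <type>" line.
metric_names() {
  awk '$1 == "#" && $2 == "TYPE" { print $3 }'
}

# Fetch the metrics page through the port-forward and list what is exposed.
# Assumes plain HTTP on localhost:8080 (--metrics-secure=false).
list_forwarded_metrics() {
  curl -sf http://localhost:8080/metrics | metric_names
}
```

If `list_forwarded_metrics` prints nothing, running `curl -v http://localhost:8080/metrics` directly shows whether the connection is still being refused.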
@Electronic-Waste: The label(s) /remove-label lifecyle/needs-triage cannot be applied. These labels are supported: tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, lifecycle/needs-triage. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?
In response to this:
/remove-label lifecyle/needs-triage
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-label lifecycle/needs-triage
/area monitoring
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale /good-first-issue
@andreyvelich: This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.
I'd like to pick up this issue, so I'm assigning it to myself. /assign