training-operator Unable to Access Monitoring Port (Prometheus Metrics) on Kubeflow Trainer Controller Manager

What happened?

I am unable to access the monitoring port (Prometheus metrics) of the Kubeflow Trainer Controller Manager running from the master branch. The instructions that worked in v1.9.0 do not work in v2.0; port-forwarding fails with the error below.

Error:

$ kubectl port-forward -n kubeflow-system deployment/kubeflow-trainer-controller-manager 8080:8080
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
Handling connection for 8080
E0319 15:13:06.174141 3648479 portforward.go:424] "Unhandled Error" err="an error occurred forwarding 8080 -> 8080: error forwarding port 8080 to pod 71a61495b14b7ae8b610e860acd7bd7a0bd4beb7feb4bd66064ce245150ff339, uid : failed to execute portforward in network namespace \"/var/run/netns/cni-6a5c29c4-f90d-5175-3aeb-cdbca524d613\": failed to connect to localhost:8080 inside namespace \"71a61495b14b7ae8b610e860acd7bd7a0bd4beb7feb4bd66064ce245150ff339\", IPv4: dial tcp4 127.0.0.1:8080: connect: connection refused IPv6 dial tcp6 [::1]:8080: connect: connection refused "
error: lost connection to pod

Setup Instructions Followed:

cluster setup:

kind create cluster
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=master"
sleep 120
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=master"

env setup:

conda create --name issue python=3.11
conda activate issue
pip install git+https://github.com/kubeflow/trainer.git@master#subdirectory=sdk
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

What did you expect to happen?

Prometheus metrics should be accessible at localhost:8080/metrics.

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.32.1
Kustomize Version: v5.5.0
Server Version: v1.32.1

Kubeflow Trainer version:

$ kubectl get pods -n kubeflow-system -l app.kubernetes.io/name=trainer -o jsonpath="{.items[*].spec.containers[*].image}"
ghcr.io/kubeflow/trainer/trainer-controller-manager:latest

Kubeflow Python SDK version:

$ pip show kubeflow
Name: kubeflow
Version: 0.1.0
Summary: Kubeflow Python SDK to manage ML workloads and to interact with Kubeflow APIs.
Home-page: https://github.com/kubeflow/trainer
Author: 
Author-email: The Kubeflow Authors <[email protected]>
License: Apache License
Location: /home/izuku/miniconda3/envs/issue2/lib/python3.11/site-packages
Requires: kubernetes, pydantic
Required-by:

Mar 19 '25 10:03 izuku-sds

@tenzen-y Related to the issue above: as @milinddethe15 mentioned, metrics are disabled by default. Should I change that?
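
One way to confirm whether the metrics endpoint is configured on a running deployment is to inspect the container args. The following is a minimal sketch, not an official procedure: the function names are hypothetical, and it assumes the manager is the first container in the pod template. An empty result only suggests that the default applies.

```shell
# Print the args of the first container in the controller manager Deployment.
# Assumption: the manager is the first (index 0) container in the pod template.
show_manager_args() {
  kubectl get deployment kubeflow-trainer-controller-manager \
    -n kubeflow-system \
    -o jsonpath='{.spec.template.spec.containers[0].args}'
}

# Succeed only if a --metrics-bind-address flag appears in those args.
metrics_flag_present() {
  show_manager_args | grep -q -- '--metrics-bind-address'
}
```

Usage: `metrics_flag_present && echo "metrics flag set" || echo "metrics flag not set"`.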

Mar 20 '25 13:03 izuku-sds

Adding the following args to the deployment YAML gave access to the monitoring port via port-forwarding.

args:
- "--metrics-bind-address=:8080"
- "--metrics-secure=false"
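
For a cluster that is already running, the same two args could be appended without editing the manifest by hand. This is a sketch under stated assumptions: the function name is hypothetical, the manager is assumed to be container index 0, and the container is assumed to already have an `args` list (the JSON patch `add` operation on `/args/-` fails otherwise).

```shell
# Append the two metrics flags to the first container's args via a JSON patch.
# Assumptions: the manager is container index 0 and already has an "args" list.
enable_plain_metrics() {
  kubectl patch deployment kubeflow-trainer-controller-manager \
    -n kubeflow-system \
    --type=json \
    -p '[
      {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--metrics-bind-address=:8080"},
      {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--metrics-secure=false"}
    ]'
}
```

After calling `enable_plain_metrics`, `kubectl rollout status deployment/kubeflow-trainer-controller-manager -n kubeflow-system` can be used to wait for the new pod. Re-applying the upstream manifests would likely overwrite this change.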

port forwarding:

$ kubectl port-forward -n kubeflow-system deployment/kubeflow-trainer-controller-manager 8080:8080
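
With the port-forward running in another terminal, the endpoint can be checked from the local machine. The helper names below are hypothetical, and the sketch assumes the endpoint serves the standard Prometheus text exposition format over plain HTTP, as implied by `--metrics-secure=false`.

```shell
# Fetch the metrics endpoint exposed by the port-forward.
# Assumption: the port-forward above is still running in another terminal.
fetch_metrics() {
  curl -sf http://localhost:8080/metrics
}

# Count metric families by counting "# TYPE" lines in the text read from stdin.
count_metric_families() {
  grep -c '^# TYPE '
}
```

Usage: `fetch_metrics | count_metric_families` prints the number of metric families; a connection failure makes `fetch_metrics` exit non-zero.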

Mar 21 '25 20:03 izuku-sds

@Electronic-Waste: The label(s) /remove-label lifecyle/needs-triage cannot be applied. These labels are supported: tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, lifecycle/needs-triage. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/remove-label lifecyle/needs-triage

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Apr 09 '25 13:04 google-oss-prow[bot]

/remove-label lifecycle/needs-triage

Apr 09 '25 13:04 Electronic-Waste

/area monitoring

Apr 09 '25 13:04 Electronic-Waste

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Jul 08 '25 15:07 github-actions[bot]

/remove-lifecycle stale
/good-first-issue

Jul 08 '25 15:07 andreyvelich

@andreyvelich: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to this:

/remove-lifecycle stale
/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Jul 08 '25 15:07 google-oss-prow[bot]

I would like to work on this issue, so I am assigning it to myself.

/assign

Sep 29 '25 07:09 ChughShilpa