ci-tools

DPTP-4613: pod-scaler implement measured pods for accurate resource measurement

Open deepsm007 opened this issue 1 month ago • 17 comments

Implements the measured pods feature to address CPU bottleneck issues caused by node contention. Pods are automatically classified as "normal" or "measured" based on measurement freshness: a pod is "measured" when its last measurement is more than 10 days old. Measured pods run on isolated nodes via pod anti-affinity rules. Integrates BigQuery to query and cache measured pod data from the ci_operator_metrics table, and adds a data collection component that queries Prometheus for completed measured pods and writes max CPU/memory usage to BigQuery. Applies measured resource recommendations to the longest-running containers using actual measured utilization instead of Prometheus data skewed by node contention. This enables accurate resource recommendations and addresses the CPU bottleneck issue (DPTP-4613) blocking authoritative CPU mode.
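
For readers skimming the thread, the freshness rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `classifyByAge`, its signature, and `measurementMaxAge` are hypothetical names, and treating a never-measured pod as "measured" is taken from the bulleted summary quoted later in this thread.

```go
package main

import (
	"fmt"
	"time"
)

// measurementMaxAge is the 10-day freshness window from the PR description.
const measurementMaxAge = 10 * 24 * time.Hour

// classifyByAge returns "measured" when a pod needs a fresh measurement:
// either it has never been measured, or its last measurement is older than
// measurementMaxAge. Otherwise it returns "normal".
// The name and signature are illustrative, not the PR's API.
func classifyByAge(everMeasured bool, age time.Duration) string {
	if !everMeasured || age > measurementMaxAge {
		return "measured"
	}
	return "normal"
}

func main() {
	fmt.Println(classifyByAge(false, 0))              // never measured
	fmt.Println(classifyByAge(true, 3*24*time.Hour))  // measured 3 days ago
	fmt.Println(classifyByAge(true, 11*24*time.Hour)) // measured 11 days ago
}
```

A measurement that is exactly 10 days old still counts as fresh in this sketch, since the description uses a strict ">10 days" comparison.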

/cc @openshift/test-platform

deepsm007 avatar Jan 06 '26 16:01 deepsm007

Pipeline controller notification

This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

openshift-ci-robot avatar Jan 06 '26 16:01 openshift-ci-robot

@deepsm007: This pull request references DPTP-4613 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

In response to this:

  • Adds pod classification system that labels pods as "normal" or "measured" based on whether they need fresh resource measurement data (measured if last measurement >10 days ago or never measured)
  • Implements podAntiAffinity rules to ensure measured pods run on isolated nodes with no other CI workloads, allowing accurate CPU/memory utilization measurement without node contention
  • Integrates BigQuery client to query and cache max CPU/memory utilization from measured pod runs, refreshing daily to keep data current
  • Applies measured resource recommendations only to the longest-running container in each pod, using actual utilization data instead of Prometheus metrics that may be skewed by node contention
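
The last point, picking the longest-running container in a pod, could look something like the sketch below. It assumes per-container run times are already available in seconds; `containerRun` and `longestRunning` are hypothetical names introduced for illustration and are not taken from the PR.

```go
package main

import "fmt"

// containerRun is a hypothetical record of how long one container ran.
type containerRun struct {
	Name    string
	Seconds float64
}

// longestRunning returns the container with the largest run time.
// The boolean is false when the slice is empty. On a tie, the first
// container in the slice wins.
func longestRunning(runs []containerRun) (containerRun, bool) {
	if len(runs) == 0 {
		return containerRun{}, false
	}
	longest := runs[0]
	for _, r := range runs[1:] {
		if r.Seconds > longest.Seconds {
			longest = r
		}
	}
	return longest, true
}

func main() {
	runs := []containerRun{
		{Name: "sidecar", Seconds: 30},
		{Name: "test", Seconds: 5400},
		{Name: "setup", Seconds: 120},
	}
	if c, ok := longestRunning(runs); ok {
		fmt.Println("apply measured recommendation to:", c.Name)
	}
}
```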

/cc @openshift/test-platform

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Jan 06 '26 16:01 openshift-ci-robot

Walkthrough

This PR integrates measured pod support into the pod-scaler system by adding BigQuery-backed data collection and caching. It extends the admission and producer flows to classify pods, apply anti-affinity rules, and adjust resource constraints based on measured utilization. A retry mechanism is added to manifest pushing for reliability.
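
The change summary in this walkthrough mentions resource scaling with a 20% safety margin. A minimal sketch of what such a margin might look like is shown here, assuming it is applied as a simple multiplier on the measured peak and rounded up; `withSafetyMargin` and `safetyMarginPercent` are hypothetical names, and the PR may compute this differently.

```go
package main

import "fmt"

// safetyMarginPercent is the 20% headroom mentioned in the change summary.
const safetyMarginPercent = 20

// withSafetyMargin scales a non-negative measured peak (for example CPU in
// millicores or memory in bytes) by 120%, rounding up to the next integer.
// Integer arithmetic avoids floating-point rounding surprises.
func withSafetyMargin(measuredMax int64) int64 {
	return (measuredMax*(100+safetyMarginPercent) + 99) / 100
}

func main() {
	fmt.Println("cpu millicores:", withSafetyMargin(1000))
	fmt.Println("memory bytes:", withSafetyMargin(512*1024*1024))
}
```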

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Measured Pod Core Implementation: cmd/pod-scaler/measured.go, cmd/pod-scaler/measured_test.go | New module implementing measured pod data pipeline: defines MeasuredPodData, MeasuredPodCache, and BigQueryClient types; introduces ClassifyPod, AddPodAntiAffinity, and ApplyMeasuredPodResources functions; handles BigQuery querying, caching with thread-safety, pod classification by metadata, anti-affinity rule generation, and resource scaling with 20% safety margin. Test file covers ShouldBeMeasured, ClassifyPod, AddPodAntiAffinity, and cache operations. |
| Main Flow Integration: cmd/pod-scaler/main.go | Extends producerOptions struct with enableMeasuredPods, bigQueryProjectID, bigQueryDatasetID, bigQueryCredentialsFile fields. Updates mainProduce and mainAdmission to conditionally create BigQuery client, validate configuration, and pass client downstream. |
| Admission Integration: cmd/pod-scaler/admission.go | Adds bqClient parameter to admit function and podMutator struct. Augments handle flow to classify pods, apply anti-affinity, and apply measured resource constraints when BigQueryClient is present. |
| Producer Integration: cmd/pod-scaler/producer.go | Extends produce function to accept BigQueryClient parameter. Adds collectMeasuredPodMetrics, extractMeasuredPodData, queryPodMetrics, and writeMeasuredPodsToBigQuery functions. Introduces escapePromQLLabelValue helper and measuredPodData, bigQueryPodMetricsRow types. Collects per-container CPU/memory metrics and container durations, then persists to BigQuery. |
| Manifest Pushing Enhancement: pkg/manifestpusher/manifestpusher.go | Adds architecture extraction from build nodeSelector and per-build image reference logging. Replaces direct manifest-list push with retry mechanism using exponential backoff to handle race conditions where images may not be immediately available. Detects missing image errors to trigger retries; other errors fail immediately. |
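
The retry behaviour summarized for the manifest pusher (retry with exponential backoff on missing-image errors, fail immediately on anything else) can be sketched as below. This is an illustration under assumptions, not the PR's implementation: `retryOnMissingImage`, `errImageMissing`, and `errPermanent` are hypothetical names, and the real code may detect missing images differently, for example by inspecting error text.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors used only for this sketch.
var (
	errImageMissing = errors.New("image not found")
	errPermanent    = errors.New("unauthorized")
)

// retryOnMissingImage calls push up to attempts times. After each
// missing-image failure it sleeps and doubles the delay. Any other error,
// or success, is returned immediately.
func retryOnMissingImage(attempts int, initial time.Duration, push func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	delay := initial
	var err error
	for i := 0; i < attempts; i++ {
		err = push()
		if err == nil || !errors.Is(err, errImageMissing) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retryOnMissingImage(4, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errImageMissing // image not visible in the registry yet
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

Limiting retries to the missing-image case keeps genuine failures, such as authentication errors, from being masked by repeated attempts.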

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] avatar Jan 06 '26 16:01 coderabbitai[bot]

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepsm007

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Jan 06 '26 16:01 openshift-ci[bot]

/hold and wait for https://github.com/openshift/ci-tools/pull/4886 to be reviewed/merged

deepsm007 avatar Jan 07 '26 16:01 deepsm007

Scheduling required tests: /test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters: /test integration-optional-test

openshift-ci-robot avatar Jan 07 '26 16:01 openshift-ci-robot

/test images /pipeline required

deepsm007 avatar Jan 07 '26 20:01 deepsm007

Scheduling required tests: /test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters: /test integration-optional-test

openshift-ci-robot avatar Jan 07 '26 20:01 openshift-ci-robot

@deepsm007: The specified target(s) for /test were not found. The following commands are available to trigger required jobs:

/test checkconfig
/test codegen
/test e2e
/test format
/test frontend-checks
/test images
/test integration
/test integration-optional-test
/test lint
/test unit
/test validate-vendor

The following commands are available to trigger optional jobs:

/test breaking-changes
/test e2e-oo
/test security

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-ci-tools-main-breaking-changes
pull-ci-openshift-ci-tools-main-checkconfig
pull-ci-openshift-ci-tools-main-codegen
pull-ci-openshift-ci-tools-main-format
pull-ci-openshift-ci-tools-main-frontend-checks
pull-ci-openshift-ci-tools-main-images
pull-ci-openshift-ci-tools-main-integration
pull-ci-openshift-ci-tools-main-lint
pull-ci-openshift-ci-tools-main-security
pull-ci-openshift-ci-tools-main-unit
pull-ci-openshift-ci-tools-main-validate-vendor

In response to this:

/test image /pipeline required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci[bot] avatar Jan 07 '26 20:01 openshift-ci[bot]

Scheduling required tests: /test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters: /test integration-optional-test

openshift-ci-robot avatar Jan 07 '26 20:01 openshift-ci-robot

/test images

deepsm007 avatar Jan 08 '26 14:01 deepsm007

@deepsm007: This pull request references DPTP-4613 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

In response to this:

Implements measured pods feature to address CPU bottleneck issues from node contention. Pods are automatically classified as "normal" or "measured" based on measurement freshness (>10 days), with measured pods running on isolated nodes via pod anti-affinity rules. Integrates BigQuery to query and cache measured pod data from ci_operator_metrics table, and adds data collection component that queries Prometheus for completed measured pods and writes max CPU/memory usage to BigQuery. Applies measured resource recommendations to longest-running containers using actual measured utilization instead of Prometheus data skewed by node contention. This enables accurate resource recommendations and addresses the CPU bottleneck issue (DPTP-4613) blocking authoritative CPU mode.

/cc @openshift/test-platform

openshift-ci-robot avatar Jan 12 '26 21:01 openshift-ci-robot

/pipeline required

deepsm007 avatar Jan 13 '26 13:01 deepsm007

Scheduling required tests: /test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters: /test integration-optional-test

openshift-ci-robot avatar Jan 13 '26 13:01 openshift-ci-robot

Scheduling required tests: /test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters: /test integration-optional-test

openshift-ci-robot avatar Jan 13 '26 17:01 openshift-ci-robot

Scheduling required tests: /test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters: /test integration-optional-test

openshift-ci-robot avatar Jan 14 '26 16:01 openshift-ci-robot

@deepsm007: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/breaking-changes | bfc4f68ef615314e08d5486ac3d988d65d63fcd3 | link | false | /test breaking-changes |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Jan 14 '26 17:01 openshift-ci[bot]