DPTP-4613: pod-scaler implement measured pods for accurate resource measurement
Implements the measured pods feature to address CPU bottlenecks caused by node contention. Pods are automatically classified as "normal" or "measured" based on measurement freshness: a pod is "measured" when its last measurement is more than 10 days old or it has never been measured. Measured pods run on isolated nodes via pod anti-affinity rules. The change integrates BigQuery to query and cache measured pod data from the ci_operator_metrics table, and adds a data collection component that queries Prometheus for completed measured pods and writes their max CPU/memory usage to BigQuery. Measured resource recommendations are applied to the longest-running container in each pod, using actual measured utilization instead of Prometheus data skewed by node contention. This enables accurate resource recommendations and addresses the CPU bottleneck issue (DPTP-4613) blocking authoritative CPU mode.
/cc @openshift/test-platform
Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.
For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.
This repository is configured in: automatic mode
@deepsm007: This pull request references DPTP-4613 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
In response to this:
- Adds pod classification system that labels pods as "normal" or "measured" based on whether they need fresh resource measurement data (measured if last measurement >10 days ago or never measured)
- Implements podAntiAffinity rules to ensure measured pods run on isolated nodes with no other CI workloads, allowing accurate CPU/memory utilization measurement without node contention
- Integrates BigQuery client to query and cache max CPU/memory utilization from measured pod runs, refreshing daily to keep data current
- Applies measured resource recommendations only to the longest-running container in each pod, using actual utilization data instead of Prometheus metrics that may be skewed by node contention
/cc @openshift/test-platform
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough
This PR integrates measured pod support into the pod-scaler system by adding BigQuery-backed data collection and caching. It extends the admission and producer flows to classify pods, apply anti-affinity rules, and adjust resource constraints based on measured utilization. A retry mechanism is added to manifest pushing for reliability.
Changes
| Cohort / File(s) | Summary |
|---|---|
| Measured Pod Core Implementation: cmd/pod-scaler/measured.go, cmd/pod-scaler/measured_test.go | New module implementing measured pod data pipeline: defines MeasuredPodData, MeasuredPodCache, and BigQueryClient types; introduces ClassifyPod, AddPodAntiAffinity, and ApplyMeasuredPodResources functions; handles BigQuery querying, caching with thread-safety, pod classification by metadata, anti-affinity rule generation, and resource scaling with 20% safety margin. Test file covers ShouldBeMeasured, ClassifyPod, AddPodAntiAffinity, and cache operations. |
| Main Flow Integration: cmd/pod-scaler/main.go | Extends producerOptions struct with enableMeasuredPods, bigQueryProjectID, bigQueryDatasetID, bigQueryCredentialsFile fields. Updates mainProduce and mainAdmission to conditionally create BigQuery client, validate configuration, and pass client downstream. |
| Admission Integration: cmd/pod-scaler/admission.go | Adds bqClient parameter to admit function and podMutator struct. Augments handle flow to classify pods, apply anti-affinity, and apply measured resource constraints when BigQueryClient is present. |
| Producer Integration: cmd/pod-scaler/producer.go | Extends produce function to accept BigQueryClient parameter. Adds collectMeasuredPodMetrics, extractMeasuredPodData, queryPodMetrics, and writeMeasuredPodsToBigQuery functions. Introduces escapePromQLLabelValue helper and measuredPodData, bigQueryPodMetricsRow types. Collects per-container CPU/memory metrics and container durations, then persists to BigQuery. |
| Manifest Pushing Enhancement: pkg/manifestpusher/manifestpusher.go | Adds architecture extraction from build nodeSelector and per-build image reference logging. Replaces direct manifest-list push with retry mechanism using exponential backoff to handle race conditions where images may not be immediately available. Detects missing image errors to trigger retries; other errors fail immediately. |
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~50 minutes
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: deepsm007
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [deepsm007]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/hold and wait for https://github.com/openshift/ci-tools/pull/4886 to be reviewed/merged
Scheduling required tests: /test e2e
Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test integration-optional-test
/test images /pipeline required
Scheduling required tests: /test e2e
Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test integration-optional-test
@deepsm007: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:
/test checkconfig
/test codegen
/test e2e
/test format
/test frontend-checks
/test images
/test integration
/test integration-optional-test
/test lint
/test unit
/test validate-vendor
The following commands are available to trigger optional jobs:
/test breaking-changes
/test e2e-oo
/test security
Use /test all to run the following jobs that were automatically triggered:
pull-ci-openshift-ci-tools-main-breaking-changes
pull-ci-openshift-ci-tools-main-checkconfig
pull-ci-openshift-ci-tools-main-codegen
pull-ci-openshift-ci-tools-main-format
pull-ci-openshift-ci-tools-main-frontend-checks
pull-ci-openshift-ci-tools-main-images
pull-ci-openshift-ci-tools-main-integration
pull-ci-openshift-ci-tools-main-lint
pull-ci-openshift-ci-tools-main-security
pull-ci-openshift-ci-tools-main-unit
pull-ci-openshift-ci-tools-main-validate-vendor
In response to this:
/test image /pipeline required
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Scheduling required tests: /test e2e
Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test integration-optional-test
/test images
@deepsm007: This pull request references DPTP-4613 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
In response to this:
Implements measured pods feature to address CPU bottleneck issues from node contention. Pods are automatically classified as "normal" or "measured" based on measurement freshness (>10 days), with measured pods running on isolated nodes via pod anti-affinity rules. Integrates BigQuery to query and cache measured pod data from ci_operator_metrics table, and adds data collection component that queries Prometheus for completed measured pods and writes max CPU/memory usage to BigQuery. Applies measured resource recommendations to longest-running containers using actual measured utilization instead of Prometheus data skewed by node contention. This enables accurate resource recommendations and addresses the CPU bottleneck issue (DPTP-4613) blocking authoritative CPU mode.
/cc @openshift/test-platform
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/pipeline required
Scheduling required tests: /test e2e
Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test integration-optional-test
@deepsm007: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/breaking-changes | bfc4f68ef615314e08d5486ac3d988d65d63fcd3 | link | false | /test breaking-changes |
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.