Inconsistent results from datagrepper API for specific version queries
### Description of the issue
I'm experiencing unstable and inconsistent results when querying the datagrepper API for Windows Machine Config Operator (WMCO) index images. Specifically:
- Queries for some versions (e.g., 4.18) work sometimes but fail other times
- Queries for newer versions (e.g., 4.19) consistently return no results
- The inconsistency breaks automation that relies on this data to dynamically discover and mirror WMCO index images, especially for disconnected or Konflux-driven deployments
### Reproduction steps
I'm using the following shell function to query for WMCO index images for specific OpenShift versions:
```bash
get_latest_wmco_index_image() {
  # OCP version to look up; defaults to "4.18", also tried with "4.19"
  local version="${1:-4.18}"
  local ocp_tag="release-${version//./-}"
  local start end
  start=$(date -d '60 days ago' +%s)   # GNU date syntax
  end=$(date +%s)
  curl -s "https://datagrepper.engineering.redhat.com/raw?topic=/topic/VirtualTopic.eng.iib.build.state&contains=windows-machine-conf-tenant&start=${start}&end=${end}" \
    | jq -r --arg tag "$ocp_tag" '
        .raw_messages[]
        | select(.msg.index_image_resolved != null and .msg.state == "complete")
        # treat a missing fbc_fragment as "" so test() cannot abort the filter
        | select((.msg.fbc_fragment // "") | test($tag))
        | .msg.index_image_resolved' \
    | head -n1
}
```
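The function above reads only the first page of the response. Upstream datagrepper paginates `/raw` results (`rows_per_page`, `page`, and a `pages` field in the response), so a matching build may sit on a later page. The following is a minimal sketch that walks every page, assuming this deployment follows the upstream pagination behavior, which I have not verified. `fetch_page` and `latest_index_image_all_pages` are hypothetical helper names, and the tag match is anchored so that `release-4-1` cannot match `release-4-18`.

```bash
#!/usr/bin/env bash
# Sketch: walk every page of a datagrepper /raw query instead of only the first.
# Assumes upstream datagrepper pagination (rows_per_page, page, "pages" field).

DATAGREPPER_URL="${DATAGREPPER_URL:-https://datagrepper.engineering.redhat.com/raw}"

# fetch_page PAGE START END: print one page of raw JSON (hypothetical helper).
fetch_page() {
  curl -fsS -G "$DATAGREPPER_URL" \
    --data-urlencode "topic=/topic/VirtualTopic.eng.iib.build.state" \
    --data-urlencode "contains=windows-machine-conf-tenant" \
    --data-urlencode "start=$2" \
    --data-urlencode "end=$3" \
    --data-urlencode "rows_per_page=100" \
    --data-urlencode "order=desc" \
    --data-urlencode "page=$1"
}

# latest_index_image_all_pages VERSION: print the newest matching index image.
# Returns 1 if a request fails and 2 if no page contains a match.
latest_index_image_all_pages() {
  local version="$1" tag page=1 pages=1 body image start end
  tag="release-${version//./-}"
  end=$(date +%s)
  start=$(( end - 60 * 24 * 3600 ))   # 60 days back, without GNU-only date -d
  while (( page <= pages )); do
    body=$(fetch_page "$page" "$start" "$end") || return 1
    pages=$(jq -r '.pages // 1' <<<"$body")
    image=$(jq -r --arg tag "$tag" '
      first(.raw_messages[]?
        | select(.msg.state == "complete" and .msg.index_image_resolved != null)
        | select((.msg.fbc_fragment // "") | test($tag + "(@|:|$)"))
        | .msg.index_image_resolved)' <<<"$body")
    if [[ -n "$image" ]]; then
      printf '%s\n' "$image"
      return 0
    fi
    page=$(( page + 1 ))
  done
  return 2
}
```

Calling `latest_index_image_all_pages 4.19` and checking the exit status separates "the request failed" (1) from "no matching message in the whole time range" (2).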
### Expected behavior
The function should consistently return the latest index image for both OCP 4.18 and 4.19 versions if they exist.
### Actual behavior
- For version 4.18: Currently returns a result (`registry-proxy.engineering.redhat.com/rh-osbs/iib@sha256:752977b3d70ef3a1669c2cf9e4a466b18db5395615b3bb9ad01de60b6a0cbbb5`), but previously on April 8th, it was returning no results despite valid builds existing.
- For version 4.19: Consistently returns no results, even when increasing time ranges.
### API response analysis
To debug this issue, I created diagnostic commands that save raw API outputs and check for matching entries:
```bash
# Output for OCP 4.18
get_wmco_api_output 4.18
Querying for OCP version 4.18 (tag: release-4-18)
Time range: 1739977629 to 1745161629
Raw API output saved to: ./datagrepper_debug/datagrepper_wmco_4.18_raw.json
Filtered results saved to: ./datagrepper_debug/datagrepper_wmco_4.18_filtered.json
Number of matching entries: 1
Index images found:
registry-proxy.engineering.redhat.com/rh-osbs/iib@sha256:752977b3d70ef3a1669c2cf9e4a466b18db5395615b3bb9ad01de60b6a0cbbb5
# Output for OCP 4.19
get_wmco_api_output 4.19
Querying for OCP version 4.19 (tag: release-4-19)
Time range: 1739977680 to 1745161680
Raw API output saved to: ./datagrepper_debug/datagrepper_wmco_4.19_raw.json
Filtered results saved to: ./datagrepper_debug/datagrepper_wmco_4.19_filtered.json
Number of matching entries: 0
Index images found:
```
Additionally, I checked all available fbc_fragments in the API responses:
```bash
check_fbc_fragments
Examining all fbc_fragments in time range: 1739977724 to 1745161724
All fbc_fragments saved to: ./datagrepper_debug/datagrepper_wmco_all_fragments.json
Unique fragment patterns found:
quay.io/redhat-user-workloads/windows-machine-conf-tenant/windows-machine-config-operator-fbc/windows-machine-config-operator-fbc-master@sha256:7b06c97490346b03660b4458abcffe8f9695d48cffbe959833f8105158e5af3f
quay.io/redhat-user-workloads/windows-machine-conf-tenant/windows-machine-config-operator-fbc/windows-machine-config-operator-fbc-release-4-12@sha256:e62d4db6582d32dc628d138449aace9e996fa5b7244fa282ac42ae2fc5562f45
quay.io/redhat-user-workloads/windows-machine-conf-tenant/windows-machine-config-operator-fbc/windows-machine-config-operator-fbc-release-4-14@sha256:550ea14b4bf0a0694d42d1bb1736ae354c701eb7c03c4e08fdf93b0202281528
quay.io/redhat-user-workloads/windows-machine-conf-tenant/windows-machine-config-operator-fbc/windows-machine-config-operator-fbc-release-4-15@sha256:3216a0e3058c19366eb74554922974b2b02bdb294feb6033498ee9306791526b
quay.io/redhat-user-workloads/windows-machine-conf-tenant/windows-machine-config-operator-fbc/windows-machine-config-operator-fbc-release-4-16@sha256:c649188e9e556a49415e55cf127b4eeade6db8b040cc4536f241a4806fde7f09
quay.io/redhat-user-workloads/windows-machine-conf-tenant/windows-machine-config-operator-fbc/windows-machine-config-operator-fbc-release-4-17@sha256:c283f2f7990971446ab6f7c5aa3661577e1aadcd31752f864fba98af652a6c4e
quay.io/redhat-user-workloads/windows-machine-conf-tenant/windows-machine-config-operator-fbc/windows-machine-config-operator-fbc-release-4-18@sha256:2c78a510ccedc5538aaf8c245ceb52bbdfa4397d8a0a7e28b7b672ad92d3124e
```
This shows that the responses contain fragments for `master`, 4.12, and 4.14 through 4.18, but none for `release-4-19`, suggesting the issue might be related to one of the following:
- Missing builds for 4.19
- Data not being properly indexed in datagrepper
- Inconsistent data publication to the messaging bus
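One way to tell missing builds apart from a truncated response is to look at the pagination metadata in the saved raw output. The following is a minimal sketch, assuming this deployment returns the `total` and `pages` fields that upstream datagrepper includes in `/raw` responses. `summarize_response` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# summarize_response FILE: report whether a saved /raw response covers all
# matching messages or only the first of several pages.
# Assumes the upstream datagrepper response fields "total" and "pages".
summarize_response() {
  jq -r '
    (.raw_messages | length) as $returned
    | (.total // "unknown") as $total
    | (.pages // 1) as $pages
    | "returned=\($returned) total=\($total) pages=\($pages)",
      (if $pages > 1
       then "truncated: \($pages) pages exist, this file holds only one"
       else "complete: all matching messages are in this file" end)
  ' "$1"
}
```

For example, `summarize_response ./datagrepper_debug/datagrepper_wmco_4.19_raw.json` would show whether the 4.19 query inspected every message in the time range. If it reports a single page and still no `release-4-19` fragment, pagination can be ruled out on the client side.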
### Related Issues
This appears to be similar to a previously reported issue: [KFLUXSPRT-2641](https://issues.redhat.com/browse/KFLUXSPRT-2641), where queries filtered by the expected fbc_fragment tag format were returning empty results despite valid image builds existing.
### Impact
This inconsistency breaks automation workflows that rely on datagrepper to programmatically discover and mirror the latest WMCO index images for different OpenShift versions, especially for disconnected or Konflux-driven deployments.
### Business Impact
This inconsistency in the datagrepper API has several critical impacts:
- Automation Reliability: Our CI/CD pipelines depend on being able to consistently discover the latest WMCO index images for different OpenShift versions.
- Disconnected Environments: For customers in air-gapped or disconnected environments, our tools need to pre-mirror these images, which requires reliable discovery.
- Risk of Missing Updates: When the API fails to return results for certain versions (like 4.19), environments may miss critical security or feature updates.
- Engineering Time: The intermittent nature of the issue leads to significant debugging time and workarounds that could be better spent on feature development.
### Previous Workarounds
We've attempted several workarounds including:
- Extending the time range (up to 90 days) for the API query
- Implementing retry logic with exponential backoff
- Falling back to manually identified images when automation fails
None of these approaches provide a sustainable solution for reliable automation.
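For reference, the retry workaround has roughly the following shape. This is a simplified sketch, not the exact code from our automation, and `retry_with_backoff` is a hypothetical name. It only helps when the wrapped command exits non-zero, so a query that succeeds but returns an empty result is never retried unless the caller turns an empty result into a failure.

```bash
#!/usr/bin/env bash
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, doubling the delay after each failure.
# Returns 0 on success and 1 once MAX_ATTEMPTS attempts have failed.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

With a base delay of 2 seconds and 5 attempts, the waits between attempts are 2, 4, 8, and 16 seconds.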
### Proposed Solutions
Potential fixes might include:
- Ensuring consistent fbc_fragment formatting across all versions
- Adding better error reporting to the API when no results are found
- Creating a more direct API endpoint specifically for discovering the latest index images by version
### Environment
- OS: Linux
- Shell: /usr/bin/zsh (GNU bash, version 4.4.20(1)-release (x86_64-redhat-linux-gnu))
- curl version: curl 7.61.1 (x86_64-redhat-linux-gnu) libcurl/7.61.1 OpenSSL/1.1.1k zlib/1.2.11 brotli/1.0.6 libidn2/2.2.0 libpsl/0.20.2 (+libidn2/2.2.0) libssh/0.9.6/openssl/zlib nghttp2/1.33.0
- jq version: jq-1.6
### Additional context
I've tried different time ranges beyond the default 60 days but still encounter the same issues. These images are critical for automating Windows node provisioning in OpenShift environments.
[datagrepper_wmco_4.18_filtered.json](https://github.com/user-attachments/files/19825870/datagrepper_wmco_4.18_filtered.json)
[datagrepper_wmco_4.18_raw.json](https://github.com/user-attachments/files/19825873/datagrepper_wmco_4.18_raw.json)
[datagrepper_wmco_4.19_filtered.json](https://github.com/user-attachments/files/19825871/datagrepper_wmco_4.19_filtered.json)
[datagrepper_wmco_4.19_raw.json](https://github.com/user-attachments/files/19825872/datagrepper_wmco_4.19_raw.json)
[datagrepper_wmco_all_fragments.json](https://github.com/user-attachments/files/19825869/datagrepper_wmco_all_fragments.json)
[datagrepper_wmco_4.18_raw_redacted.json](https://github.com/user-attachments/files/19825887/datagrepper_wmco_4.18_raw_redacted.json)
[datagrepper_wmco_4.19_raw_redacted.json](https://github.com/user-attachments/files/19825889/datagrepper_wmco_4.19_raw_redacted.json)
[datagrepper_wmco_all_fragments_redacted.json](https://github.com/user-attachments/files/19825888/datagrepper_wmco_all_fragments_redacted.json)
Hey! Unfortunately I'm not aware of this datagrepper deployment. Could you please contact the person in Red Hat that is responsible for it? I'm happy to help debug things if I have access to the logs, but it's definitely not the deployment that Fedora uses, so I know nothing of its configuration (datagrepper and datanommer alike). Thanks!