Inconsistent results from datagrepper API for specific version queries

Open rrasouli opened this issue 8 months ago • 1 comments

### Description of the issue

I'm experiencing unstable and inconsistent results when querying the datagrepper API for Windows Machine Config Operator (WMCO) index images. Specifically:

- Queries for some versions (e.g., 4.18) work sometimes but fail other times.
- Queries for newer versions (e.g., 4.19) consistently return no results.
- The inconsistency breaks automation that relies on this data to dynamically discover and mirror WMCO index images, especially for disconnected or Konflux-driven deployments.

### Reproduction steps

I'm using the following shell function to query for WMCO index images for specific OpenShift versions:

```bash
get_latest_wmco_index_image() {
  local version
  version="4.18"  # Also tried with "4.19"
  local ocp_tag="release-${version//./-}"
  local start=$(date -d '60 days ago' +%s)
  local end=$(date +%s)
  curl -s "https://datagrepper.engineering.redhat.com/raw?topic=/topic/VirtualTopic.eng.iib.build.state&contains=windows-machine-conf-tenant&start=${start}&end=${end}" \
    | jq -r --arg tag "$ocp_tag" '
        .raw_messages[]
        | select(.msg.index_image_resolved != null and .msg.state == "complete")
        | select(.msg.fbc_fragment | test($tag))
        | .msg.index_image_resolved' \
    | head -n1
}
```
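
Two client-side effects may be worth ruling out before blaming the data itself. First, upstream datagrepper returns `/raw` results one page at a time (25 rows by default and 100 at most, as far as I know), so the function above only inspects the newest page. Second, jq's `test` raises an error on non-string input, so a single matching-state message without an `fbc_fragment` stops the filter before later messages are examined. The sketch below walks every page and tolerates a missing fragment. It assumes this deployment honours the upstream `rows_per_page`, `page`, and `order` parameters and reports a `pages` count in each response; the helper names `build_query_url`, `fetch_all_messages`, and `latest_index_image` are hypothetical.

```bash
# Sketch only: parameter names follow upstream datagrepper and are an
# assumption for this internal deployment.
DATAGREPPER_URL="https://datagrepper.engineering.redhat.com/raw"
TOPIC="/topic/VirtualTopic.eng.iib.build.state"

# Build the query URL for one page of results (hypothetical helper).
build_query_url() {
  local start="$1" end="$2" page="$3"
  printf '%s?topic=%s&contains=%s&start=%s&end=%s&rows_per_page=100&order=desc&page=%s\n' \
    "$DATAGREPPER_URL" "$TOPIC" "windows-machine-conf-tenant" "$start" "$end" "$page"
}

# Print every raw message in the time range, one compact JSON object per
# line, by walking all pages instead of reading only the first one.
fetch_all_messages() {
  local start="$1" end="$2"
  local page=1 pages=1 body
  while (( page <= pages )); do
    body=$(curl -sf "$(build_query_url "$start" "$end" "$page")") || return 1
    # "pages" is the total page count reported by upstream datagrepper.
    pages=$(jq -r '.pages // 1' <<<"$body") || return 1
    jq -c '.raw_messages[]?' <<<"$body"
    page=$(( page + 1 ))
  done
}

# Print the newest resolved index image for an OCP version such as "4.19".
latest_index_image() {
  local version="$1"
  local tag="release-${version//./-}"
  fetch_all_messages "$(date -d '60 days ago' +%s)" "$(date +%s)" \
    | jq -r --arg tag "$tag" '
        select(.msg.state == "complete" and .msg.index_image_resolved != null)
        # A null fbc_fragment becomes "" so test() cannot abort the stream.
        | select((.msg.fbc_fragment // "") | test($tag))
        | .msg.index_image_resolved' \
    | head -n1
}
```

If `latest_index_image 4.18` behaves consistently where the original function does not, the intermittent failures are more likely a paging or filtering artifact than missing data. It would not explain 4.19, for which no matching fragment appears at all.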

### Expected behavior
The function should consistently return the latest index image for both OCP 4.18 and 4.19 versions if they exist.

### Actual behavior
- For version 4.18: The query currently returns a result (`registry-proxy.engineering.redhat.com/rh-osbs/iib@sha256:752977b3d70ef3a1669c2cf9e4a466b18db5395615b3bb9ad01de60b6a0cbbb5`). On April 8th, however, it returned no results even though valid builds existed.
- For version 4.19: The query consistently returns no results, even with a longer time range.

### API response analysis
To debug this issue, I created diagnostic commands that save raw API outputs and check for matching entries:

```bash
# Output for OCP 4.18
get_wmco_api_output 4.18
Querying for OCP version 4.18 (tag: release-4-18)
Time range: 1739977629 to 1745161629
Raw API output saved to: ./datagrepper_debug/datagrepper_wmco_4.18_raw.json
Filtered results saved to: ./datagrepper_debug/datagrepper_wmco_4.18_filtered.json
Number of matching entries: 1
Index images found:
registry-proxy.engineering.redhat.com/rh-osbs/iib@sha256:752977b3d70ef3a1669c2cf9e4a466b18db5395615b3bb9ad01de60b6a0cbbb5

# Output for OCP 4.19
get_wmco_api_output 4.19
Querying for OCP version 4.19 (tag: release-4-19)
Time range: 1739977680 to 1745161680
Raw API output saved to: ./datagrepper_debug/datagrepper_wmco_4.19_raw.json
Filtered results saved to: ./datagrepper_debug/datagrepper_wmco_4.19_filtered.json
Number of matching entries: 0
Index images found:
```

Additionally, I checked all available fbc_fragments in the API responses:

```bash
check_fbc_fragments
Examining all fbc_fragments in time range: 1739977724 to 1745161724
All fbc_fragments saved to: ./datagrepper_debug/datagrepper_wmco_all_fragments.json
Unique fragment patterns found:
quay.io/redhat-user-workloads/windows-machine-conf-tenant/windows-machine-config-operator-fbc/windows-machine-config-operator-fbc-master@sha256:7b06c97490346b03660b4458abcffe8f9695d48cffbe959833f8105158e5af3f
quay.io/redhat-user-workloads/windows-machine-conf-tenant/windows-machine-config-operator-fbc/windows-machine-config-operator-fbc-release-4-12@sha256:e62d4db6582d32dc628d138449aace9e996fa5b7244fa282ac42ae2fc5562f45
quay.io/redhat-user-workloads/windows-machine-conf-tenant/windows-machine-config-operator-fbc/windows-machine-config-operator-fbc-release-4-14@sha256:550ea14b4bf0a0694d42d1bb1736ae354c701eb7c03c4e08fdf93b0202281528
quay.io/redhat-user-workloads/windows-machine-conf-tenant/windows-machine-config-operator-fbc/windows-machine-config-operator-fbc-release-4-15@sha256:3216a0e3058c19366eb74554922974b2b02bdb294feb6033498ee9306791526b
quay.io/redhat-user-workloads/windows-machine-conf-tenant/windows-machine-config-operator-fbc/windows-machine-config-operator-fbc-release-4-16@sha256:c649188e9e556a49415e55cf127b4eeade6db8b040cc4536f241a4806fde7f09
quay.io/redhat-user-workloads/windows-machine-conf-tenant/windows-machine-config-operator-fbc/windows-machine-config-operator-fbc-release-4-17@sha256:c283f2f7990971446ab6f7c5aa3661577e1aadcd31752f864fba98af652a6c4e
quay.io/redhat-user-workloads/windows-machine-conf-tenant/windows-machine-config-operator-fbc/windows-machine-config-operator-fbc-release-4-18@sha256:2c78a510ccedc5538aaf8c245ceb52bbdfa4397d8a0a7e28b7b672ad92d3124e
```

This confirms that entries exist for 4.18 but not for 4.19, suggesting the issue might be related to one of the following:

- Missing builds for 4.19
- Data not being properly indexed in datagrepper
- Inconsistent data publication to the messaging bus
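
To make gaps in the fragment list easier to spot, the pull specs can be reduced to the stream each one encodes. The following is a minimal sketch in plain bash; the function names `fragment_streams` and `has_stream` are hypothetical, and the pattern `-fbc-release-X-Y@` is inferred only from the fragments listed above. Applied to that list, it would report 4.12 and 4.14 through 4.18 plus `master`, so 4.13 is absent as well as 4.19. Whether the `master` fragment carries builds for a release still in development is an open question that this data does not answer.

```bash
# Read fbc_fragment pull specs on stdin, one per line, and print the stream
# each one encodes: "4.18" for "...-fbc-release-4-18@sha256:...", or the bare
# suffix (for example "master") when no release-X-Y component is present.
fragment_streams() {
  local spec
  while IFS= read -r spec; do
    if [[ $spec =~ -fbc-release-([0-9]+)-([0-9]+)@ ]]; then
      printf '%s.%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    elif [[ $spec =~ -fbc-([A-Za-z0-9_.-]+)@ ]]; then
      printf '%s\n' "${BASH_REMATCH[1]}"
    fi
  done | sort -u
}

# Succeed when the given version (for example "4.19") is among the streams
# found in the pull specs read from stdin.
has_stream() {
  local version="$1"
  fragment_streams | grep -qxF -- "$version"
}

# Example (assumes the saved file has the same shape as the API response):
#   jq -r '.raw_messages[].msg.fbc_fragment // empty' \
#     ./datagrepper_debug/datagrepper_wmco_4.19_raw.json | fragment_streams
```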

### Related Issues

This appears to be similar to a previously reported issue: [KFLUXSPRT-2641](https://issues.redhat.com/browse/KFLUXSPRT-2641), where queries filtered by the expected fbc_fragment tag format were returning empty results despite valid image builds existing.

### Impact

This inconsistency breaks automation workflows that rely on datagrepper to programmatically discover and mirror the latest WMCO index images for different OpenShift versions, especially for disconnected or Konflux-driven deployments.

### Business Impact

This inconsistency in the datagrepper API has several critical impacts:

- Automation Reliability: Our CI/CD pipelines depend on being able to consistently discover the latest WMCO index images for different OpenShift versions.
- Disconnected Environments: For customers in air-gapped or disconnected environments, our tools need to pre-mirror these images, which requires reliable discovery.
- Risk of Missing Updates: When the API fails to return results for certain versions (like 4.19), environments may miss critical security or feature updates.
- Engineering Time: The intermittent nature of the issue leads to significant debugging time and workarounds that could be better spent on feature development.

### Previous Workarounds

We've attempted several workarounds including:

- Extending the time range (up to 90 days) for the API query
- Implementing retry logic with exponential backoff
- Falling back to manually identified images when automation fails

None of these approaches provide a sustainable solution for reliable automation.
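
A retry wrapper of the kind described above can be sketched as follows. This is an illustration, not the exact script used: the name `retry_with_backoff` and its arguments are hypothetical. It treats empty output as a failure, because the query function exits successfully even when it finds nothing.

```bash
# Run a command until it succeeds and prints something, doubling the wait
# between attempts.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS cmd [args...]
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt output
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    # Success means exit status 0 and non-empty output.
    if output=$("$@") && [[ -n $output ]]; then
      printf '%s\n' "$output"
      return 0
    fi
    if (( attempt < max_attempts )); then
      echo "attempt ${attempt}/${max_attempts} returned nothing; retrying in ${delay}s" >&2
      sleep "$delay"
      delay=$(( delay * 2 ))
    fi
  done
  return 1
}

# Example: up to 5 attempts, waiting 2s, 4s, 8s, 16s between them.
#   retry_with_backoff 5 2 get_latest_wmco_index_image
```

Retrying only helps with transient failures. If the matching message is never present in the response, every attempt returns the same empty result, which is consistent with what we see for 4.19.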

### Proposed Solutions

Potential fixes might include:

- Ensuring consistent fbc_fragment formatting across all versions
- Adding better error reporting to the API when no results are found
- Creating a more direct API endpoint specifically for discovering the latest index images by version

### Environment

- OS: Linux
- Shell: /usr/bin/zsh (GNU bash, version 4.4.20(1)-release (x86_64-redhat-linux-gnu))
- curl version: curl 7.61.1 (x86_64-redhat-linux-gnu) libcurl/7.61.1 OpenSSL/1.1.1k zlib/1.2.11 brotli/1.0.6 libidn2/2.2.0 libpsl/0.20.2 (+libidn2/2.2.0) libssh/0.9.6/openssl/zlib nghttp2/1.33.0
- jq version: jq-1.6

### Additional context

I've tried different time ranges beyond the default 60 days but still encounter the same issues. These images are critical for automating Windows node provisioning in OpenShift environments.


[datagrepper_wmco_4.18_filtered.json](https://github.com/user-attachments/files/19825870/datagrepper_wmco_4.18_filtered.json)
[datagrepper_wmco_4.18_raw.json](https://github.com/user-attachments/files/19825873/datagrepper_wmco_4.18_raw.json)
[datagrepper_wmco_4.19_filtered.json](https://github.com/user-attachments/files/19825871/datagrepper_wmco_4.19_filtered.json)
[datagrepper_wmco_4.19_raw.json](https://github.com/user-attachments/files/19825872/datagrepper_wmco_4.19_raw.json)
[datagrepper_wmco_all_fragments.json](https://github.com/user-attachments/files/19825869/datagrepper_wmco_all_fragments.json)

[datagrepper_wmco_4.18_raw_redacted.json](https://github.com/user-attachments/files/19825887/datagrepper_wmco_4.18_raw_redacted.json)
[datagrepper_wmco_4.19_raw_redacted.json](https://github.com/user-attachments/files/19825889/datagrepper_wmco_4.19_raw_redacted.json)
[datagrepper_wmco_all_fragments_redacted.json](https://github.com/user-attachments/files/19825888/datagrepper_wmco_all_fragments_redacted.json)

Apr 20 '25 15:04 rrasouli

Hey! Unfortunately, I'm not familiar with this datagrepper deployment. Could you please contact the person at Red Hat who is responsible for it? I'm happy to help debug things if I have access to the logs, but it's definitely not the deployment that Fedora uses, so I know nothing about its configuration (datagrepper and datanommer alike). Thanks!

Apr 22 '25 07:04 abompard