gh-action-pypi-publish

[TODO] Explore handling HTTP errors on Rekor flakiness

Open webknjaz opened this issue 6 months ago • 10 comments

@woodruffw @facutuesca we recently saw an HTTP 502 and a traceback in the attestations flow:

Traceback (most recent call last):
  File "/root/.local/lib/python3.12/site-packages/sigstore/_internal/rekor/client.py", line 160, in post
    resp.raise_for_status()
  File "/root/.local/lib/python3.12/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://rekor.sigstore.dev/api/v1/log/entries/

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/attestations.py", line 149, in <module>
    main()
  File "/app/attestations.py", line 145, in main
    attest_dist(dist_path, attestation_path, signer)
  File "/app/attestations.py", line 114, in attest_dist
    attestation = Attestation.sign(signer, dist)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/pypi_attestations/_impl.py", line 200, in sign
    bundle = signer.sign_dsse(stmt)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/sigstore/sign.py", line 230, in sign_dsse
    return self._finalize_sign(cert, content, proposed_entry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/sigstore/sign.py", line 189, in _finalize_sign
    entry = self._signing_ctx._rekor.log.entries.post(proposed_entry)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/sigstore/_internal/rekor/client.py", line 162, in post
    raise RekorClientError(http_error)
sigstore._internal.rekor.client.RekorClientError: Rekor returned an unknown error with HTTP 502

(https://github.com/aio-libs/aiohttp/actions/runs/15359675323/job/43225662768#step:9:384)

Mind taking a look?

webknjaz avatar Jun 01 '25 09:06 webknjaz

Maybe this should be fixed in the sigstore lib; I'm not sure. If not, I suspect Twine would need some handling as well.

webknjaz avatar Jun 01 '25 10:06 webknjaz

Yeah, it looks like there was some kind of Rekor hiccup/short outage over the weekend -- pinging @haydentherapper since he might know more 🙂

Looks like there are two things sigstore-python could do better here:

  • We could probably produce a more explanatory error here (both for API and CLI users)
  • We should probably have some kind of retry handling for Rekor API calls, although in this case that probably wouldn't have helped much
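
The retry idea in the second bullet could look roughly like the sketch below. It is only an illustration of retrying transient gateway errors with exponential backoff, not sigstore-python's actual code: `TransientHTTPError`, `RETRYABLE_STATUSES`, and `call_with_retries` are hypothetical names invented for this example, and the choice of 502/503/504 as retryable statuses is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: gateway-style statuses are treated as transient and worth retrying.
RETRYABLE_STATUSES = frozenset({502, 503, 504})


class TransientHTTPError(Exception):
    """Hypothetical stand-in for an HTTP error that carries a status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"server returned HTTP {status}")
        self.status = status


def call_with_retries(
    func: Callable[[], T],
    *,
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying retryable HTTP errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except TransientHTTPError as exc:
            is_last_attempt = attempt == attempts - 1
            # Give up on non-transient statuses or once the attempts run out.
            if exc.status not in RETRYABLE_STATUSES or is_last_attempt:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")
```

A bounded retry like this would only smooth over short blips; during an outage lasting minutes or hours the final attempt still fails and the error propagates, which is consistent with the remark that retries probably would not have helped much in this case.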

woodruffw avatar Jun 02 '25 15:06 woodruffw

yeah, that's exactly what I was thinking

webknjaz avatar Jun 02 '25 15:06 webknjaz

Yeah, there was a brief outage over the weekend; we're still investigating the root cause.

haydentherapper avatar Jun 02 '25 16:06 haydentherapper

I just wanted to chime in and say this bug has been affecting me for a couple hours, so I had to revert to setting attestations: false in my GitHub Actions workflow. Any idea how frequently this occurs?

fletchapin avatar Aug 06 '25 03:08 fletchapin

We're having an outage at the moment; this appears to be due to our cloud provider, not Rekor itself. We'll update as there's more information.

haydentherapper avatar Aug 06 '25 03:08 haydentherapper

I see, that's interesting because my original action had an identical stack with the Rekor error: https://github.com/we3lab/pype-schema/actions/runs/16766504059/job/47472565963

And with attestations: false it works: https://github.com/we3lab/pype-schema/actions/runs/16766590355

Figured I would share in case it helps debug.

fletchapin avatar Aug 06 '25 03:08 fletchapin

GCP's issue appears to have been resolved; requests are now succeeding.

haydentherapper avatar Aug 06 '25 04:08 haydentherapper

Looks like there was another outage yesterday: #376 / #377. I wonder what we can do about it.

cc @woodruffw @haydentherapper

webknjaz avatar Aug 14 '25 08:08 webknjaz

Yeah, this one looks even more concerning to me: it looks like both Rekor and Fulcio were failing for a period. I've raised it on the Sigstore Slack.

woodruffw avatar Aug 14 '25 14:08 woodruffw