test-infra
Add a webhook job reporter
This PR proposes a webhook reporter_config for Prow jobs.
Why
Within our alerting system, we want to perform some analysis before reporting job failures to a human. A configurable webhook eliminates the need for an external system to poll for Prow results. Moreover, the external analysis system does not need to be explicitly configured with each available job; it can simply react to incoming webhooks.
What
In this proposal, the reporter_config.webhook configuration accepts a URL to which the JSON-serialised ProwJob information is POSTed. For future-proofing, the v1.ProwJob object is wrapped in a JSON object that contains a single prow_job property.
Webhook authentication
In order to authenticate the webhook HTTP calls, they are digitally signed with a user-provided Ed25519 key. The signature is sent in an x-prow-token header. The timestamp included in the signed payload is passed in an x-prow-timestamp header.
The signed digest is a null-byte-joined concatenation of:
- the target URL
- an RFC3339-formatted timestamp of the current time in the UTC timezone
- the request body
Note that this patch does not provide a public key dissemination mechanism.
Review needed
Once (and if) there is agreement on this concept, this proposal needs thorough review on a few key points, in particular:
A) the cryptography around webhook authentication
B) the (lack of) public key dissemination
C) accessory functions such as ShouldReport
A) the cryptography around webhook authentication
In order to protect target servers, the webhooks must be resistant to tampering and to replay attacks. This proposal uses the native Go stdlib implementation of the Ed25519 signature algorithm. The message digest is passed unhashed to the Sign function and included as a header.
B) the (lack of) public key dissemination
The Prow server (or its administrators) should publish at least a couple of keys Prow might use to sign webhooks. This PR does not give any hints as to how or where these public keys should be exposed.
C) accessory functions such as ShouldReport
I am no expert in the test-infra tooling; I have basically copy-pasted the code I found in the Slack reporter.
Instructions for the webhook target
In this proposal, the webhook receiver must, before accepting the message:
- check that the timestamp in x-prow-timestamp is reasonably recent;
- reconstruct the webhook digest using the instructions above;
- verify that the signature in x-prow-token matches (one of) the key(s) provided out-of-band by the Prow administrators.
Any suggestion is very welcome.
Hi @pierreprinetti. Thanks for your PR.
I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
/cc
I like the feature and want to follow this, but I'm not currently able to give this a full review, don't block on me
This is how GitHub does this: https://docs.github.com/en/developers/webhooks-and-events/webhooks/securing-your-webhooks. They're using an HMAC with the shared key provided by the user. I can't remember how Prow does secret management, but could it be used for this? e.g. Is there a secure user-managed vault a secret could be fetched from? If so, I think it could be simpler to manage.
Delegating secret management to "users" (as in: owners of the repo config) would be perfect. However, as far as I know (which is admittedly not very far: please prove me wrong), there is no way for Crier to access per-repository secrets. And since per-repository CI configuration can be considered public (it is certainly public in OpenShift), I wouldn't store secrets there. Hence the idea of using an asymmetric device (ed25519) and letting Prow admins take care of secret management, like they do with the Slack token.
Have you written a client? Could you include it both for reference and some tests?
Good idea; I'm going to include that.
If the purpose of the webhook is to prevent polling, do we need to include the whole job in the payload? Could we include just the job id, then the webhook could go fetch it. Not sure how important this is.
I had the same doubt to be honest. I opted for including the whole job to save a round-trip. Some target servers may even issue zero calls to Prow depending on the job result, if they receive the ProwJob object. However, I am happy to remove everything but the job URL if that's better for any reason.
The GitHub implementation doesn't seem to use a timestamp or nonce afaict. Does that mean that they're requiring webhooks to be vulnerable to a replay attack, or that it's mitigated in some other way I haven't thought of (and by extension we may not need it either).
They might be counting on idempotency? I don't know.
One alternative design for signing that avoids long-lived-key management, could be to have Prow generate one ephemeral ed25519 key per job, and post the public key somewhere trusted (e.g. in the job page). However that would require the target server to make an HTTP call each time it receives a webhook, exposing both the target server and Prow to DoS.
I have added a test to prow/cmd/crier/main_test.go containing the logic that is expected to be implemented on the receiver side. PTAL: https://github.com/kubernetes/test-infra/pull/28341/files#diff-95aeae5b47136e2d3cece771b86541351b023fb404c21e2feb66f221bd25b6baR326-R370
CC @mdbooth
@mdbooth, re: timestamp Since the target server has to accept timestamps within an arbitrary time interval, and that we cannot exclude replays within that interval, perhaps idempotency is a requisite anyway and the timestamp is just a false friend we should remove. WDYT?
/lgtm
Do we expect the webhook to be stateful? If the webhook is stateful we can safely do without a timestamp because we can ignore duplicates.
Maybe we should suggest that the target server be stateful, and still provide a timestamp. Unless someone (you?) has a rationale against that.
After some thought, I think that:
- the key file would be easier to manage if the key was persisted base64-encoded rather than binary
- there is zero reason on Earth to encode the signature in hex rather than base64 in the webhook header
- since the key is parsed from the secret each time it is used, it would be more efficient to encode the whole seed+pubkey block rather than just the seed.
@smg247 PTAL; I have implemented the changes described above.
Check out the rendered preview of the documentation PR for this change: https://deploy-preview-45--k8s-prow.netlify.app/docs/components/core/crier/#webhook-reporterhttpsgithubcomkubernetestest-infratreemasterprowcrierreporterswebhook
/assign @cjwagner
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: cjwagner, pierreprinetti
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~config/prow/cluster/OWNERS~~ [cjwagner]
- ~~prow/OWNERS~~ [cjwagner]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/hold
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@pierreprinetti Is this dead?
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle rotten
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
PR needs rebase.
@pierreprinetti: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-test-infra-misc-image-build-test | df84bd76ce6f63e3a534234ac8a377fb7c666d0b | link | true | /test pull-test-infra-misc-image-build-test |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Hello @pierreprinetti, the Prow source code was moved to https://github.com/kubernetes-sigs/prow on April 9, 2024. Please consider migrating this PR (by opening a new one there). Thanks!
That's good news. Thank you for the heads-up!