gateway-api Chaos Engineering/Fault Injection

What would you like to be added:

An API for injecting faults into routes. Typical faults include a delay or a fixed HTTP response, among others.

Why this is needed:

Fault injection (chaos engineering) is a fairly common engineering practice of intentionally introducing errors into a system to exercise disaster recovery and other reliability mechanisms.

Prior art:

Istio

Slightly related to https://github.com/kubernetes-sigs/gateway-api/issues/2826

Jan 29 '25 15:01 howardjohn

How do you see this being implemented: as a testing library that we provide to implementations, or just as an API?

Feb 11 '25 18:02 shaneutt

No, just an HTTPRoute filter or policy. Like https://istio.io/latest/docs/reference/config/networking/virtual-service/#HTTPFaultInjection, for example. Users or tooling can then use that to build a holistic chaos engineering strategy.

Feb 11 '25 19:02 howardjohn

Ok, just wanted to be sure. The API specification for this sounds good, but the reason I asked about test libs is that I wouldn't be opposed to discussing having some standard "Gateway API Fault Injection" tooling. 🤔

In any case, I'm supportive of further discussing and working towards a proposal here.

Feb 12 '25 19:02 shaneutt

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

May 13 '25 20:05 k8s-triage-robot

/remove-lifecycle stale

Jun 06 '25 02:06 craigbox

Hi @craigbox 👋

We see you've dropped the stale label on this one. Did you have plans or some thoughts on how we can move the conversation forward here?

Jun 06 '25 11:06 shaneutt

Hi Shane,

No particular plans or thoughts, sorry; just offering some community housekeeping.

Your comment reminded me of a similar one here. I understand the constraints that we all operate under, but I don't think the passage of time has changed the request in a way that suggests it is "stale"; rather, it just hasn't made it to the top of anyone's priority list yet. (To me, "stale" implies that time may have diminished the validity of a bug report or reduced its impact on a user, which doesn't seem to apply here.)

My intent was to link to this issue when highlighting that Gateway API doesn't support this feature, meaning Istio users must continue to use the legacy APIs. I think it is a better experience for someone finding this issue to see that it remains an open request, rather than seeing it "closed", which can imply that the feature request was not valid.

Jun 11 '25 00:06 craigbox

Gotcha. In case it's helpful: we don't consider closed and stale to mean invalid, or "we will never do this". Rather, it's often an accurate representation of priority. If no community member can come forward and champion the issue, and there's no support to iterate on it during a release window, then it does not have priority, and closed is often how we reflect this when the situation persists for a long time (see our documentation on bumping stale issues for the more official stance on our approach).

Another good example is Rate Limiting. I personally feel this is absolutely something we should have in the API, but it sat for so long, and nobody (including myself) had the priority to move it forward. Closing it as "will re-open when someone is specifically ready to drive this" may help remove some of the ambiguity that comes with "perpetually open without movement for many years".

Neither option is ideal of course, but compromises are always part of the process. In any case, we are glad that you continue to have interest in this feature. Perhaps you could put something on the agenda for an upcoming community meeting to talk about your interest in it, and promote it some to see if some new support can be garnered to move it forward?

Jun 11 '25 22:06 shaneutt

/lifecycle stale

Sep 09 '25 22:09 k8s-triage-robot

/lifecycle rotten

Oct 09 '25 23:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Nov 09 '25 00:11 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Nov 09 '25 00:11 k8s-ci-robot
