Chaos Engineering/Fault Injection

Open howardjohn opened this issue 10 months ago • 8 comments

What would you like to be added:

An API to inject faults into routes. Typically, a fault is a delay, an injected HTTP response, or something similar.

Why this is needed:

Fault injection / chaos engineering is a somewhat common engineering practice: intentionally introducing errors into a system to test disaster recovery and other reliability mechanisms.

Prior art:

Slightly related to https://github.com/kubernetes-sigs/gateway-api/issues/2826

howardjohn avatar Jan 29 '25 15:01 howardjohn

How do you see this being implemented, as a testing library that we provide to implementations or just an API?

shaneutt avatar Feb 11 '25 18:02 shaneutt

No, just a HTTPRoute filter or policy. Like https://istio.io/latest/docs/reference/config/networking/virtual-service/#HTTPFaultInjection, for example. Then users or tooling can utilize that to build a holistic chaos engineering strategy
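
To make the shape of such a filter concrete, here is a minimal Python sketch of the decision logic it implies, loosely modeled on the `delay` and `abort` blocks of Istio's HTTPFaultInjection. Everything in it is an assumption for illustration: the `Delay`, `Abort`, `Fault`, `Decision`, and `decide` names are hypothetical, nothing here is part of Gateway API, and the two fault types are assumed to be sampled independently.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Delay:
    fixed_delay_seconds: float  # how long to hold the request
    percentage: float           # share of requests to delay, 0-100


@dataclass
class Abort:
    http_status: int   # status code returned instead of proxying
    percentage: float  # share of requests to abort, 0-100


@dataclass
class Fault:
    # Hypothetical filter config; either part may be omitted.
    delay: Optional[Delay] = None
    abort: Optional[Abort] = None


@dataclass
class Decision:
    delay_seconds: float = 0.0
    abort_status: Optional[int] = None


def decide(fault: Fault, rng: random.Random) -> Decision:
    """Sample what should happen to one request under the given fault."""
    decision = Decision()
    # rng.random() is in [0.0, 1.0), so 100% always fires and 0% never does.
    if fault.delay is not None and rng.random() < fault.delay.percentage / 100:
        decision.delay_seconds = fault.delay.fixed_delay_seconds
    if fault.abort is not None and rng.random() < fault.abort.percentage / 100:
        decision.abort_status = fault.abort.http_status
    return decision
```

For example, `decide(Fault(delay=Delay(5.0, 100), abort=Abort(503, 0)), random.Random())` always delays by five seconds and never aborts. Passing the random source in explicitly keeps the sampling testable, which matters if tooling is built on top of the filter.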

howardjohn avatar Feb 11 '25 19:02 howardjohn

> No, just a HTTPRoute filter or policy. Like https://istio.io/latest/docs/reference/config/networking/virtual-service/#HTTPFaultInjection, for example. Then users or tooling can utilize that to build a holistic chaos engineering strategy

Ok, just wanted to be sure. The API specification for this sounds good, but the reason I asked about test libs is that I wouldn't be opposed to discussing having some standard "Gateway API Fault Injection" tooling. 🤔

In any case, I'm supportive of further discussing and working towards a proposal here.

shaneutt avatar Feb 12 '25 19:02 shaneutt

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 13 '25 20:05 k8s-triage-robot

/remove-lifecycle stale

craigbox avatar Jun 06 '25 02:06 craigbox

Hi @craigbox 👋

We see you've dropped the stale on this one, did you have plans or some thoughts on how we can move the conversation forward here?

shaneutt avatar Jun 06 '25 11:06 shaneutt

Hi Shane,

No particular plans or thoughts sorry, just offering some community housekeeping.

Your comment reminded me of a similar one here. I understand the constraints that we all operate under, but I don't think the passage of time has changed the request in a way that suggests it is "stale"; rather, it just hasn't made it to the top of anyone's priority list yet. (To me, "stale" implies that time may have diminished the validity of a bug report or reduced its impact on a user, which doesn't seem to apply here.)

My intent was to link to this issue in highlighting that Gateway API doesn't support this feature, meaning Istio users must continue to use the legacy APIs. I think it is a better experience for someone seeing this issue to see that it remains an open request, rather than seeing it "closed", which can imply that the feature request was not valid.

craigbox avatar Jun 11 '25 00:06 craigbox

Gotcha. In case it's helpful: we don't consider closed or stale to mean invalid, or "we will never do this". Rather, it's often an accurate representation of priority. If no community member can come forward and champion the issue, and there's no support to iterate on it during a release window, then it does not have priority, and closing is often how we reflect this when the situation persists for long periods of time (see our documentation on bumping stale issues for the more official stance on our approach).

Another good example is Rate Limiting. I personally feel this is absolutely something we should have in the API, but because it sat for so long and nobody (including myself) had the priority to move it forward, closing it as "will re-open when someone is specifically ready to drive this" may help remove some of the ambiguity that comes with "perpetually open without movement for many years".

Neither option is ideal of course, but compromises are always part of the process. In any case, we are glad that you continue to have interest in this feature. Perhaps you could put something on the agenda for an upcoming community meeting to talk about your interest in it, and promote it some to see if some new support can be garnered to move it forward?

shaneutt avatar Jun 11 '25 22:06 shaneutt

/lifecycle stale

k8s-triage-robot avatar Sep 09 '25 22:09 k8s-triage-robot

/lifecycle rotten

k8s-triage-robot avatar Oct 09 '25 23:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Nov 09 '25 00:11 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Nov 09 '25 00:11 k8s-ci-robot