sensu-go
sensu-go copied to clipboard
Prevent creation of no-expiry silences
Feature Suggestion
As a workflow practice, oftentimes creating silences without expiration (or "unlimited expiration" silences) is problematic for large teams where it's easy to forget a silence is in place. These cases may call for policies for teams to "please never create silences without expiration" but there is presently no way to enforce this in the API. It would be great if there was an opt-in configuration which prevented the creation of silences without expirations.
Possible Implementation
Due to the possibility of a breaking change if this behavior were changed by default, a mechanism to opt-in to this change could provide a way to implement it without having to wait until a major version change.
A workaround is possible here by creating a sensu check which hits the API and reports on the presence of any silences without expirations, however that creates a reactive workflow and isn't ideal.
If we do make an opt-in change to the behavior, would it be possible to include the option of setting a default silence period, so that for example when no expiry is set, then all silences fall-back to the default config?
Perhaps we could make this a non-breaking change by adding a new sensu-backend --minimum-silence-duration
flag that defaults to 0 (no expiration), but could be increased to a non-zero value.
Is it possible to make Silenced API resource validation compare resources against this setting?
@calebhailey I like that approach
@portertech has pointed out that this would need to be a cluster-level configuration setting (similar to web/v1.GlobalConfig
), and that this is getting into "policy enforcement" territory, which would require broader consideration (i.e. how would we handle policy across the entire product). TBD
Noting for posterity that this could be implemented as a Sensu secret (API Key) + check + filter + handler template (one YAML file) that does all of the following with no changes to Sensu:
- check the silence API for silences w/ expiry:0
- handler set to alert on first occurrence (e.g. Slack notification)
- "remediation" action to delete the Silence on Nth occurrence (which occurrence wouldn't be reached if the silence was updated with an expiration)
- this "delete after Nth occurrence" logic would be roughly equivalent to a "policy"
- instead of "delete after Nth occurrence" we could "update after Nth occurrence" where the update sets the silence expiration to the default value.
Another customer has requested a similar enhancement to require the use of expire_on_resolve: true
.
See: https://secure.helpscout.net/conversation/1423316647/24517?folderId=1931916
Moving to 6.9.0 in favour of shipping 6.8.0 earlier.
I do not think that a config flag or a global config resource like web/v1.GlobalConfig
is the right answer here. We should probably rework silences in 7.0 to require an expiration method, but within 6.x this feels like a rather large time commitment for a fairly small change that I suspect will be questionably acceptable.
I suggest we begin perusing one of these options instead:
- Policy management Either built in to sensu or a part of sensu-flow https://github.com/sensu/sensu-flow/issues/24 to allow operators to configure policies more flexibly than our RBAC implementation. In this example a policy could prevent non-admins from creating no-expiry silences.
- A generic sensu resource monitoring check that can be configured to find resources using a pattern (i.e.
api/v2.Silenced.Spec.Expire > 0
) that can be configured to work in conjunction with the sensu-remediation-handler or similar to remove the offending resources. - A single use sensu no-expiry resource check that can be set up to alert operators and potentially work with the sensu-remediation-handler or similar to remove the silences.
@portertech @agoddard - before he left Caleb mentioned he was thinking about implementing this as a restriction in the UI. He was supposed to add the details before he left but it slipped through the cracks. Do you know more details about this?
@fguimond I would imagine Caleb means doing validation client-side that prevents such silences from being created, but not restricting them altogether API-wise.
We believe requiring a Web UI user to explicitly set an expiration for a silence is sufficient.
@gd8 @dabria making expire a required form field is step 1, however, what can be done to improve the experience of setting the field?
If we make the field required, a better UX would definitely be to have a default value (surprise, surprise) – we'd then have to decide where this default value comes from.
An option is to have it as a value in the GlobalConfig defaulted to 0 as Caleb suggested above. However, based on comments from @portertech @c-kruse, it seems that we'd want silence expiry to be more granular (e.g. by cluster) – is this still the case if the GlobalConfig field is controlling a default value in the Web-UI form?
Ok the engineering team discussed this briefly today here in a slack thread
My understanding of the general consensus is that adding global configuration for silence expiration unnecessarily complicates the product and isn't worth it because it doesn't affect every user.
@agoddard my proposed alternative is to make it easier to find/surface silences with no expiration by adding a filter or sorting mechanism to the silences list in the Web UI. This way, teams concerned with these silences could bring them up easily to delete any problematic ones. Do you think this would solve or alleviate the original issue?