Alert the providers team when the registry is not able to publish
Issue details
The providers team does not have the bandwidth to monitor docs update PRs after the provider upgrade has finished. There needs to be a mechanism to alert us when docs fail to publish.
Ideally that mechanism would distinguish between "the release is bad, and must be adjusted" vs "the registry failed to publish, please try again".
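One way to make that distinction is to classify each failed publish before alerting. The following is a minimal Python sketch that assumes the build log is available as plain text. `PublishFailure`, `classify_failure`, and the marker strings are hypothetical names. They are not part of the registry's actual tooling.

```python
from enum import Enum


class PublishFailure(Enum):
    """The two outcomes the providers team wants to tell apart."""
    BAD_RELEASE = "the release is bad, and must be adjusted"
    TRANSIENT = "the registry failed to publish, please try again"


# Hypothetical markers: substrings that would point at a problem with the
# provider release itself rather than with the registry build. The real
# registry build may report these errors differently.
BAD_RELEASE_MARKERS = ("invalid schema", "schema validation", "docs generation failed")


def classify_failure(log_text: str) -> PublishFailure:
    """Classify a failed publish from its (assumed plain-text) build log."""
    lowered = log_text.lower()
    if any(marker in lowered for marker in BAD_RELEASE_MARKERS):
        return PublishFailure.BAD_RELEASE
    # Anything unrecognized is treated as retryable; a human can escalate.
    return PublishFailure.TRANSIENT
```

Unrecognized failures default to retryable in this sketch. Defaulting to `BAD_RELEASE` instead would trade more false alarms for fewer missed bad releases.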
@sean1588 Can we use the #docs-ops alerting system for this?
@thoward, alerts for registry publishing failures have always been reported in the #docs-ops channel. I just merged a PR that moves that alerting to #registry-ops, a channel I configured a couple of days ago for this purpose so that it can be shared with the providers side of the house.
See: https://github.com/pulumi/registry/pull/5636
> Ideally that mechanism would distinguish between "the release is bad, and must be adjusted" vs "the registry failed to publish, please try again".
@iwahbe, @guineveresaenger: currently this handles the "registry failed to publish" case, which can have various causes, including a bad release that needs to be adjusted. I need to think about how to distinguish between those causes. Also, let me know if you want to explore alerting options other than Slack. This alerting was already in place; I just ported it to a registry-specific channel so these alerts don't keep getting lost in the noise of #docs-ops.
Reiterating a message from Pulumi's internal Slack:
The providers team has moved away from Slack alerts toward creating issues, since an issue tracks its own resolution. Ideally, registry build failures would create P1 issues (if possible, labeled so that we can filter for only the issues caused by invalid provider builds). We can then surface these issues in our ops dashboards.
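Here is a rough sketch of the issue-based approach. A failed publish could be turned into the fields that GitHub's create-issue REST endpoint (`POST /repos/{owner}/{repo}/issues`) accepts: `title`, `body`, and `labels`. The function name and the label names (`p1`, `registry-publish-failure`, `invalid-provider-build`) are hypothetical placeholders. Authentication and sending the request are left out.

```python
from dataclasses import dataclass, field


@dataclass
class IssuePayload:
    """Fields accepted by GitHub's "create an issue" REST endpoint."""
    title: str
    body: str
    labels: list[str] = field(default_factory=list)


def build_failure_issue(provider: str, version: str, run_url: str, bad_release: bool) -> IssuePayload:
    """Build a P1 issue for a failed registry publish (hypothetical helper)."""
    # Hypothetical label names; the providers team would choose the real ones.
    labels = ["p1", "registry-publish-failure"]
    if bad_release:
        # Extra label so an ops dashboard can filter for invalid provider builds.
        labels.append("invalid-provider-build")
    cause = (
        "the release is bad and must be adjusted"
        if bad_release
        else "the registry failed to publish; please retry"
    )
    return IssuePayload(
        title=f"Registry failed to publish {provider} {version}",
        body=f"Publishing docs for {provider} {version} failed: {cause}.\n\nBuild log: {run_url}",
        labels=labels,
    )
```

For example, `dataclasses.asdict(build_failure_issue("aws", "v6.0.0", run_url, bad_release=True))` yields a dict that could be sent as the JSON body of that request. The extra label is what would let a dashboard show only invalid provider builds.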