prometheus-boshrelease
Better stemcell alerts
I was talking to @LinuxBozo about the bosh outdated stemcell alerts that I contributed here, and he pointed out it's easy to miss outdated stemcells. If the prometheus deploy that bumps the expected version fails, or gets canceled, or if an operator pauses the concourse job that deploys it, etc, we won't notice if stemcells are out of date.
I'm wondering if we can do better by adding a tiny exporter that emits the current stemcell version for a particular stemcell series. Two quick proposals:
First approach
The stemcell exporter queries http://bosh.io/api/v1/stemcells/ and emits one metric per version, with 1 for the current version and 0 for the rest:
bosh_stemcell_info{bosh_stemcell_version="3312.29"} 1
bosh_stemcell_info{bosh_stemcell_version="3312.28"} 0
bosh_stemcell_info{bosh_stemcell_version="3312.27"} 0
...
Then we can write a query like this:
bosh_deployment_stemcell_info * on(bosh_stemcell_version) group_left bosh_stemcell_info
This gives a simple prometheus query, but could potentially miss stemcells that are from the wrong series entirely. I don't know if that's realistic, but we can handle that using a different approach:
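To make the first approach concrete, here is a minimal sketch of how such an exporter might render its output once it has a list of versions. Fetching from bosh.io is left out, and the function names `version_key` and `render_stemcell_info` are hypothetical; only the metric and label names come from the proposal above.

```python
def version_key(version):
    # Compare versions numerically: "3312.9" must sort before "3312.29".
    return tuple(int(part) for part in version.split("."))


def render_stemcell_info(versions):
    """Render one bosh_stemcell_info sample per version; only the newest gets 1."""
    if not versions:
        return []
    ordered = sorted(versions, key=version_key, reverse=True)
    latest = ordered[0]
    lines = []
    for version in ordered:
        value = 1 if version == latest else 0
        lines.append(
            'bosh_stemcell_info{bosh_stemcell_version="%s"} %d' % (version, value)
        )
    return lines


if __name__ == "__main__":
    # Hypothetical input; a real exporter would read this from the bosh.io API.
    for line in render_stemcell_info(["3312.27", "3312.29", "3312.28"]):
        print(line)
```

Sorting on the parsed integers rather than the raw strings matters here, since a plain string sort would rank 3312.9 above 3312.29.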
Second approach
The stemcell exporter emits a single metric, for the expected stemcell:
bosh_stemcell_info{bosh_stemcell_version="3312.29"} 1
Then we can find outdated stemcells by listing all deployments, then subtracting deployments with the expected version:
bosh_deployment_stemcell_info unless (bosh_deployment_stemcell_info * on(bosh_stemcell_version) group_left bosh_stemcell_info)
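As a rough illustration of what that `unless` query computes, the sketch below does the same set subtraction over plain Python dicts. The deployment names and the `bosh_deployment` label are assumptions made up for the example, and `outdated_deployments` is a hypothetical helper, not part of any exporter.

```python
# Hypothetical label sets standing in for bosh_deployment_stemcell_info series.
DEPLOYMENTS = [
    {"bosh_deployment": "cf", "bosh_stemcell_version": "3312.29"},
    {"bosh_deployment": "concourse", "bosh_stemcell_version": "3312.28"},
    {"bosh_deployment": "legacy", "bosh_stemcell_version": "3263.8"},
]


def outdated_deployments(deployments, expected_version):
    """Mimic `A unless (A * on(bosh_stemcell_version) group_left B)`."""
    # The inner join keeps only deployments whose version matches the
    # single expected-version series.
    matched = [
        d for d in deployments if d["bosh_stemcell_version"] == expected_version
    ]
    # `unless` then drops those matched series from the full list.
    return [d for d in deployments if d not in matched]


if __name__ == "__main__":
    for d in outdated_deployments(DEPLOYMENTS, "3312.29"):
        print(d["bosh_deployment"], d["bosh_stemcell_version"])
```

Because anything that fails to match the expected version is kept, a deployment on a different series (the made-up `legacy` entry) is reported too, which is the case the first approach could miss.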
I'm still new to prometheus, so maybe there's a simpler approach. WDYT?
cc @cnelson
There's a third approach (or complementary to the above). Instead of returning 0 or 1 we can return when the stemcell was released (in unix timestamp format). From here we can match each deployment stemcell with the release date and emit an alert if it's older than x sec/mins/days (time() - (bosh_deployment_stemcell_info * on(bosh_stemcell_version) group_left bosh_stemcell_info) > x).
w/r/t the stemcell series, one thing we can do is to emit also labels for the major and minor versions, both at the bosh_io_exporter and the bosh_exporter. This will allow us to match the exact stemcell series and alert on minor versions.
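A small sketch of what splitting the version into series labels could look like, assuming versions have the form major.minor. The label names `bosh_stemcell_major` and `bosh_stemcell_minor` and both helper functions are hypothetical; neither exporter is known to emit them.

```python
def series_labels(version):
    """Split "3312.29" into hypothetical major/minor label values."""
    major, _, minor = version.partition(".")
    return {
        "bosh_stemcell_version": version,
        "bosh_stemcell_major": major,
        "bosh_stemcell_minor": minor,
    }


def latest_per_series(versions):
    """Return the newest version seen for each major series."""
    latest = {}
    for version in versions:
        labels = series_labels(version)
        major = labels["bosh_stemcell_major"]
        # A version without a minor part is treated as minor 0 here.
        minor = int(labels["bosh_stemcell_minor"] or 0)
        if major not in latest or minor > latest[major][0]:
            latest[major] = (minor, version)
    return {major: version for major, (_, version) in latest.items()}


if __name__ == "__main__":
    print(latest_per_series(["3312.27", "3312.29", "3263.8", "3263.10"]))
```

With a major label on both sides, a query could join on the series first and then compare minor versions within it.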
Oh, and we can do the same for releases :stuck_out_tongue:.
I don't think I care how old a stemcell is in time--I only care whether it's the latest version. If no CVEs get patched over a few weeks it's fine that stemcells are a few weeks old, but if there are a few stemcell updates over a week, allowing stemcells to be a few weeks old would be bad.
I'll think about this more and hopefully write a quick bosh-io-exporter over the weekend.
Let me rephrase it. There are customers that don't live on the edge, and they update stemcells e.g. once a week. Having an alert that notifies them they don't have the latest stemcell the day after the stemcell has been released doesn't make any sense to them; they know the risk and accept it. Instead, having an alert that notifies them that the stemcells they're using are outdated if the latest one was released more than 1 week ago makes sense, which is why adding the timestamp makes sense. Also, I find it more intuitive to use a timestamp as the value rather than 0 or 1.
A few things:
- It doesn't look like the bosh-hub api currently includes release timestamps, although that would be a quick patch to https://github.com/cppforlife/bosh-hub/blob/a745e1693e553b5a5a2b07cf76dfb82be8344a93/ui/stemcell/stemcell.go#L224-L234
- I think I'm still misunderstanding how you're suggesting the alert would work. As far as I can tell, stemcell releases don't have a regular cadence but happen when cves get patched. Let's say a new stemcell doesn't come out for two weeks, and I configure my alerts to go off after a week. It sounds like I would start getting spurious alerts after a week, which isn't useful--I never want to get alerted if I'm using the current stemcell. Do you mean that we would alert if the stemcell is older than some interval unless it's the latest in the series?
Thanks / sorry if I'm misunderstanding!
No, what I mean is "Alert only if a deployment is not using the latest stemcell in the series and it has been more than x time since it was released". What I want to avoid is emitting alerts if I'm not using the latest one and that one was released yesterday (because I upgrade stemcells once a week). I guess this is the same as using the "FOR" clause, but done in the alert rule itself (to avoid having a pending alert during the FOR time, which can be pretty long).
When you say...
"Alert only if a deployment is not using the latest stemcell in the series and it has been more than x time since it was released"
...do you mean time since the outdated stemcell was released, or time since the current stemcell was released? Based on the situation you're describing, I'm guessing you mean time since current stemcell, since you're trying to avoid annoying alerts right after a stemcell comes out.
It sounds like we would have metrics like this:
bosh_stemcell_info{bosh_stemcell_version="3312.29",current="true"} 1499047468
bosh_stemcell_info{bosh_stemcell_version="3312.28",current="false"} 1499047468
bosh_stemcell_info{bosh_stemcell_version="3312.27",current="false"} 1499047468
...and a query like this:
time() - (bosh_deployment_stemcell_info * on(bosh_stemcell_version) group_left bosh_stemcell_info{current="false"}) > x
Is that what you had in mind?
Yep, something like that will work.
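For anyone following along, here is a minimal sketch that evaluates the query above by hand over in-memory samples. It assumes bosh_deployment_stemcell_info samples have the value 1 and treats the bosh_stemcell_info value as an opaque Unix timestamp; the deployment names, the `bosh_deployment` label, and the `stale_deployments` helper are hypothetical.

```python
DAY = 86400

# Hypothetical samples; "value" is the sample value of each series.
STEMCELL_INFO = [
    {"bosh_stemcell_version": "3312.29", "current": "true", "value": 1499047468},
    {"bosh_stemcell_version": "3312.28", "current": "false", "value": 1499047468},
]
DEPLOYMENTS = [
    {"bosh_deployment": "cf", "bosh_stemcell_version": "3312.29", "value": 1},
    {"bosh_deployment": "concourse", "bosh_stemcell_version": "3312.28", "value": 1},
]


def stale_deployments(deployments, stemcell_info, now, max_age_seconds):
    """Evaluate time() - (deployments * on(version) group_left info{current="false"}) > x."""
    # Selector: keep only stemcell series labelled current="false".
    non_current = {
        s["bosh_stemcell_version"]: s["value"]
        for s in stemcell_info
        if s["current"] == "false"
    }
    result = []
    for d in deployments:
        timestamp = non_current.get(d["bosh_stemcell_version"])
        if timestamp is None:
            continue  # no match on the join label, so the series is dropped
        age = now - d["value"] * timestamp
        if age > max_age_seconds:
            result.append((d["bosh_deployment"], age))
    return result


if __name__ == "__main__":
    now = 1499047468 + 8 * DAY
    print(stale_deployments(DEPLOYMENTS, STEMCELL_INFO, now, 7 * DAY))
```

Deployments on the current stemcell never match the `current="false"` selector, so they cannot trigger the alert no matter how old the stemcell gets.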
This issue is stale because it has been open 60 days with no activity. Comment or this will be closed in 5 days.
This issue was automatically closed because it has been stalled for 5 days with no activity.