Report metrics for failed builds and plugin/core plugin executions
Currently, we only report duration metrics for successful build executions (see https://github.com/netlify/build/blob/main/packages/build/src/core/main.js#L538), which can be problematic if we want to capture, measure and alert on failures (as we've seen today in https://github.com/netlify/engineering/issues/350). We should report and capture stage success/failure.
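A minimal sketch of what this could look like, assuming a DogStatsD-style client such as hot-shots (the metric name, tag names and helpers below are hypothetical, not the actual Netlify Build reporting code):

```js
// Hypothetical sketch: time a stage and report its duration for both
// outcomes, with a `status` tag to distinguish success from failure.
const StatsD = require('hot-shots')

const client = new StatsD()

const reportStageDuration = function (stage, durationMs, failed) {
  client.distribution('build.stage.duration', durationMs, 1, [
    `stage:${stage}`,
    `status:${failed ? 'error' : 'success'}`,
  ])
}

const timeStage = async function (stage, runStage) {
  const start = Date.now()
  try {
    const result = await runStage()
    reportStageDuration(stage, Date.now() - start, false)
    return result
  } catch (error) {
    reportStageDuration(stage, Date.now() - start, true)
    throw error
  }
}

module.exports = { timeStage }
```

Since Datadog distributions also expose a count aggregation, tagging the metric this way would give failure counts and rates without a separate mechanism.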
Would the error monitoring in Datadog and Bugsnag be more appropriate to catch those failures? What value would we get from knowing how long failed builds last on average?
Would the error monitoring in Datadog and Bugsnag be more appropriate to catch those failures?
The idea would be to leverage the reporting already in place; we wouldn't need to create a separate metric mechanism for this. Distribution metric reporting means we would get a count metric out of the box, and we would just need to rely on the correct tag to measure success/failure rates.
AFAIK Bugsnag doesn't allow you to alert on "trend" changes? I think this would allow us to more easily spot inconsistencies with our function bundling, for example.
We do have this build duration metric in place out of the box, which would make it easy to implement sending the build duration of failed builds.
However, I am wondering: which type of production problem would this allow us to detect, that we cannot already detect with the current Datadog metrics sent both by Netlify Build and the buildbot?
Going back to the incident at https://github.com/netlify/engineering/issues/350, I am wondering whether there is something that should be fixed instead about how we currently monitor errors during Functions bundling:
- Were the errors correctly labeled as system errors instead of user errors? If they had been, those would have appeared on Bugsnag and we would have been notified on the first build failure.
- Should we have a Datadog chart+alert about the average error rate during Functions bundling? We could do this by re-using the existing build error Datadog metric buildbot.build.error, but fixing the stage tag so it includes more precise "sub-stages" like "Functions bundling".
which type of production problem would this allow us to detect, that we cannot already detect with the current Datadog metrics sent both by Netlify Build and the buildbot?
My end goal is to get error rates for @netlify/build and its particular sub-stages. We might have the ability to get error rates for build, but it's a broad measurement only coming from Buildbot's execution of @netlify/build. We can't measure the Functions bundling error rate, for example, or drill down on correlating that error rate with particular bundlers/frameworks.
Were the errors correctly labeled as system errors instead of user errors? If they had been, those would have appeared on Bugsnag and we would have been notified on the first build failure.
If I understand it correctly, Bugsnag error reporting will still require that our labeling and regex pattern matching is correct. Despite being useful, in my view this wouldn't replace the ability to have a broader view of our error rates and the ability to correlate those same error rates with particular frameworks or tools.
Should we have a Datadog chart+alert about the average error rate during Functions bundling? We could do this by re-using the existing build error Datadog metric buildbot.build.error, but fixing the stage tag so it includes more precise "sub-stages" like "Functions bundling".
I get the feeling that we're talking about the same thing here 😂 because my understanding is, if we opt for using that metric, this will still need to live in @netlify/build, as we have no way to drill down on those same stages that live under @netlify/build.
Should we have a Datadog chart+alert about the average error rate during Functions bundling? We could do this by re-using the existing build error Datadog metric buildbot.build.error, but fixing the stage tag so it includes more precise "sub-stages" like "Functions bundling".
Yes, I am realizing that we are talking about the same thing as well! :)
I.e. getting error rates for particular sub-stages of @netlify/build.
From that angle, I understand now what you meant by re-using the build duration metrics, as a pragmatic way to re-use existing logic. However, it still feels odd to use a metric meant to measure duration to also measure error rate. It seems to me those are different concerns which should be split into different metrics. For example, some duration metrics are currently being computed from others under the assumption that all stages have run.
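For illustration, splitting the two concerns could look roughly like this (hypothetical metric names and helpers, again assuming a DogStatsD client like hot-shots):

```js
// Hypothetical sketch: keep duration and error rate as separate metrics
// instead of overloading the duration distribution with a status tag.
const StatsD = require('hot-shots')

const client = new StatsD()

// Durations keep their current meaning: only reported for stages that ran.
const reportStageDuration = function (stage, durationMs) {
  client.distribution('build.stage.duration', durationMs, 1, [`stage:${stage}`])
}

// Failures get their own counter, so error rates don't depend on
// duration-metric assumptions (e.g. that all stages have run).
const reportStageError = function (stage) {
  client.increment('build.stage.error', 1, 1, [`stage:${stage}`])
}

module.exports = { reportStageDuration, reportStageError }
```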
Where I was coming from with using the current buildbot metric for build failure is that it already exists. It also already has a stage tag, although not as granular as we need. Another advantage of using that metric is that it also comes with other tags (cause, framework and js-workspaces).
However, if it is hard to communicate the sub-stage from @netlify/build to the buildbot, your initial solution sounds good :+1:
It seems to me those are different concerns which should be split into different metrics.
Where I was coming from with using the current buildbot metric for build failure is that it already exists.
Agree with all of it 👍 Yeah, when I mentioned re-using the duration metric it was only from a pragmatic PoV, but I'm more than happy to use another, more adequate metric. I'll edit the issue.
Alright, I've taken a look at what we have available in terms of metrics that we could reuse. We can maybe reuse buildbot.build.error and buildbot.build.success, but I guess we would need a decent tagging strategy like @ehmicky suggested. I would be curious to hear your thoughts about this, @vbrown608, given we would be sharing the metric with buildbot. The idea would be to have something similar to what we have for buildbot.build.stage.duration.
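As a rough sketch (reusing the existing buildbot metric names, but with a hypothetical helper and made-up tag values; the real reporting lives in the buildbot):

```js
// Hypothetical sketch: report the build outcome on the existing buildbot
// counters, sharing a tagging strategy similar to buildbot.build.stage.duration.
const StatsD = require('hot-shots')

const client = new StatsD()

const reportBuildOutcome = function ({ succeeded, stage, cause, framework, jsWorkspaces }) {
  const metric = succeeded ? 'buildbot.build.success' : 'buildbot.build.error'
  client.increment(metric, 1, 1, [
    `stage:${stage}`, // e.g. 'functions_bundling' (made-up sub-stage value)
    `cause:${cause}`,
    `framework:${framework}`,
    `js-workspaces:${jsWorkspaces}`,
  ])
}

module.exports = { reportBuildOutcome }
```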
The key here would be differentiating between system errors and user errors. If we add a metric and it doesn't offer that breakdown, it would be difficult to know whether or not we have an issue and need to act. A dimension I would see for the tags would be being able to make that distinction. Having to look at this metric for every deploy and make a judgement call is not ideal either; we should trust it will fire if there is a problem.
@JGAntunes can you clarify a bit how that would look in the context of the incident / function bundling?
AFAIK we already do some error identification in our function bundling which allows us to distinguish between a couple of different error scenarios - https://github.com/netlify/build/blob/main/packages/build/src/plugins_core/functions/error.js#L12-L24. The idea would be to have an extra tag (something like error_type) that we could use to convey this info. @netlify/build already has a pretty detailed set of error types - https://github.com/netlify/build/blob/main/packages/build/src/error/type.js - including different reasons for plugin failures.
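For example, something along these lines (the error_type tag, fallback and helper are illustrative only; the real error types live in src/error/type.js):

```js
// Hypothetical sketch: convey the normalized error type as an extra tag
// when reporting a Functions bundling failure.
const StatsD = require('hot-shots')

const client = new StatsD()

const reportFunctionsBundlingError = function (error) {
  // `error.type` stands in for whatever @netlify/build's error normalization
  // exposes; 'unknown' is a made-up fallback.
  const errorType = (error && error.type) || 'unknown'
  client.increment('buildbot.build.error', 1, 1, [
    'stage:functions_bundling',
    `error_type:${errorType}`,
  ])
}

module.exports = { reportFunctionsBundlingError }
```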
That being said, I'm unsure how easy it will be to have such a clear breakdown for function bundling specifically, as the nature of it makes it particularly hard to distinguish between "user" and "system" errors. CC @eduardoboucas as he might have a bit more insight into this. IMO though, this will be something that we'll need to refine as we go.
That being said, I'm unsure how easy it will be to have such a clear breakdown for function bundling specifically, as the nature of it makes it particularly hard to distinguish between "user" and "system" errors. CC @eduardoboucas as he might have a bit more insight into this. IMO though, this will be something that we'll need to refine as we go.
The current behaviour, which is defined in the link you shared, is to look for specific errors (e.g. missing dependencies or invalid package.json) and flag them as user errors, treating everything else as system errors. I think this behaviour makes sense in principle, but we might need to revise it as we go like you said, because we might have false positives (i.e. user errors being flagged as system errors).
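A minimal sketch of that pattern (the patterns and helper below are made up and do not reproduce functions/error.js):

```js
// Hypothetical sketch of the "known user errors, everything else is a
// system error" classification described above.
const USER_ERROR_PATTERNS = [
  /Cannot find module/, // e.g. a missing dependency
  /Unexpected token .* in JSON/, // e.g. an invalid package.json
]

const getErrorSeverity = function (error) {
  const message = error instanceof Error ? error.message : String(error)
  return USER_ERROR_PATTERNS.some((pattern) => pattern.test(message))
    ? 'user'
    : 'system'
}

module.exports = { getErrorSeverity }
```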
This issue has been automatically marked as stale because it has not had activity in 1 year. It will be closed in 14 days if no further activity occurs. Thanks!
This issue was closed because it had no activity for over 1 year.