PodRestartErrors are too noisy/sticky

Open maiamcc opened this issue 5 years ago • 3 comments

A user saw yellow "has alert" indicators on a number of pods, didn't understand why, and thought something was seriously wrong. It turned out they were just PodRestartError notifications. But sometimes pods just restart, and it's not a big deal or is even expected.

Given that PodRestarts are sometimes not a big deal, we should consider surfacing this error in a way that makes this clearer to users. Otherwise they spend a lot of time debugging or hunting for errors that aren't there.

@theothertomelliott suggests that PodRestartErrors could expire a certain amount of time after they are first reported.
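
To make the expiration idea concrete, here is a minimal sketch in Python. It is not Tilt's actual implementation: the `Alert` class, the `"BuildError"` type name, the `EXPIRING_TYPES` set, and the 10-minute `DEFAULT_TTL` are all hypothetical names and values chosen for illustration. The only point is that restart alerts would age out on their own while other alert types stay visible.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; not Tilt's real data model."""
    alert_type: str
    resource: str
    first_reported: datetime


# Assumption: only restart alerts expire; everything else stays until dismissed.
EXPIRING_TYPES = {"PodRestartError"}
# Assumption: an arbitrary TTL, picked only for this example.
DEFAULT_TTL = timedelta(minutes=10)


def is_expired(alert, now, ttl=DEFAULT_TTL):
    """True if the alert is of an expiring type and was first reported at least ttl ago."""
    return alert.alert_type in EXPIRING_TYPES and now - alert.first_reported >= ttl


def visible_alerts(alerts, now, ttl=DEFAULT_TTL):
    """Return the alerts that should still be shown at time `now`."""
    return [a for a in alerts if not is_expired(a, now, ttl)]
```

Because the TTL here is counted from the first report, an alert on a pod that keeps restarting would still age out. Whether that is the right behavior is an open question.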

There IS a manual "dismiss" button, but it takes extra effort and requires devs to understand what's going on well enough to know they can dismiss these alerts. Tom notes:

I'm less enthusiastic about manual dismissal of alerts, just because we'll have a lot of different engineers using these Tiltfiles and they won't all be aware of this behavior, so they could spend a lot of time trying to debug.

maiamcc avatar Jun 19 '20 16:06 maiamcc

Ya, I go back and forth on this - there are some services where this warning has saved my butt (e.g., the pod was restarting because some config file was missing), and others where it's expected (e.g., a server just restarts over and over until all its deps are up)

This reminds me of some early brainstorming that @hyu did on Tilt Cloud collaboration UIs (e.g., Tom, as the team admin, could silence all pod restart warnings for everyone on the team for service X, or attach a note that they don't have to worry about it).

I like the "expire after a certain amount of time" idea in theory, but in practice, it might not work well if you're looking at the Tilt UI too infrequently.

It's also possible that the real problem here is that the yellow alert UI is the wrong UI for this event. We should use the blue "actionbar" UI, or some other UI, to indicate a state change in the logs.

@hyu what do you think?

nicks avatar Jun 19 '20 19:06 nicks

In this specific case, the restarts are entirely expected, being exactly the example you mentioned of a server restarting until all dependencies are up (in this case, a sidecar).

Having some control over what triggers a warning would suffice, since this is a less common scenario where restarts are expected pretty much every time.

What other failures might cause a warning on a resource? I've been wondering if just seeing a restart history or some sort of "uptime" bar could aid in understanding.

theothertomelliott avatar Jun 23 '20 01:06 theothertomelliott

Just a note that this came up in discussion today (though you can clear the warnings by clicking 'clear logs').

nicks avatar May 06 '22 20:05 nicks