Storj-Exporter

Consider limiting systemd service auto-restarts

Open anclrii opened this issue 5 years ago • 5 comments

Consider increasing RestartSec=5s or at least adding a limit like StartLimitBurst=3

anclrii avatar Jan 26 '20 10:01 anclrii

Limiting might make sense to prevent log spam, but you would need to make sure that the limit isn't reached while the node is just offline for updates. A StartLimitBurst of 3 with RestartSec=5s only covers about 15 seconds. A slow node, or a node under heavy load, might need more than that before the new storagenode responds. But if the interval and limit are increased to get a total timeout of e.g. 5 minutes, then I'd say that makes sense. If the node isn't back online after 5 minutes, it probably won't come online.
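To make the arithmetic concrete, here is a rough sketch in Python. It assumes every start attempt fails immediately, so only the restart delay counts towards the total; the helper names are made up for illustration and are not part of systemd or the exporter.

```python
import math


def retry_window_seconds(restart_sec: float, burst: int) -> float:
    """Approximate time covered by `burst` start attempts spaced `restart_sec` apart.

    Assumption: each attempt fails immediately, so only the restart delay counts.
    """
    return restart_sec * burst


def burst_for_window(target_seconds: float, restart_sec: float) -> int:
    """Smallest number of attempts whose combined delay covers `target_seconds`."""
    return math.ceil(target_seconds / restart_sec)


if __name__ == "__main__":
    print(retry_window_seconds(5, 3))    # the 5s delay / 3 attempts case
    print(burst_for_window(5 * 60, 5))   # attempts needed to cover 5 minutes at 5s
    print(burst_for_window(5 * 60, 30))  # same window with a 30s delay
```

In practice a slow start attempt takes time itself, so the real window would be somewhat longer than this estimate.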

However, the question is also what the user expects. Will they be aware that after fixing their storagenode they will have to start the exporter service too? On the other hand, they'll likely remember that when their Grafana dashboard doesn't show any new values.

kevinkk525 avatar Jan 26 '20 11:01 kevinkk525

There are also some other timeout options for service restarts.

We need options that give us something like: try to restart x times every y seconds; if that fails, pause for z seconds; repeat. We can then decide on x, y and z.
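As a sketch of what such a schedule would look like, the Python generator below yields the times at which restart attempts would happen. It is only an illustration of the x/y/z idea under one interpretation (attempts within a round are y seconds apart, and the next round starts z seconds after the last failed attempt); `restart_schedule` is a hypothetical name, not a systemd feature.

```python
from itertools import islice
from typing import Iterator


def restart_schedule(x: int, y: float, z: float) -> Iterator[float]:
    """Yield offsets (in seconds) of restart attempts.

    Each round has x attempts spaced y seconds apart. After the last attempt
    of a round there is a pause of z seconds before the next round begins.
    The generator never ends; the caller decides when to stop.
    """
    t = y  # the first attempt happens after one restart delay
    while True:
        for attempt in range(x):
            yield t
            # Short delay inside a round, long pause after the final attempt.
            t += y if attempt < x - 1 else z


if __name__ == "__main__":
    # First two rounds for x=3, y=5s, z=60s.
    print(list(islice(restart_schedule(3, 5, 60), 6)))
```

Writing the schedule out like this makes it easier to compare candidate values of x, y and z before committing them to the unit file.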

Does it not survive the storagenode being down? I never had to restart it, but I run it in a container, so Docker is probably restarting it for me when updates happen. Would it make sense to make the exporter tolerate the storagenode being unreachable and just let Prometheus polls time out when the API calls time out? Need to check what other exporters do in such a case.
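A minimal sketch of that idea in Python, using only the standard library: the fetch returns None instead of raising when the node cannot be reached, so the caller can skip the scrape rather than crash. This is not the exporter's actual code; `fetch_node_data` and its `opener` parameter are hypothetical names, and the URL is whatever the caller passes in.

```python
import http.client
import json
import urllib.request
from typing import Any, Callable, Optional


def fetch_node_data(
    url: str,
    timeout: float = 5.0,
    opener: Callable[..., Any] = urllib.request.urlopen,
) -> Optional[Any]:
    """Return the decoded JSON body, or None if the node is unreachable.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return json.load(response)
    except (OSError, http.client.HTTPException, ValueError):
        # OSError covers URLError, refused connections and socket timeouts;
        # ValueError covers a body that is not valid JSON.
        return None
```

A caller would check for None and simply report no data for that poll, leaving the process running until the node comes back.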

anclrii avatar Jan 26 '20 12:01 anclrii

I run it in Docker too, and I remember it sometimes restarting when the storagenodes got updated, but if I manually restart a storagenode it does survive without a restart. However, I can't tell whether the exporter containers were restarted because they failed or because they are linked to the storagenode Docker container, which gets replaced during the update process but not during a manual restart.

kevinkk525 avatar Jan 26 '20 15:01 kevinkk525

Started digging through other services and am seeing configurations ranging from no automatic restarts to 30s restarts. Unfortunately there do not seem to be any "best practices", so to speak, for systemd services, so whatever we decide here will probably be good enough.

Definitely not opposed to bringing the time up to 5 minutes for auto restart. If the user catches it at exactly the time that the node updates and if the service fails as a result for some reason, it's not really the end of the world if they check the service and give it a manual restart. For all other cases the service will have restarted before they noticed it.

Overall, I think the fewer additional flags and options we add the better, just to keep any potential troubleshooting down the road to a minimum.

Cmdrd avatar Jan 26 '20 17:01 Cmdrd

With the last release I added some validations, and the exporter should now survive the API connection being unavailable, or specific data points being missing, until they appear. This should also resolve what was discussed here. Has anyone seen more issues with restarts since the update, or can we close this issue?
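For illustration only, this kind of validation could look roughly like the Python sketch below: values are looked up defensively and anything missing is skipped instead of raising. The function names and the field names in the example are made up and do not reflect the exporter's real code or the storagenode API.

```python
from typing import Any, Dict, Iterable, Optional, Sequence


def lookup(payload: Any, path: Sequence[str]) -> Optional[Any]:
    """Walk nested dicts along `path`; return None if any step is missing."""
    current = payload
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def collect_available(payload: Any, paths: Iterable[Sequence[str]]) -> Dict[str, Any]:
    """Return only the data points that are present, keyed by dotted path."""
    found: Dict[str, Any] = {}
    for path in paths:
        value = lookup(payload, path)
        if value is not None:
            found[".".join(path)] = value
    return found


if __name__ == "__main__":
    # Hypothetical payload: one data point present, one not there yet.
    payload = {"diskSpace": {"used": 10}}
    wanted = [("diskSpace", "used"), ("bandwidth", "used")]
    print(collect_available(payload, wanted))
    print(collect_available(None, wanted))  # API unavailable: nothing to report
```

Missing values simply do not show up in the result, so they can start appearing later without a restart.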

anclrii avatar Nov 20 '20 18:11 anclrii