Need advice on the way to achieve service monitoring using a daemon

Open gautamp8 opened this issue 4 years ago • 3 comments

Question

I want to monitor metrics coming from a Service object that my handler has created. Based on certain logic, like a metric crossing a certain threshold, I want to update the number of replicas of a deployment object (another resource created by the operator itself).

I read about Daemons in the documentation. I just want to know whether using a daemon to make an API call to the service at regular intervals to fetch that metric and patch the deployment accordingly is the right approach.

This approach looked right based on this excerpt of the documentation: What to use when?

FWIW, I'm trying to build a POC of a Celery operator using Kopf. I want to use the Flower service my operator has created to monitor the broker queue length and autoscale the worker deployments accordingly. I'm a beginner in the operator world right now, so any advice on specifics/gotchas I should take care of in this particular scenario would be appreciated.

Let me know if this question is not informative enough or needs changes to the title/description to make it helpful to others.

Checklist

  • [x] I have read the documentation and searched there for the problem
  • [x] I have searched in the GitHub Issues for similar questions

Keywords

monitoring, service, daemon examples, daemon use-case

gautamp8 avatar Jun 26 '20 18:06 gautamp8

Quite interesting topic. Want to hear feedback as well.

eshepelyuk avatar Jun 26 '20 19:06 eshepelyuk

@brainbreaker Sorry for the slightly late response (almost a week).

Yes, using either daemons or timers is the right way to do this task. They were designed exactly for this purpose: monitoring something "outside" of the operator.

Depending on whether the remote system can do "blocking" long-running requests (like K8s's own watching), daemons could be the choice. If it cannot — which I expect to be the case for the majority of systems — timers are better; they save a little bit of RAM for you.

Timers are also better for another reason. When a remote system provides some metrics, you can put them into the resource's status via the patch kwarg (patch.status['smthng'] = value), and it will be applied to the resource once the timer function finishes — on every invocation of the timer. For daemons, you have to apply the patch yourself (because daemons never exit, kind of). Not a big deal, but it can be convenient.

There is one aspect you should work out in advance: error handling. If updating the deployment fails, what should follow? In theory, timers are retried with backoff=60 (seconds; configurable; not the same as interval=…), so everything might be okay and as expected. But maybe you do not want to poll the remote system too often on every retry, so you would want to store the retrieved metric on the resource in a field .status.m, and modify the attached deployment separately in on-update/on-field(field=status.m) handlers. But these are low-level details, actually, and over-complication based on assumptions.

Let me know if it works for you. If there are any confusing moments in using timers/daemons for this exact task, they'd better be fixed & clarified in the framework.

nolar avatar Jul 02 '20 19:07 nolar

@nolar Thank you so much for the detailed response.

Yes, using either daemons or timers is the right way to do this task. They were designed exactly for this purpose: monitoring something "outside" of the operator.

Got it.

Depending on whether the remote system can do "blocking" long-running requests (like K8s's own watching), daemons could be the choice. If it cannot — which I expect to be the case for the majority of systems — timers are better; they save a little bit of RAM for you.

Makes sense. Thanks for mentioning this.

Timers are also better for another reason. When a remote system provides some metrics, you can put them into the resource's status via the patch kwarg (patch.status['smthng'] = value), and it will be applied to the resource once the timer function finishes — on every invocation of the timer. For daemons, you have to apply the patch yourself (because daemons never exit, kind of). Not a big deal, but it can be convenient.

There is one aspect you should work out in advance: error handling. If updating the deployment fails, what should follow? In theory, timers are retried with backoff=60 (seconds; configurable; not the same as interval=…), so everything might be okay and as expected. But maybe you do not want to poll the remote system too often on every retry, so you would want to store the retrieved metric on the resource in a field .status.m, and modify the attached deployment separately in on-update/on-field(field=status.m) handlers. But these are low-level details, actually, and over-complication based on assumptions.

These are actually good insights that I couldn't think of on the first go. Thank you for sharing them. I'm going to try out timers for this task and report back anything I'm confused about or need help with. I'd be willing to improve the documentation too if needed.

gautamp8 avatar Jul 04 '20 05:07 gautamp8