
Adapt monitoring of daemons to new Service Monitoring Feature in 10.14

mbay-ODW opened this issue

Is your feature improvement request related to a problem? Please describe.

In 10.14 the possibility to monitor the current status of services was introduced.


These services can also contain measurements, events and alarms. Currently, the health monitoring of the tedge daemons is only local and not connected to those services.


Additional context

The service is modeled as a child addition on the device. Measurements/alarms/events are sent to that child addition. Link

There also exists SmartREST template 102 for the service status.

mbay-ODW avatar Aug 19 '22 10:08 mbay-ODW

great suggestion, we will review and decide when to work on this

andrej-schreiner avatar Aug 25 '22 07:08 andrej-schreiner

Created one example with python here:

https://github.com/thin-edge/thin-edge.io_examples/tree/main/watchdog-service-mapper

mbay-ODW avatar Sep 09 '22 13:09 mbay-ODW

Thank you! This will help to understand the new feature and evaluate what needs to be added on thin-edge.

didier-wenzek avatar Sep 09 '22 14:09 didier-wenzek

@mbay-ODW Thanks for your PoC. It worked for me. But I have a couple of questions

  1. You have used systemd for this; we can do it without that as well, right?
  2. The watchdog messages displayed on the c8y cloud are not just the most recent ones but also the old ones; is this expected?
  3. The most recent watchdog message is displayed with a green tick, but the old statuses are shown with a red tick; what exactly does that mean, any idea?

PradeepKiruvale avatar Jan 11 '23 11:01 PradeepKiruvale

I worked with @reubenmiller to understand this feature. The historical display of statuses happens because the pid is sent with the status message; it is used as the external id for the service (the managed object in the digital twin in c8y), as in client.publish("c8y/s/us",f'102,{pid},{service},{name},{status}'). The service name should always be unique. So, the pid has to be replaced with device_name_service_name. This gives a unique name for a given device and stays the same even if the service restarts and gets a new pid for whatever reason.

Also, the service name must be unique across devices, so it's better to use the device_name_service_name naming scheme while sending the status to the c8y cloud.

PradeepKiruvale avatar Jan 16 '23 16:01 PradeepKiruvale

There is an error in the documentation where the order of the SmartREST template fields was mixed up. I will update the service according to your suggestions.

mbay-ODW avatar Jan 17 '23 09:01 mbay-ODW

As part of this task, the proposed solution here is to implement a separate monitor service that checks the health of the thin-edge services and reports the status on behalf of these services to the c8y cloud. The frequency at which the status needs to be sent must be configurable and can also be based on other constraints like state changes, etc.

PradeepKiruvale avatar Jan 17 '23 09:01 PradeepKiruvale

Yes, I agree. Ideally via top or collectd. I will check that and make a proposal.

mbay-ODW avatar Jan 17 '23 09:01 mbay-ODW

A separate monitor service that checks the health of the thin-edge services and reports the status on behalf of these services to the c8y cloud.

Why yet another service? All the tedge daemons already feature a health check. This can be leveraged by the c8y mapper to send service status to c8y. A specific last will message might be required to notify that the c8y mapper itself is down.

didier-wenzek avatar Jan 24 '23 08:01 didier-wenzek

A separate monitor service that checks the health of the thin-edge services and reports the status on behalf of these services to the c8y cloud.

Why yet another service? All the tedge daemons already feature a health check. This can be leveraged by the c8y mapper to send service status to c8y. A specific last will message might be required to notify that the c8y mapper itself is down.

Yes, that can be done. There are two reasons for proposing a separate service for the monitoring itself.

  • This is to be aligned with our single-plugin-per-operation approach. I feel the monitoring service is an operation as well, right?
  • It's always better to have a separate service/daemon that watches the other services. As I understand it, this is a generic watchdog design principle.

PradeepKiruvale avatar Jan 24 '23 09:01 PradeepKiruvale

A separate monitor service that checks the health of the thin-edge services and reports the status on behalf of these services to the c8y cloud.

Why yet another service? All the tedge daemons already feature a health check. This can be leveraged by the c8y mapper to send service status to c8y. A specific last will message might be required to notify that the c8y mapper itself is down.

Yes, that can be done. There are two reasons for proposing a separate service for the monitoring itself.

  • This is to be aligned with our single-plugin-per-operation approach. I feel the monitoring service is an operation as well, right?
  • It's always better to have a separate service/daemon that watches the other services. As I understand it, this is a generic watchdog design principle.

I'm not against the MQTT last will being used if it can contain all of the relevant information that is required in the service down message.

If it is not implemented using the MQTT last will then the watchdog service feels like a good fit rather than creating a new service. The only drawback is that the watchdog service is heavily coupled with systemd so it can't be used for non-systemd environments...so that would be another benefit of using the MQTT last will in each service.

reubenmiller avatar Jan 24 '23 09:01 reubenmiller

A separate monitor service that checks the health of the thin-edge services and reports the status on behalf of these services to the c8y cloud.

Why yet another service? All the tedge daemons already feature a health check. This can be leveraged by the c8y mapper to send service status to c8y. A specific last will message might be required to notify that the c8y mapper itself is down.

Yes, that can be done. There are two reasons for proposing a separate service for the monitoring itself.

  • This is to be aligned with our single-plugin-per-operation approach. I feel the monitoring service is an operation as well, right?
  • It's always better to have a separate service/daemon that watches the other services. As I understand it, this is a generic watchdog design principle.

I'm not against the MQTT last will being used if it can contain all of the relevant information that is required in the service down message.

If it is not implemented using the MQTT last will then the watchdog service feels like a good fit rather than creating a new service. The only drawback is that the watchdog service is heavily coupled with systemd so it can't be used for non-systemd environments...so that would be another benefit of using the MQTT last will in each service.

The health-check mechanism can work without systemd. It just needs to send a request on the c8y/health-check topic whenever it wants to know the status of a tedge service (daemon) and get back a reply from the specific tedge daemon (irrespective of how it was started) on a specified health-check response topic.

PradeepKiruvale avatar Jan 24 '23 09:01 PradeepKiruvale

This is to be aligned with our single-plugin-per-operation approach. I feel the monitoring service is an operation as well, right?

Mid-term we will reconsider the "a plugin per operation" approach. This has been done so far to ease development and remove conflicts on the mapper, but it introduces too many issues related to packaging / reliability / resource consumption.

It's always better to have a separate service/daemon that watches the other services. As I understand it, this is a generic watchdog design principle.

It's a good principle, indeed. What I propose is to use the c8y mapper to monitor all the other tedge daemons. One needs a different mechanism to monitor the mapper itself. I propose to use last will messages for that: i.e. let the MQTT broker monitor the c8y mapper.

I'm not against the MQTT last will being used if it can contain all of the relevant information that is required in the service down message.

The last will message would be used only for the c8y mapper itself. The regular thin-edge health check mechanism can be used for all the other thin-edge daemons.

If it is not implemented using the MQTT last will then the watchdog service feels like a good fit rather than creating a new service. The only drawback is that the watchdog service is heavily coupled with systemd so it can't be used for non-systemd environments...so that would be another benefit of using the MQTT last will in each service.

I would really avoid tying this feature to systemd.

The health-check mechanism can work without systemd. It just needs to send a request on the c8y/health-check topic whenever it wants to know the status of a tedge service (daemon) and get back a reply from the specific tedge daemon (irrespective of how it was started) on a specified health-check response topic.

I don't understand the data flow you are describing. Which component triggers the check?

didier-wenzek avatar Jan 24 '23 09:01 didier-wenzek

This is to be aligned with our single-plugin-per-operation approach. I feel the monitoring service is an operation as well, right?

Mid-term we will reconsider the "a plugin per operation" approach. This has been done so far to ease development and remove conflicts on the mapper, but it introduces too many issues related to packaging / reliability / resource consumption.

It's always better to have a separate service/daemon that watches the other services. As I understand it, this is a generic watchdog design principle.

It's a good principle, indeed. What I propose is to use the c8y mapper to monitor all the other tedge daemons. One needs a different mechanism to monitor the mapper itself. I propose to use last will messages for that: i.e. let the MQTT broker monitor the c8y mapper.

I'm not against the MQTT last will being used if it can contain all of the relevant information that is required in the service down message.

The last will message would be used only for the c8y mapper itself. The regular thin-edge health check mechanism can be used for all the other thin-edge daemons.

If it is not implemented using the MQTT last will then the watchdog service feels like a good fit rather than creating a new service. The only drawback is that the watchdog service is heavily coupled with systemd so it can't be used for non-systemd environments...so that would be another benefit of using the MQTT last will in each service.

I would really avoid tying this feature to systemd.

The health-check mechanism can work without systemd. It just needs to send a request on the c8y/health-check topic whenever it wants to know the status of a tedge service (daemon) and get back a reply from the specific tedge daemon (irrespective of how it was started) on a specified health-check response topic.

I don't understand the data flow you are describing. Which component triggers the check?

The trigger could come from the tedge-mapper-c8y if we use it to monitor the tedge services. It would regularly (the frequency can be made configurable) check the health of the services and then push the status to the cloud.

PradeepKiruvale avatar Jan 24 '23 09:01 PradeepKiruvale

I changed the example to:

client.publish("c8y/s/us",f'102,{device_id}_{name},{service},{name},{status}')

The device_id is extracted via:

"tedge config get device.id"

Thus, when the service restarts, it will not create a new service in the cloud.
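
A minimal end-to-end sketch of that flow (an illustration only: it assumes the paho-mqtt Python client and a local tedge installation; the service name, type and status values are made up):

# Sketch only: not the official example, just an illustration of the publish flow above.
import subprocess
import paho.mqtt.client as mqtt

# Read the device id once; it is stable across service restarts.
device_id = subprocess.check_output(
    ["tedge", "config", "get", "device.id"], text=True
).strip()

name = "watchdog-service-mapper"   # illustrative service name
service = "service"                # illustrative service type
status = "up"                      # "up" or "down"

client = mqtt.Client()
client.connect("localhost", 1883)  # local broker bridged to Cumulocity

# SmartREST 102: <external id>,<service type>,<display name>,<status>
client.publish("c8y/s/us", f"102,{device_id}_{name},{service},{name},{status}")
client.disconnect()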

mbay-ODW avatar Jan 24 '23 10:01 mbay-ODW

Below are my proposals to implement the feature:

Proposal 1: Using tedge-mapper-c8y to update the status of the services

Use the tedge-mapper-c8y to monitor the services locally and push/forward the status of these services to the c8y cloud. Each service will update its status on the tedge/health/<service-name> topic; the mapper takes this status, converts it to a service monitoring message and forwards it to c8y.

If a tedge service is up and running it will publish its status as up, and if it goes down it will publish down on the specific MQTT topic; this will be converted to a SmartREST status message and forwarded to c8y.

If the tedge-mapper-c8y itself goes down, its status still has to be updated on c8y; for this, a last will message (LWM) has to be registered on the c8y/s/us topic with a SmartREST template 102 message, so that the broker forwards this LWM when the tedge-mapper-c8y service goes down.

Note: Here the tedge-mapper-c8y has to register two LWMs: one for sending out the down health status on the tedge/health/tedge-mapper-c8y topic and one to send the status to c8y on c8y/s/us.

Pros:

  • A single service will be pushing out the service statuses to the c8y cloud. It will act as a watchdog for the other services.
  • Only one LWM needs to be registered, to send the mapper's own status when it goes down.

Cons:

  • When it goes down it cannot update the status of the other services to the c8y cloud.
  • It has to process the health status messages from the other services and forward them to the c8y cloud; this is an extra burden for the mapper, which is already doing many things.

Proposal 2: Using individual service itself to send the status to the c8y cloud

Here each thin-edge c8y service will send its status by itself to the cloud. As it publishes the status on the tedge/health/<service-name> topic, it can also send the status to c8y on c8y/s/us with SmartREST template 102.

When going down, these services will send out the down health status on the tedge/health/<service-name> topic. In a similar way, an LWM has to be registered on the c8y/s/us topic to send the down status to the c8y cloud when the service goes down.

Pros: Each service can send its status to the c8y cloud by itself, so it does not depend on another service.

Cons: Every service has to register an extra LWM for sending the status message to the c8y cloud.
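
For illustration, here is a minimal sketch of the translation step from Proposal 1 (the topics and the SmartREST 102 layout follow this thread; the code itself is only an assumption of how such a mapping could look, not the actual mapper implementation, and it uses the paho-mqtt Python client):

# Sketch only: forward tedge/health/<service-name> messages to c8y as SmartREST 102.
import json
import paho.mqtt.client as mqtt

DEVICE_ID = "my-device"  # illustrative; normally read via `tedge config get device.id`

def on_message(client, userdata, msg):
    # Health messages arrive on tedge/health/<service-name>
    service_name = msg.topic.split("/")[-1]
    health = json.loads(msg.payload)
    status = health.get("status", "down")
    service_type = health.get("type", "service")
    # SmartREST 102: <external id>,<service type>,<display name>,<status>
    client.publish(
        "c8y/s/us",
        f"102,{DEVICE_ID}_{service_name},{service_type},{service_name},{status}",
    )

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe("tedge/health/+")
client.loop_forever()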

PradeepKiruvale avatar Feb 01 '23 09:02 PradeepKiruvale

Actually I would propose a mix of the two proposals. Also, the official term is "Last Will and Testament", so the abbreviation is "LWT" message (reference).

Proposal 3: Use an LWT message for each service, but the mapper still does the local message to cloud translation

  • Each service should publish on the MQTT topic tedge/health/<service-name>, and register an LWT message (so that when the service's MQTT client gets disconnected the service "down" signal will be sent; a sketch of this registration follows below). This is advised because you can't trust the service to send the message when it is going down (e.g. a SIGKILL will immediately kill a process without warning).
  • The cloud mapper (e.g. tedge-mapper-c8y) will subscribe to the health messages and translate them to the appropriate cloud provider format. This ensures that each service does not need to know which cloud provider is connected.
  • The cloud mapper registers its own cloud-specific LWT message (which goes via the mosquitto bridge).

Pros:

  • The LWT message is sent even if the service's process is no longer running (as long as it did not die before registering the LWT message). It does not rely on the service to do the right thing.
  • Only the cloud mapper needs to know cloud-specific formats (which it already knows because it is the mapper ;))

Cons:

  • If the cloud mapper service goes down, only its own LWT "service down" message will be sent to the cloud (not those of all the services it is responsible for mapping). But this is ok, as it is reasonable that if a central service goes down, you can't expect all of the other services to be fully functional.
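
A rough sketch of the LWT registration described in the first bullet (assuming the paho-mqtt Python client; the topic layout and payload fields are the ones discussed in this thread, the rest is illustrative):

# Sketch only: a service registers its Last Will before connecting.
import paho.mqtt.client as mqtt

service_name = "tedge-agent"  # illustrative

client = mqtt.Client()
# If this client disconnects ungracefully (e.g. the process is SIGKILLed),
# the broker publishes this "down" status on the service's health topic.
client.will_set(
    f"tedge/health/{service_name}",
    payload='{"status": "down"}',
    qos=1,
    retain=True,
)
client.connect("localhost", 1883)
# Announce "up" once connected; the cloud mapper translates it for the cloud.
client.publish(f"tedge/health/{service_name}", '{"status": "up"}', qos=1, retain=True)
client.loop_forever()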

reubenmiller avatar Feb 01 '23 16:02 reubenmiller

I think proposal 3 is a rephrasing of proposal 1. Indeed, the services are already sending last will messages to tedge/health/xxx. What is not clear: is C8Y expecting regular health messages? Are an initial up message and a terminal down message enough?

Proposal 2 is a no-go as all services would then have a dependency on c8y.

;-) the MQTT specifications simply use the term "Will Message"

didier-wenzek avatar Feb 01 '23 21:02 didier-wenzek

The frequency at which the status needs to be updated may be left to the user. By default maybe we can update only when the status changes from up to down or vice versa.

In proposal 2, except for tedge-agent, all the other services are specific to c8y, right? So, I feel we can still consider this one.

PradeepKiruvale avatar Feb 02 '23 07:02 PradeepKiruvale

In proposal 2, except for tedge-agent, all the other services are specific to c8y, right? So, I feel we can still consider this one.

I disagree. There are several reasons against the idea to have each service sending its own status to Cumulocity:

  1. If a service is unrelated to Cumulocity then it must have no dependencies on it. The fact that currently only a single service is truly independent changes nothing.
  2. What if support for another monitoring system is added? The services would have to tell their status to thin-edge, to Cumulocity and to this third monitoring system.
  3. One benefit of thin-edge is to simplify the development of specific edge services by wrapping a local process with all the details required to manage this service from the cloud. A plugin should only provide local health checks, with thin-edge mappers translating these for the specific cloud systems.

didier-wenzek avatar Feb 02 '23 09:02 didier-wenzek

Query on the feature requirement itself: is the expectation with this feature to monitor tedge daemons only, or to monitor any service on the system? The example in the original ticket shows dockerd, hence double-checking. If external services also need to be monitored, who is responsible for that monitoring?

If the customer is only interested in monitoring tedge daemons, then I'd also be in favour of proposal 1 or 3 (both sounded the same, unless I missed some subtle variations) to avoid pushing cloud-specific responsibilities onto all the daemons. Even though most of the existing daemons are c8y-specific, there are exceptions like the collectd-mapper and tedge-agent, for which we need a different solution anyway.

albinsuresh avatar Feb 03 '23 07:02 albinsuresh

Query on the feature requirement itself: is the expectation with this feature to monitor tedge daemons only, or to monitor any service on the system? The example in the original ticket shows dockerd, hence double-checking. If external services also need to be monitored, who is responsible for that monitoring?

If the customer is only interested in monitoring tedge daemons, then I'd also be in favour of proposal 1 or 3 (both sounded the same, unless I missed some subtle variations) to avoid pushing cloud-specific responsibilities onto all the daemons. Even though most of the existing daemons are c8y-specific, there are exceptions like the collectd-mapper and tedge-agent, for which we need a different solution anyway.

The mapper should be capable of translating any service status messages published to tedge/health/<service_name>. Actively monitoring every service running on the device is not in scope, though. But this enables users to write their own specific listener which knows how to get the service status (e.g. it can run docker ps to check which containers are running) and then publish it to the topic. It would then also be up to the listener to respond to the "refresh health check status" topic, tedge/health-check/<service_name>.
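
A sketch of what such a user-written listener could look like (an assumption only: it uses the paho-mqtt Python client and the Docker CLI, and the container name is made up):

# Sketch only: report a container's status on tedge/health/docker and refresh it
# whenever a request arrives on tedge/health-check/docker.
import json
import subprocess
import time
import paho.mqtt.client as mqtt

SERVICE = "docker"           # illustrative service name
CONTAINER = "my-container"   # illustrative container to watch

def current_status():
    # `docker inspect -f '{{.State.Running}}' <name>` prints "true" if the container runs
    try:
        out = subprocess.check_output(
            ["docker", "inspect", "-f", "{{.State.Running}}", CONTAINER], text=True
        ).strip()
        return "up" if out == "true" else "down"
    except subprocess.CalledProcessError:
        return "down"

def publish_status(client):
    payload = json.dumps({"status": current_status(), "time": time.time()})
    client.publish(f"tedge/health/{SERVICE}", payload, retain=True)

def on_message(client, userdata, msg):
    # Respond to a "refresh health check status" request
    publish_status(client)

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe(f"tedge/health-check/{SERVICE}")
publish_status(client)
client.loop_forever()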

@PradeepKiruvale Though I think we should extend the topic structure to also support child devices, e.g. support the following topic structures for clients to publish the service status to:

tedge/health/<service_name>
tedge/health/<service_name>/<child-id>

reubenmiller avatar Feb 03 '23 08:02 reubenmiller

I think we should extend the topic structure to also support child devices, e.g. support the following topic structures for clients to publish the service status to:

tedge/health/<service_name>
tedge/health/<service_name>/<child-id>

This is a good point. However, we need to define a general naming scheme for topic names. For measurements and alarms the logic is already different.

didier-wenzek avatar Feb 03 '23 08:02 didier-wenzek

I can confirm that the service monitoring works for both the thin-edge device and child devices. One has to use the c8y/s/us or c8y/s/us/<external-child-id> topic to send the service status updates of the thin-edge device and child devices respectively.
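
For example (a sketch with made-up values; "child-001" stands for the child device's external id, and the publish assumes the paho-mqtt Python client):

# Sketch only: send a service status for a child device via SmartREST 102.
import paho.mqtt.publish as publish

publish.single(
    "c8y/s/us/child-001",                          # c8y/s/us/<external-child-id>
    "102,child-001_node-red,service,node-red,up",  # 102,<external id>,<type>,<name>,<status>
    hostname="localhost",
)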

PradeepKiruvale avatar Feb 08 '23 08:02 PradeepKiruvale

I think we should extend the topic structure to also support child devices, e.g. support the following topic structures for clients to publish the service status to:

tedge/health/<service_name>
tedge/health/<service_name>/<child-id>

This is a good point. However, we need to define a general naming scheme for topic names. For measurements and alarms, the logic is already different.

Considering the security point of view, it's better to re-arrange the MQTT topics so that ACLs can be applied easily. So, I propose the schema below for the topics:

tedge/<feature-type> for the thin-edge device
tedge/<feature-type>/<child-id> for the child devices

The supported features are measurement, health, alarm, event, configuration, etc.

If we follow this schema, for example, an MQTT client that just requires access to the tedge/health/# topics can easily be granted just that. Or if an MQTT client needs access to a specific child device's topics, one can easily grant access just to tedge/<child-id>/#
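
For example, with such a layout a mosquitto ACL file could grant a monitoring client access to the health topics only (a sketch; the usernames are made up and the topic layout assumes the schema proposed above):

# Sketch of mosquitto acl_file entries, assuming the proposed topic layout.
user health-monitor
topic read tedge/health/#

# A client acting for one specific child device only.
user child-001-client
topic readwrite tedge/measurement/child-001
topic readwrite tedge/health/child-001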

PradeepKiruvale avatar Feb 09 '23 06:02 PradeepKiruvale

I think we should extend the topic structure to also support child devices, e.g. support the following topic structures for clients to publish the service status to:

tedge/health/<service_name>
tedge/health/<service_name>/<child-id>

This is a good point. However, we need to define a general naming scheme for topic names. For measurements and alarms, the logic is already different.

Considering the security point of view, it's better to re-arrange the MQTT topics so that ACLs can be applied easily. So, I propose the schema below for the topics:

tedge/<feature-type> for the thin-edge device
tedge/<feature-type>/<child-id> for the child devices

The supported features are measurement, health, alarm, event, configuration, etc.

If we follow this schema, for example, an MQTT client that just requires access to the tedge/health/# topics can easily be granted just that. Or if an MQTT client needs access to a specific child device's topics, one can easily grant access just to tedge/<child-id>/#

Yes, we need to revisit the structure of all topics (mainly to focus on how it can better facilitate writing ACL rules). But I would cover this work in a separate ticket.

reubenmiller avatar Feb 09 '23 06:02 reubenmiller

The final proposal is:

  • Use the tedge-mapper-c8y for translating and sending the services' health status to the c8y cloud

  • The status is sent only on a state change, i.e. from up to down or down to up. The frequency can also be made configurable (maybe later).

  • The health status message for the primary device has to be sent on tedge/health/<service-name>

  • The health status message for the child devices has to be sent on tedge/health/<child-id>/<service-name>

  • The service name here is either a thin-edge service or any other service that needs to be monitored on the primary/child device.

  • The health message must be sent in this {"pid": "process id of the service", "type": "service type", "time": "time of the status", "status": "up/down"} format
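
For example, an illustrative health message for a service on the main device (the values are made up; only the field names come from the format above, and the publish assumes the paho-mqtt Python client):

# Sketch only: publish a health status message in the agreed format.
import json
import time
import paho.mqtt.publish as publish

payload = json.dumps({
    "pid": 1234,               # illustrative process id of the service
    "type": "systemd",         # illustrative service type
    "time": int(time.time()),  # time of the status
    "status": "up",            # "up" or "down"
})
publish.single("tedge/health/tedge-mapper-c8y", payload, hostname="localhost")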

PradeepKiruvale avatar Feb 09 '23 16:02 PradeepKiruvale