thin-edge.io
Adapt monitoring of daemons to new Service Monitoring Feature in 10.14
Is your feature improvement request related to a problem? Please describe. In 10.14 the possibility to monitor the current status of services was introduced.
Those services can also contain measurements, events and alarms. Currently the health monitoring of tedge daemons is only local and not connected to those services.
Additional context The service is modeled as a child addition on the device. Measurements/alarms/events are sent to that child addition. Link
There also exists a SmartREST template 102 for the service status.
great suggestion, we will review and decide when to work on this
Created one example with python here:
https://github.com/thin-edge/thin-edge.io_examples/tree/main/watchdog-service-mapper
Thank you! This will help to understand the new feature and evaluate what needs to be added on the thin-edge side.
@mbay-ODW Thanks for your PoC. It worked for me. But I have a couple of questions:
- You used systemd for this. We can do it without that as well, right?
- The watchdog messages displayed on the c8y cloud are not just the most recent ones but also the old ones. Is this expected?
- The most recent watchdog message is displayed with a green tick, but the old status is shown with a red tick. What exactly does that mean, any idea?
I worked with @reubenmiller to understand this feature. The historical display of statuses happens because the pid is sent with the status message and used as the external id for the service (a managed object in the digital twin in c8y), as in client.publish("c8y/s/us",f'102,{pid},{service},{name},{status}'). The service name should always be unique, so the pid has to be replaced with device_name_service_name. This gives a unique name for a given device and stays the same even if the service restarts and gets a new pid for whatever reason.
Also, the service name must be unique across devices, so it is better to use the device_name_service_name naming scheme while sending the status to the c8y cloud.
There is an error in the documentation where the field order of the SmartREST template was mixed up. I will update the service according to your suggestions.
As part of this task, the proposed solution here is to implement a separate monitor service that checks the health of the thin-edge services and reports the status on behalf of these services to the c8y cloud. The frequency at which the status needs to be sent must be configurable and can also be based on other constraints like state changes etc.
Yes, I agree. Ideally via top or collectd. I will check that and make a proposal.
A separate monitor service that checks the health of the thin-edge services and reports the status on behalf of these services to the c8y cloud.
Why yet another service? All the tedge daemons already feature a health check. This can be leveraged by the c8y mapper to send the service status to c8y. A specific last will message might be required to notify that the c8y mapper itself is down.
Yes, that can be done. There are two reasons for proposing a separate service for the monitoring itself.
- This is to be aligned with our single plugin per operation. I feel the monitoring service is an operation as well, right?
- It's always better to have a separate service/daemon that watches the other services. As I understand it, this is a generic watchdog design principle.
I'm not against the MQTT last will being used if it can contain all of the relevant information that is required in the service down message.
If it is not implemented using the MQTT last will then the watchdog service feels like a good fit rather than creating a new service. The only drawback is that the watchdog service is heavily coupled with systemd so it can't be used for non-systemd environments...so that would be another benefit of using the MQTT last will in each service.
The health-check mechanism can work without systemd. It just needs to send a request on the c8y/health-check topic whenever it wants to know the status of a tedge service (daemon) and get back a reply from that specific tedge daemon (irrespective of how it was started) on a specified health-check response topic.
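The request/response flow described above can be sketched as follows. The topic names mirror the thread's description, but the helper functions and the paho-mqtt wiring in the comments are illustrative assumptions, not the actual thin-edge implementation:

```python
# Sketch of the broker-based health-check flow, assuming one request topic
# per service and a matching health response topic (names are illustrative).

def health_check_request_topic(service: str) -> str:
    """Topic a watcher publishes on to ask a daemon for its current status."""
    return f"tedge/health-check/{service}"

def health_status_topic(service: str) -> str:
    """Topic the daemon replies on, irrespective of how it was started."""
    return f"tedge/health/{service}"

# With an MQTT client (e.g. paho-mqtt, assumed here) the watcher would do:
#   client.publish(health_check_request_topic("tedge-agent"), "")
# and the daemon, subscribed to the request topic, would answer with:
#   client.publish(health_status_topic("tedge-agent"), '{"status": "up"}')
```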
This is to be aligned with our single plugin per operation. I feel the monitoring services is an operation as well right?
Mid-term we will reconsider the approach of "a plugin per operation". This has been done so far to ease development, removing conflicts on the mapper, but it introduces too many issues related to packaging / reliability / resource consumption.
It's always better to have a separate service/daemon that watches the other services. As I understand its a generic watchdog design principle
It's a good principle, indeed. What I propose is to use the c8y mapper to monitor all the other tedge daemons. One needs a different mechanism to monitor the mapper itself. I propose to use last will messages for that: i.e. let the MQTT broker monitor the c8y mapper.
I'm not against the MQTT last will being used if it can contain all of the relevant information that is required in the service down message.
The last will message would be used only for the c8y mapper itself. The regular thin-edge health check mechanism can be used for all the other thin-edge daemons.
If it is not implemented using the MQTT last will then the watchdog service feels like a good fit rather than creating a new service. The only drawback is that the watchdog service is heavily coupled with systemd so it can't be used for non-systemd environments...so that would be another benefit of using the MQTT last will in each service.
I would really avoid tying this feature to systemd.
The health-check mechanism can work without systemd. It just needs to send a request on the c8y/health-check topic whenever it wants to know the status of a tedge service (daemon) and get back a reply from a specific tedge daemon (irrespective of how it started) on a specified health-check response topic.
I don't understand the data flow you are describing. Which component triggers the check?
The trigger could come from the tedge-mapper-c8y if we use it to monitor the tedge services. It would regularly (the frequency can be made configurable) check the health of the services and then push the status to the cloud.
I changed the example to:
client.publish("c8y/s/us",f'102,{device_id}_{name},{service},{name},{status}')
The device_id is extracted via:
"tedge config get device.id"
Thus when the service restarts it will not create a new service.
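Putting the two pieces together, a hedged sketch of the fixed naming scheme. The helper names are mine; only the `102` payload layout and the `tedge config get device.id` call come from the thread:

```python
# Derive a stable external id (device_id_name) instead of the pid, so a
# service restart does not create a new service object in Cumulocity.
import subprocess

def get_device_id() -> str:
    """Reads the device id via the tedge CLI ("tedge config get device.id")."""
    return subprocess.run(
        ["tedge", "config", "get", "device.id"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def smartrest_102(device_id: str, service_type: str, name: str, status: str) -> str:
    """SmartREST 102 service-status payload with the stable external id."""
    return f"102,{device_id}_{name},{service_type},{name},{status}"

# client.publish("c8y/s/us", smartrest_102(get_device_id(), "systemd", "watchdog", "up"))
```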
Below are my proposals to implement the feature,
Proposal 1: Using tedge-mapper-c8y to update the status of the services
Use the tedge-mapper-c8y to monitor the services locally and push/forward the status of these services to the c8y cloud.
Each service will update its status on the tedge/health/<service-name> topic; the mapper will take this status, convert it to a service monitoring message, and forward it to c8y.
If a tedge service is up and running it will publish the status up, and if it goes down it will publish down on the specific MQTT topic; this will be converted to a SmartREST status message and forwarded to c8y.
If the tedge-mapper-c8y itself goes down then it still has to update its status to c8y. For this, a last will message (LWM) has to be registered on the c8y/s/us topic with the 102 SmartREST template message, so that the broker forwards this LWM when the tedge-mapper-c8y service goes down.
Note: Here the tedge-mapper-c8y has to be registered with two LWMs: one for sending out the down health status on the tedge/health/tedge-mapper-c8y topic and one to send the status to c8y on c8y/s/us.
Pros:
- A single service will push out the service statuses to the c8y cloud. This will act as a watchdog for the other services.
- Only one LWM has to be registered, to send its own status when it goes down.
Cons:
- When it goes down it cannot update the status of the other services to the c8y cloud.
- It has to process the health status messages from the other services and forward them to the c8y cloud; this is an extra burden for the mapper, which is already doing many things.
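A minimal sketch of the Proposal 1 translation step. The function and the device_id-based external-id scheme are assumptions carried over from the earlier discussion; only the topics and the 102 template come from the proposal itself:

```python
# Translate a status received on tedge/health/<service-name> into the
# SmartREST 102 message the mapper would forward on c8y/s/us.

def translate_health_to_c8y(topic, status, device_id, service_type="service"):
    service = topic.split("/")[-1]            # tedge/health/<service-name>
    external_id = f"{device_id}_{service}"    # stable id, survives restarts
    return "c8y/s/us", f"102,{external_id},{service_type},{service},{status}"
```

For example, `translate_health_to_c8y("tedge/health/tedge-agent", "up", "mydevice")` yields `("c8y/s/us", "102,mydevice_tedge-agent,service,tedge-agent,up")`.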
Proposal 2: Using each individual service itself to send the status to the c8y cloud
Here each and every thin-edge c8y service will send its status to the cloud by itself. As it publishes the status on the tedge/health/<service-name> topic it can also send the status to c8y on c8y/s/us with the SmartREST template 102.
When going down, these services will send out the down health status on the tedge/health/<service-name> topic.
In a similar way, an LWM has to be registered on the c8y/s/us topic to send the down status to the c8y cloud when the service goes down.
Pros: Each service can send its status by itself to the c8y cloud, so it does not depend on another service.
Cons: Every service has to register an extra LWM for sending the status message to the c8y cloud.
Actually I would propose a mix of the two proposals. Also, the official term is "Last Will and Testament", so the abbreviation is "LWT" message (reference).
Proposal 3: Use an LWT message for each service, but the mapper still does the local message to cloud translation
- Each service should publish on the MQTT topic tedge/health/<service-name> and register an LWT message (so that when the service's MQTT client gets disconnected, the service "down" signal will be sent). This is advised because you can't trust the service to send the message when it is going down (e.g. a SIGKILL will immediately kill a process without warning).
- The cloud mapper (e.g. tedge-mapper-c8y) will subscribe to the health messages and translate them to the appropriate cloud provider format. This ensures that each service does not need to know which cloud provider is connected.
- The cloud mapper registers its own cloud-specific LWT message (which goes via the mosquitto bridge)
Pros:
- LWT message is sent even if the service's process is not running (as long as it did not die before registering the LWT message). It does not rely on a service to do the right thing.
- Only the cloud mapper needs to know cloud specific formats (which it already knows because it is the mapper ;))
Cons:
- If the cloud mapper service goes down, only its own LWT "service down" message will be sent to the cloud (not those of all the services it is responsible for mapping). But this is ok, as it is reasonable that if a central service goes down, you can't expect all of the other services to be fully functional.
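The LWT registration in Proposal 3 could look roughly like this. paho-mqtt is used purely as an illustration of the mechanism; the actual thin-edge daemons are not Python:

```python
# Build the topic and retained payload the broker should send on behalf of a
# service whose MQTT client disconnects ungracefully (e.g. after a SIGKILL).
import json

def down_will(service: str):
    topic = f"tedge/health/{service}"
    payload = json.dumps({"status": "down"})
    return topic, payload

# With paho-mqtt (assumed dependency) the service would register the will
# before connecting, then publish its "up" status itself:
#   topic, payload = down_will("tedge-agent")
#   client.will_set(topic, payload, qos=1, retain=True)
#   client.connect("localhost")
#   client.publish(topic, json.dumps({"status": "up"}), qos=1, retain=True)
```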
I think proposal 3 is a rephrasing of proposal 1. Indeed, the services are already sending last will messages to tedge/health/xxx. What is not clear: is C8Y expecting regular health messages? Are an initial up message and a terminal down message enough?
Proposal 2 is a no-go as all services would then have a dependency on c8y.
;-) the MQTT specifications simply use the term "Will Message"
The frequency at which the status needs to be updated may be left to the user. By default, maybe we can just update only when the status changes from up to down or vice versa.
In proposal 2 except for tedge-agent
all the other services are specific to c8y
right? So, I feel we can still consider this one.
In proposal 2 except for tedge-agent all the other services are specific to c8y right? So, I feel we can still consider this one.
I disagree. There are several reasons against the idea of having each service send its own status to Cumulocity:
- If a service is unrelated to Cumulocity then it must have no dependency on it. Having currently a single service that is truly independent changes nothing.
- What if support for another monitoring system is added? The services would have to tell their status to thin-edge, to Cumulocity and to this third monitoring system.
- One benefit of thin-edge is to simplify the development of specific edge services by wrapping a local process with all the details required to manage this service from the cloud. A plugin should only provide local health checks, with the thin-edge mappers translating these to specific cloud systems.
Query on the feature requirement itself: is the expectation with this feature to monitor tedge daemons only, or to monitor any service on the system? The example in the original ticket shows dockerd, hence double checking. If external services also need to be monitored, who's responsible for that monitoring?
If the customer is only interested in monitoring tedge daemons, then I'd also be in favour of proposal 1 or 3 (both sounded the same, unless I missed some subtle variations) to avoid overloading cloud specific responsibilities onto all the daemons. Even though most of the existing daemons are c8y specific, there are exceptions like the collectd-mapper and tedge-agent as well, for which we need a different solution anyway.
The mapper should be capable of translating any service status messages published to tedge/health/<service_name>. It is not in scope, though, to actively monitor any service running on the device. But it enables users to write their own specific listener which knows how to get the service status (e.g. it can run docker ps to check which containers are running) and then publish to the topic. It would also be up to the listener to respond on the "refresh health check status" topic, tedge/health-check/<service_name>.
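Such a user-written listener could look roughly like this. Everything here (function names, parsing, payload) is hypothetical; thin-edge does not ship a Docker listener:

```python
# Poll `docker ps` and publish one "up" status per running container to
# tedge/health/<service_name>. Parsing and topic layout are illustrative.
import subprocess

def running_containers(ps_output: str):
    """Parses the output of `docker ps --format '{{.Names}}'` (one name per line)."""
    return [line.strip() for line in ps_output.splitlines() if line.strip()]

def status_messages(names):
    return [(f"tedge/health/{name}", '{"status": "up"}') for name in names]

# out = subprocess.run(["docker", "ps", "--format", "{{.Names}}"],
#                      capture_output=True, text=True).stdout
# for topic, payload in status_messages(running_containers(out)):
#     client.publish(topic, payload)
```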
@PradeepKiruvale I think we should extend the topic structure to also support child devices, e.g. support the following topic structures for clients to publish the service status to:
tedge/health/<service_name>
tedge/health/<service_name>/<child-id>
I think we should extend the topic structure to also support child devices, e.g. support the following topic structures for clients to publish the service status to:
tedge/health/<service_name> tedge/health/<service_name>/<child-id>
This is a good point. However, we need to define a general naming scheme for topic names. For measurements and alarms, the logic is already different.
I can confirm that the service monitoring works for both the thin-edge device and child devices. One has to use the c8y/s/us or c8y/s/us/<external-child-id> topic to send the service status updates of the thin-edge device and child devices respectively.
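For illustration, the topic choice confirmed above can be captured in a small helper (the function is mine; only the two topics come from the thread):

```python
# Pick the SmartREST destination topic: c8y/s/us for the thin-edge device
# itself, c8y/s/us/<external-child-id> for a child device.

def c8y_status_topic(child_external_id=None):
    if child_external_id is None:
        return "c8y/s/us"
    return f"c8y/s/us/{child_external_id}"
```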
Considering the security POV, it's better to re-arrange the MQTT topics so that ACLs can be applied easily. So, I propose the below schema for the topics:
tedge/<feature-type> for the thin-edge device
tedge/<feature-type>/<child-id> for the child devices
The supported features are measurement, health, alarm, event, configuration, etc.
If we follow this schema then, for example, an MQTT client that just requires access to the tedge/health/# topics can easily be given just that.
Or if an MQTT client needs access to a specific child device's topics, one can easily provide access just to tedge/<child-id>/#
Yes, we need to revisit the structure of all topics (mainly to focus on how it can better facilitate writing ACL rules). But I would cover this work in a separate ticket.
The final proposal is:
- Use the tedge-mapper-c8y for translating and sending the services' health status to the c8y cloud.
- The status is sent only on a state change, i.e. from up to down or down to up. The frequency can also be made configurable (maybe later).
- The health status message for the primary device has to be sent on tedge/health/<service-name>.
- The health status message for the child devices has to be sent on tedge/health/<child-id>/<service-name>.
- The service name here is either a thin-edge service or any other service that needs to be monitored on the primary/child device.
- The health message must be sent in this format: {"pid": "process id of the service", "type": "service type", "time": "time of the status", "status": "up/down"}
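A sketch of a conforming health message, assuming JSON with the listed fields. The exact time format was not pinned down in the thread; a Unix timestamp is used here as a placeholder:

```python
# Build the {"pid", "type", "time", "status"} health payload agreed in the
# final proposal. Field values here are illustrative.
import json
import os
import time

def health_message(status: str, service_type: str = "service") -> str:
    return json.dumps({
        "pid": str(os.getpid()),       # process id of the service
        "type": service_type,          # service type
        "time": int(time.time()),      # time of the status change
        "status": status,              # "up" or "down"
    })

# client.publish("tedge/health/tedge-agent", health_message("up"), retain=True)
```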