Make collectd mapper batching externally testable
Is your feature improvement request related to a problem? Please describe.
The batching of the collectd mapper is controlled by two batching parameters: `batching_window` (`event_jitter` in code) and max message delay (`delivery_jitter`), with default values of 500ms and 400ms respectively. But these parameters cannot be changed from outside, which makes it extremely difficult to deterministically test the batching behaviour from outside, as it is almost impossible to send multiple collectd messages externally and make sure that they are delivered within that 500ms window. The fixed `max_message_delay` of 400ms makes it worse, as it can lead to any one of the input messages being dropped if it arrives just 400ms late.
Describe the solution you'd like
The most obvious solution is to make both the batching window and the max message delay configurable via `tedge config`, so that we can update these values to testable limits.
The problem with this solution is the overlap it has with the configurability already provided by collectd for the metrics collection interval, with which the user can control how frequently these measurements are generated and hence influence the batching behaviour. Providing additional configurations to control the batching behaviour poses the risk of exposing too many parameters to the end user, leading them to get it wrong (we've had this experience in the past, while trying to tune these parameters for our RaspberryPi slaves).
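For illustration only, a minimal sketch of what exposing these two parameters could look like on the Rust side, with hypothetical names and the current defaults; the values would be populated from `tedge config` instead of being hard-coded:

```rust
use std::time::Duration;

/// Hypothetical sketch: the two batching parameters gathered into a
/// configurable struct instead of hard-coded constants (names are
/// illustrative, not the actual thin-edge.io types).
#[derive(Debug, Clone, Copy)]
pub struct BatchingConfig {
    /// Batching window (`event_jitter` in code), default 500ms.
    pub batching_window: Duration,
    /// Max message delay (`delivery_jitter` in code), default 400ms.
    pub max_message_delay: Duration,
}

impl Default for BatchingConfig {
    fn default() -> Self {
        Self {
            batching_window: Duration::from_millis(500),
            max_message_delay: Duration::from_millis(400),
        }
    }
}
```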
Describe alternatives you've considered
Drop the concept of max message delay completely and batch every input that arrives within the batching window after the receipt of the first message. With this, we can limit the configurability to just the batching window.
The main risk with dropping this parameter is the possibility of newer batches being polluted with older measurements in some extremely slow environments.
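As an illustration of this alternative, here is a minimal sketch (hypothetical types and names, not the actual mapper code) where a batch is opened by the first message received and simply collects everything that arrives within the batching window, with nothing ever dropped:

```rust
use std::time::{Duration, Instant};

/// A batch opened by the receipt of its first message.
struct Batch<M> {
    opened_at: Instant,
    messages: Vec<M>,
}

/// Minimal sketch of the alternative: only the batching window is
/// configurable, and there is no max message delay.
struct Batcher<M> {
    batching_window: Duration,
    current: Option<Batch<M>>,
}

impl<M> Batcher<M> {
    fn new(batching_window: Duration) -> Self {
        Self { batching_window, current: None }
    }

    /// Collect the message into the current batch if it arrives within the
    /// batching window of the first message; otherwise close the current
    /// batch, return it, and open a new batch with the incoming message.
    fn receive(&mut self, message: M, now: Instant) -> Option<Vec<M>> {
        let within_window = self
            .current
            .as_ref()
            .map(|batch| now.duration_since(batch.opened_at) < self.batching_window)
            .unwrap_or(false);

        if within_window {
            self.current.as_mut().unwrap().messages.push(message);
            None
        } else {
            let finished = self.current.take().map(|batch| batch.messages);
            self.current = Some(Batch { opened_at: now, messages: vec![message] });
            finished
        }
    }
}
```

Note that batching here is purely by processing time, which is exactly why older measurements can end up in newer batches on a very slow system.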
> Drop the concept of max message delay completely and batch every input that arrives within the batching window after the receipt of the first message. With this, we can limit the configurability to just the batching window.
To be more precise, one has to stop throwing away messages that are processed too late. I stress the words "processed too late" because I see that as the main issue: messages are discarded only because the mapper is late for some reason, making things really hard to understand and test.
- Currently, a message is dropped if `wall-clock-processing-time > message-source-timestamp + max-delay` (sketched below).
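For reference, the same rule as a tiny Rust sketch (illustrative names only):

```rust
use std::time::{Duration, SystemTime};

/// Sketch of the current drop rule: a message is discarded when the mapper
/// only gets to process it more than `max_delay` after its source timestamp.
fn is_dropped(processing_time: SystemTime, source_timestamp: SystemTime, max_delay: Duration) -> bool {
    processing_time > source_timestamp + max_delay
}
```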
> The main risk with dropping this parameter is the possibility of newer batches being polluted with older measurements in some extremely slow environments.
There is no such risk if the batching time window is relative to the message source timestamps and not to the processing time. A batch will only contain messages for a given time window. However, one can have two batches for the same time window, the second batch containing the laggers.
There is a tricky point though: how can the collectd mapper determine that no more messages will arrive late for a given time window? This is where a max delay is useful - but used as a timeout. When `wall-clock-processing-time > end-of-source-time-window` and no message has been received within this max delay, the batch can be closed.
If the aim is to avoid two batches for the same time window, the collectd mapper needs to maintain a watermark: all the batches older than that watermark have already been sent.
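To make this concrete, here is a hedged sketch of the scheme, with all names hypothetical: batches are keyed by source-time windows, the max delay acts as a grace period after the end of a window before its batch is closed (one possible reading of the timeout described above), and a watermark prevents a second batch from being emitted for a window that has already been sent:

```rust
use std::collections::BTreeMap;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical sketch of source-time-window batching with a watermark.
struct WindowedBatcher<M> {
    window: Duration,
    max_delay: Duration,
    /// Open batches, keyed by the window index of their source timestamps.
    open: BTreeMap<u64, Vec<M>>,
    /// Highest window index already emitted (the watermark).
    watermark: Option<u64>,
}

impl<M> WindowedBatcher<M> {
    fn new(window: Duration, max_delay: Duration) -> Self {
        Self { window, max_delay, open: BTreeMap::new(), watermark: None }
    }

    /// Map a source timestamp to its window index.
    fn window_index(&self, t: SystemTime) -> u64 {
        let since_epoch = t.duration_since(UNIX_EPOCH).unwrap_or_default();
        since_epoch.as_millis() as u64 / self.window.as_millis() as u64
    }

    /// Route a message into the batch of its source-time window. Messages
    /// whose window is at or below the watermark are laggers for a batch
    /// that has already been sent; here they are simply ignored, but they
    /// could equally be emitted as a late "lagger" batch.
    fn receive(&mut self, message: M, source_timestamp: SystemTime) {
        let idx = self.window_index(source_timestamp);
        if self.watermark.map_or(false, |w| idx <= w) {
            return;
        }
        self.open.entry(idx).or_default().push(message);
    }

    /// Close and return every batch whose source-time window ended more
    /// than `max_delay` ago (the timeout role of the max delay), advancing
    /// the watermark accordingly.
    fn close_expired(&mut self, now: SystemTime) -> Vec<Vec<M>> {
        let window_ms = self.window.as_millis() as u64;
        let mut closed = Vec::new();
        for idx in self.open.keys().copied().collect::<Vec<_>>() {
            let window_end = UNIX_EPOCH + Duration::from_millis((idx + 1) * window_ms);
            if now > window_end + self.max_delay {
                closed.push(self.open.remove(&idx).unwrap());
                self.watermark = Some(self.watermark.map_or(idx, |w| w.max(idx)));
            }
        }
        closed
    }
}
```

With this shape, a slow mapper only delays the closing of batches; measurements are lost only if they arrive after the watermark has already passed their window.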