Make collectd mapper batching externally testable

Open albinsuresh opened this issue 1 year ago • 1 comments

Is your feature improvement request related to a problem? Please describe.

The batching of the collectd mapper is controlled by two parameters: the batching window (event_jitter in code) and the max message delay (delivery_jitter), with default values of 500ms and 400ms respectively. These parameters cannot be changed from outside, which makes it extremely difficult to test the batching behaviour deterministically: it is almost impossible to send multiple collectd messages from outside and make sure that they are all delivered within that 500ms window. The fixed max message delay of 400ms makes it worse, as any one of the input messages can be dropped if it arrives just over 400ms late.

Describe the solution you'd like

The most obvious solution is to make both batching window and max message delay configurable via tedge config so that we can update these values to testable limits.

The problem with this solution is its overlap with the configurability that collectd already provides for the metrics collection interval, with which the user can control how frequently these measurements are generated and hence influence the batching behaviour. Providing additional configurations to control the batching behaviour poses the risk of exposing too many parameters to the end user, leading them to get it wrong (we've had this experience in the past, while trying to tune these parameters for our RaspberryPi slaves).

Describe alternatives you've considered

Drop the concept of max message delay completely and batch every input that arrives within the batching window after the receipt of the first message. With this, we can limit the configurability to just the batching window.

The main risk with dropping this parameter is the possibility of newer batches being polluted with older measurements in some extremely slow environments.

albinsuresh avatar May 05 '23 14:05 albinsuresh

Drop the concept of max message delay completely and batch every input that arrives within the batching window after the receipt of the first message. With this, we can limit the configurability to just the batching window.

To be more precise, one has to stop throwing away messages that are processed too late. I stress the words "processed too late" because I see that as the main issue: messages are discarded only because the mapper is late for some reason, making things really hard to understand and test.

  • Currently a message is dropped if wall-clock-processing-time > message-source-timestamp + max-delay
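The drop rule above can be written as a one-line predicate. This is only a minimal sketch, not the mapper's actual code: the function name, the parameter names and the use of milliseconds are assumptions made for illustration.

```rust
/// Hypothetical helper (not part of thin-edge.io): returns true when a message
/// would be dropped under the rule quoted above.
/// All times are assumed to be milliseconds since a common epoch.
fn is_dropped(processing_time_ms: u64, source_timestamp_ms: u64, max_delay_ms: u64) -> bool {
    processing_time_ms > source_timestamp_ms + max_delay_ms
}
```

With the default max delay of 400ms, a message stamped at t=1000 is kept when processed at t=1400 and dropped when processed at t=1401, even though nothing about the message itself has changed.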

The main risk with dropping this parameter is the possibility of newer batches being polluted with older measurements in some extremely slow environments.

There is no such risk if the batching time window is relative to the message source timestamps and not the processing time. A batch will only contain messages for a given time window. However, one can have two batches for the same time window, the second batch containing the laggers.
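To illustrate batching on source timestamps, here is a minimal sketch that assigns each message to a window derived only from its own timestamp. The Measurement type, the function names and the fixed-size, epoch-aligned windows are all assumptions made for this example, not the mapper's actual design.

```rust
use std::collections::BTreeMap;

/// Hypothetical measurement: a source timestamp in milliseconds and a value.
#[derive(Debug, Clone, PartialEq)]
struct Measurement {
    source_timestamp_ms: u64,
    value: f64,
}

/// Index of the time window a source timestamp falls into.
/// Assumes `window_ms` is non-zero and windows are aligned on the epoch.
fn window_index(source_timestamp_ms: u64, window_ms: u64) -> u64 {
    source_timestamp_ms / window_ms
}

/// Group messages by source-time window, regardless of the order
/// in which they happened to be processed.
fn group_by_window(messages: &[Measurement], window_ms: u64) -> BTreeMap<u64, Vec<Measurement>> {
    let mut batches: BTreeMap<u64, Vec<Measurement>> = BTreeMap::new();
    for message in messages {
        let index = window_index(message.source_timestamp_ms, window_ms);
        batches.entry(index).or_default().push(message.clone());
    }
    batches
}
```

Because the window depends only on the source timestamp, a message that is processed late still lands in the window it belongs to and cannot pollute a newer batch.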

There is a tricky point though. How can the collectd mapper determine that no more messages will arrive late for a given time window? This is where a max delay is useful - but used as a timeout. When wall-clock-processing-time > end-of-source-time-window and no message has been received within this max delay, the batch can be closed.
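One possible reading of this closing rule, as a minimal sketch: a batch is closed once the wall clock has passed the end of its source-time window and nothing has been received for at least the max delay. The function and parameter names are hypothetical, and measuring the quiet period from the last received message is an assumption, since the comment does not spell that detail out.

```rust
/// Hypothetical check (not the mapper's actual code): can the batch for a
/// given source-time window be closed at wall-clock time `now_ms`?
/// All times are assumed to be milliseconds since a common epoch.
fn can_close(now_ms: u64, window_end_ms: u64, last_received_ms: u64, max_delay_ms: u64) -> bool {
    // The source-time window must be over according to the wall clock...
    let window_is_over = now_ms > window_end_ms;
    // ...and no message must have been received during the last `max_delay_ms`.
    let quiet_long_enough = now_ms.saturating_sub(last_received_ms) >= max_delay_ms;
    window_is_over && quiet_long_enough
}
```

Used this way, the max delay no longer decides whether a message is thrown away; it only decides how long the mapper waits before sending a batch.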

If the aim is to avoid two batches for the same time window, the collectd mapper needs to maintain a watermark: all the batches older than that watermark having already been sent.
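A watermark of this kind could be tracked as sketched below. The type and method names are hypothetical, and windows are identified by an integer index purely for illustration. What to do with a message that falls behind the watermark is not decided in this discussion, so the sketch only reports whether a message can still be batched.

```rust
/// Hypothetical watermark tracker: every window with an index below
/// `next_open_window` has already been sent.
struct Watermark {
    next_open_window: u64,
}

impl Watermark {
    fn new() -> Self {
        Watermark { next_open_window: 0 }
    }

    /// A message can still be batched only if its window has not been sent yet.
    fn accepts(&self, window: u64) -> bool {
        window >= self.next_open_window
    }

    /// Record that all windows up to and including `window` have been sent.
    /// The watermark never moves backwards.
    fn mark_sent(&mut self, window: u64) {
        self.next_open_window = self.next_open_window.max(window + 1);
    }
}
```

Since the watermark only moves forward, a window that has been sent is never reopened, which is what rules out a second batch for the same time window.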

didier-wenzek avatar May 10 '23 10:05 didier-wenzek