
Conditional rate-limit event generation for parsers

Open kamil-certat opened this issue 4 months ago • 5 comments

Processing of data depends on bottlenecks in the whole workflow. At the same time, prepared events sit in queues, and some reports can produce a huge number of events. The current ways to rate-limit bot processing are sleeps between iterations and the wait expert, which can hold a message until a given condition is met (e.g. the queue size).

On the other hand, our queueing system is based primarily on Redis and is entirely in-memory, which is great for performance but leads to problematic behaviour when events accumulate: Redis can fill up the memory and cause OOM errors (both simply because of the number of live events and while performing an RDB backup). In addition, an OOM kill during an RDB backup leads to broken RDB files that can also fill up the disk space. Some workarounds are possible, e.g. using KeyDB with on-disk storage (although KeyDB seems to be abandoned).

The rate-limiting mechanisms currently available in IntelMQ fail to prevent this. They can only hold events, but cannot stop new ones from being generated once a message reaches a bot.

As a real example, ShadowServer has a few informative reports, like Device Identification. They can generate a huge number of events, and the parser works significantly faster than saving to the database does. We can use the wait bot to hold a report until the DB bot's queue is empty, but once the report has reached the parser, we cannot stop it from flooding the system.

As a solution, I propose building in optional, simple rate-limiting, similar to how the wait bot works: after sending a generated event to the pipeline, the bot's class could check the size of a queue and, if necessary, wait until it is free enough.

I would not like to implement it per bot, but at least directly in the ParserBot class, preferably in the generic Bot or Pipeline (as it is a Redis-related solution, other pipelines may not need it or may require a different one). This way, the rate-limiting could be used in any bot and slow down the production of new events whenever a designated bottleneck is not keeping up with the work.
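
A minimal sketch of what such a check could look like, using the redis-py client directly rather than any existing IntelMQ API - the function name, parameters and defaults below are just placeholders:

```python
import time

import redis  # redis-py client; IntelMQ's Redis pipeline stores queues as plain lists


def wait_for_queue(queue_name: str, max_size: int = 10_000,
                   poll_interval: float = 5.0,
                   host: str = "localhost", port: int = 6379, db: int = 2) -> None:
    """Block until the given Redis list (queue) shrinks below max_size.

    All names and defaults are illustrative; db=2 mirrors IntelMQ's usual
    pipeline database but should be adjusted to the actual setup.
    """
    conn = redis.Redis(host=host, port=port, db=db)
    while conn.llen(queue_name) >= max_size:
        time.sleep(poll_interval)


# Hypothetical usage inside a parser's processing loop: after sending an
# event to the destination queue, pause whenever the bottleneck queue
# (e.g. the SQL output bot's input queue) is too full.
# wait_for_queue("sql-output-queue", max_size=50_000)
```

Hooking such a check into the generic Bot or the Redis Pipeline class, rather than individual parsers, would keep it opt-in via a single threshold parameter.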

An alternative would be to provide more advanced conditions, like available RAM etc., but in my eyes that is too complicated a solution for a simple problem.

kamil-certat avatar Aug 26 '25 09:08 kamil-certat

I'm not sure if this is the right focus. Why not optimize the pipeline system itself, instead of working around its issues?

Also, which broker and broker version do you use? Redis/Valkey version 8 brought some major improvements, I think that's worth a try.
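
For reference, one quick way to check the broker version from Python, assuming the redis-py client and a default local instance (connection parameters are placeholders):

```python
import redis

# Placeholder connection details; adjust host/port/db to the actual broker.
conn = redis.Redis(host="localhost", port=6379)
# Valkey also exposes redis_version in INFO for compatibility.
print(conn.info("server")["redis_version"])
```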

sebix avatar Aug 26 '25 12:08 sebix

Possible duplicate of #709

sebix avatar Aug 26 '25 12:08 sebix

To better illustrate the problem, this is the pipeline:

[Image: pipeline diagram]

The problem arises under certain conditions, when the amount of messages exceeds the usual pattern we're prepared for. The clearest case is processing the historically ignored Device Identification reports - they produce hundreds of thousands of messages, and as the DB output bot does not work in bulk, PostgreSQL struggles to match the speed of the parser. It works fine under normal circumstances, but if we want to process some older reports, a few of them cause the memory to explode. What's worse, an OOM kill of Redis may later cause a restart of the parser or loading of an older Redis dump, effectively duplicating the data on the output.

I agree that the pipeline has to be optimized for the data flow. However, with no flow control we always have to adjust to the worst case - e.g. the maximum number of events that could potentially be produced - which leaves the risk of trouble in edge cases. With some simple rules we could optimize for the average case and lower the risk that an unexpected situation (e.g. DB downtime) completely crashes the system.

kamil-certat avatar Aug 28 '25 14:08 kamil-certat

Yeah, that's the same idea as in #709, isn't it? Called Flow Control or Backpressure.

sebix avatar Aug 28 '25 15:08 sebix

Yeah, pretty much the same, although #709 aims to solve it in general. I'm thinking of a much simpler solution for the case when we already know the potentially troublesome bots (which sidesteps the issue of deciding which bot to stop in order to avoid getting stuck).

I also like the general idea of limiting the Redis memory footprint, but I'm not sure it's the right solution in every case. E.g. we can have workflows that are critical and low-risk (so we don't want to slow them down), and some less critical workflows that could be stopped. In addition, the Redis footprint can also grow (although usually much less compared to event production) through things like caches (e.g. from the deduplicator).
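
As an aside, a memory-based condition like the one mentioned could be expressed with a check along these lines (a sketch using redis-py; the function name and the handling of an unset maxmemory are assumptions, not existing IntelMQ behaviour):

```python
import redis


def redis_memory_ratio(conn: redis.Redis) -> float:
    """Return used_memory / maxmemory, or 0.0 when no maxmemory limit is set."""
    info = conn.info("memory")  # INFO memory section: used_memory, maxmemory, ...
    maxmemory = info.get("maxmemory", 0)
    return info["used_memory"] / maxmemory if maxmemory else 0.0


# Hypothetical usage: only throttle a non-critical workflow when Redis is
# above 80% of its configured memory limit.
# if redis_memory_ratio(redis.Redis(db=2)) > 0.8: ...
```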

But I'm open to all options :)

kamil-certat avatar Aug 28 '25 15:08 kamil-certat