intelmq icon indicating copy to clipboard operation
intelmq copied to clipboard

[smtp_batch] Additional field for grouping

Open Lukas-Heindl opened this issue 8 months ago • 16 comments

Hi,

I have a usecase where I'd like to send batched E-Mails to someone with collected events for which I want to use the smtp_batch output bot. Then in this case it would be nice to break down the events for one contact further into multiple E-Mails. Would it be possible to add some option such that the group-by is not only done by source.abuse_contact but also by a field determined by the bot configuration? Eventually it would also be nice to use this field also included in the group-by in e.g. the subject, but that's just for the future.


I'm willing to submit this as a PR myself, I just wanted to create a place for input and also most important to check beforehand whether this has any chance of being merged.

So far smtp_batch inserts each event as value into redis with the bot_id + source.abuse_contact as key: https://github.com/certtools/intelmq/blob/a5f2f9c3499a02b0ac9aaf38a4e8ca13fd5cf769/intelmq/bots/outputs/smtp_batch/output.py#L77-L78

It then later uses redis.keys() to retrieve all keys with a certain prefix (bot_id): https://github.com/certtools/intelmq/blob/a5f2f9c3499a02b0ac9aaf38a4e8ca13fd5cf769/intelmq/bots/outputs/smtp_batch/output.py#L224

So the most straight forward way would be to simply extend the redis-key with the value of the field that should be included in the logical group-by. But we'd need to be careful, the retrieved redis-key (here mail_record) is still used to obtain the destination E-Mail address: https://github.com/certtools/intelmq/blob/a5f2f9c3499a02b0ac9aaf38a4e8ca13fd5cf769/intelmq/bots/outputs/smtp_batch/output.py#L282 In order to solve this I see two options

  1. separate the values used to derive the redis-key with a special character like | -> in this case we'd need to make sure this character doesn't occur in any of the values -> replacing (but probably modifying the values is a bad idea) or escaping (a bit more complex) needed
  2. we could switch to not using the retrieved redis-key anymore. Instead we know all retrieved events must have the same source.abuse_contact so we could simply get the value from an arbitrary event we retrieved from redis (same thing if we later want to gain access to the other field we included in the group-by). Decoupling the redis-key from the rest of the logic also should make it possible to simply hash the redis-key which avoids all trouble with the redis-key getting much longer with more keys involved in the group-by logic. So in this case the redis-key would look like this f"{self.key}{hash(source.abuse_contact, other_field1, other_field2, ...)}"

As I mentioned above I've got two questions

  1. are you open generally for such additions to the smtp_batch bot?
  2. if so, I'm happy about any input you have regarding my sketch/draft of an implementation (I think, I'd probably go for the 2nd approach described above)

Lukas-Heindl avatar Mar 20 '25 17:03 Lukas-Heindl

whether this has any chance of being merged.

Sure!

@e3rd is the original author of the bot and likely a user of it as well so I assume he can give you some feedback on your questions.

sebix avatar Mar 20 '25 20:03 sebix

Thanks, I ll definitely take a look at it, thanks!

e3rd avatar Mar 20 '25 22:03 e3rd

The 2nd approach really seems great! You make the group-by field defaulted to source.abuse_contact to keep it backwards compatible, right?

I am really ashamed I slowed down your PR by three weeks! Looking forward to the PR, please tag me there so that I can review.

e3rd avatar Apr 07 '25 16:04 e3rd

All good.

Yes, the solution should be backwards compatible (also we need to force source.abuse_contact always being in the list of fields for which the events are grouped-by, since this field is always needed to send the mail so it has to be the same for all events). Then I'll go and try approach 2.

Probably this will be a while because of time constraints and I need to setup my development environment first (so I can test all this).

Lukas-Heindl avatar Apr 09 '25 14:04 Lukas-Heindl

Fingers crossed!

e3rd avatar Apr 09 '25 14:04 e3rd

If you don't mind, I'd suggest to add a configuration option too. What do you think?

include_into_body: Optional[int]
""" Insert the first N characters of the attachment to the body. """

If bigger than 0, the part of the attachment contents would be put into the body apart from being attached.

e3rd avatar Apr 09 '25 16:04 e3rd

Sounds interesting. Especially with the idea of introducing jinja2 to this bot (I know there already is the templated mail bot, but it does not allow for batching the events). As I think about your suggestion the open issue would be how to display the start for the attachment into the body. But this might also be solved with jinja.

Also I'm not decided on whether it should be something like, only include the start of the attachment if it is smaller than N (so it fully fits into the body) or unconditionally the first N chars of the attachment (acting as some sort of preview).

Lukas-Heindl avatar Apr 09 '25 16:04 Lukas-Heindl

Talking more generally about this bot, I am curious. Do you know why the attachment always is zipped? Is is common the attachment is that large and compression really brings a benefit (or is necessary due to the size and limitations in E-Mail servers rejecting large attachments)?

Lukas-Heindl avatar Apr 09 '25 16:04 Lukas-Heindl

how to display the start

just the plain text

full if small VS preview

I have no opinion, both fit :) The preview seems slightly better as it is consistent, but I don't know.
Also, it might be better to be the number of events, not number of chars.

Do you know why the attachment always is zipped?

Unfortunately, according to our experiences, mail servers are mad. Sometimes, 10 MB caused the e-mail to vanish if I remember well. To be consistent (so that the partners could automate), we've enforced the zip for all.

But if you don't like it, just add a flag for that.

e3rd avatar Apr 09 '25 16:04 e3rd

Adding onto the ideas regarding the smtp_batch bot:

  • According to my test intelmq.bots.outputs.smtp_batch.output --cli --send <bot-id> does not exit with exit-code != 0 if sending the E-Mail failed (e.g. wrong password) -> in my opinion this should be changed allowing the user to trigger an alert (to the server-admin) something in the operation of intelmq has gone wrong

Lukas-Heindl avatar Apr 17 '25 12:04 Lukas-Heindl

Great enhancement, of course!

e3rd avatar Apr 17 '25 13:04 e3rd

I am a bit confused https://github.com/certtools/intelmq/blob/ec37d0c5a651330faf568ce3f049a70ecfcef9d6/intelmq/bots/outputs/smtp_batch/output.py#L283-L286

https://github.com/certtools/intelmq/blob/ec37d0c5a651330faf568ce3f049a70ecfcef9d6/intelmq/bots/outputs/smtp_batch/output.py#L301-L304

@e3rd do you still remember what the idea here was? Why building the mail if count is None, particularly as the mail is not yielded anyhow? (can it even be None? https://docs.python.org/3/library/functions.html#len is not too specific about it) So the control flow looks a bit odd to me. Also do we really want to send mail in case there are no events captured?

(And if you already want to have a look, I'm drafting at https://github.com/Lukas-Heindl/intelmq/tree/smtpBatchGroup)

Lukas-Heindl avatar Apr 24 '25 16:04 Lukas-Heindl

Good point, a design flaw, get rid of it. The draft looks nice! :)

e3rd avatar Apr 24 '25 17:04 e3rd

Ok this is getting more like a (minor) refactoring of the bot.

I noticed the parameters allowed_fieldnames and fieldnames_translation are not documented yet. Writing the documentation for these parameters i noticed I need to include something like if you are inconsistent here, the bot will crash

Reason: https://github.com/certtools/intelmq/blob/38b9304e7e6d49ef2f2086af292aeda0e130c19d/intelmq/bots/outputs/smtp_batch/output.py#L273 will crash in case the ordered_keys (intersection of allowed_fieldnames and the keys defined in the event) contains a key which is not defined in fieldnames_translation.

Personally I'd be okay with this (shifting the responsibility to the user/admin to do the configuration right). Still I wanted to raise this for discussion whether the bot should be more robust in this regard. We could use dict.get to set a default (maybe the name of the key in allowed_fieldnames) when doing the translation.

Or we could remove the allowed_fieldnames option and use fieldnames_translation.keys() as a replacement. Though, just modifying the allowed_fieldnames option to remove a column from the csv without needing to toucht the translation might be an ease of use we want to keep.

Lukas-Heindl avatar Apr 28 '25 14:04 Lukas-Heindl

Except for the include_into_body / preview, I think everything discussed here is implemented in #2610

Lukas-Heindl avatar Apr 29 '25 15:04 Lukas-Heindl

Addition to the collectio of ideas:

  1. optionally use paging to retrieve the data from redis (I experienced some OOM with too less RAM+swap -- paging the data should mitigate a bit)
  2. same reason like for 1.: Currently we first collect all data for all mails just for the nice display what mails are in the pipeline. Maybe some way of "streaming" the data from redis directly to mail (so obtain all the data for one mail -> send the mail and loop this)

Lukas-Heindl avatar May 08 '25 17:05 Lukas-Heindl

(This would be a nice improvement however I imagine this would take a lot of effort.)

e3rd avatar Jul 17 '25 10:07 e3rd