[smtp_batch] Additional field for grouping
Hi,
I have a use case where I'd like to send batched E-Mails with collected events to someone, for which I want to use the smtp_batch output bot. In this case it would be nice to break down the events for one contact further into multiple E-Mails. Would it be possible to add an option so that the group-by is done not only by source.abuse_contact but also by a field determined by the bot configuration?
Eventually it would also be nice to use this additional group-by field in e.g. the subject, but that's just for the future.
I'm willing to submit this as a PR myself, I just wanted to create a place for input and, most importantly, to check beforehand whether this has any chance of being merged.
So far smtp_batch inserts each event as a value into redis with bot_id + source.abuse_contact as the key:
https://github.com/certtools/intelmq/blob/a5f2f9c3499a02b0ac9aaf38a4e8ca13fd5cf769/intelmq/bots/outputs/smtp_batch/output.py#L77-L78
It then later uses redis.keys() to retrieve all keys with a certain prefix (bot_id):
https://github.com/certtools/intelmq/blob/a5f2f9c3499a02b0ac9aaf38a4e8ca13fd5cf769/intelmq/bots/outputs/smtp_batch/output.py#L224
So the most straightforward way would be to simply extend the redis-key with the value of the field that should be included in the logical group-by. But we'd need to be careful: the retrieved redis-key (here mail_record) is still used to obtain the destination E-Mail address:
https://github.com/certtools/intelmq/blob/a5f2f9c3499a02b0ac9aaf38a4e8ca13fd5cf769/intelmq/bots/outputs/smtp_batch/output.py#L282
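For illustration, a rough model of the current scheme as I read it (function names are mine, not the bot's):

```python
def current_redis_key(bot_id: str, abuse_contact: str) -> str:
    # Approximation of the existing scheme: the key is the bot_id prefix
    # followed directly by the contact address.
    return f"{bot_id}{abuse_contact}"

def contact_from_key(bot_id: str, redis_key: str) -> str:
    # The bot later recovers the destination address by stripping the
    # prefix from the retrieved key -- this is the coupling to watch out for.
    return redis_key[len(bot_id):]
```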
In order to solve this I see two options:
- separate the values used to derive the redis-key with a special character like `|` -> in this case we'd need to make sure this character doesn't occur in any of the values -> replacing (but probably modifying the values is a bad idea) or escaping (a bit more complex) would be needed
- we could stop using the retrieved redis-key altogether. We know all retrieved events must have the same source.abuse_contact, so we could simply get the value from an arbitrary event we retrieved from redis (same thing if we later want to access the other fields included in the group-by). Decoupling the redis-key from the rest of the logic should also make it possible to simply hash the redis-key, which avoids all trouble with the redis-key getting much longer as more fields are involved in the group-by logic. In that case the redis-key would look like this: `f"{self.key}{hash(source.abuse_contact, other_field1, other_field2, ...)}"`
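A minimal sketch of the second option (helper name and prefix are mine; the real bot would use `self.key` as the prefix). Note that Python's built-in `hash()` is randomized per process, so a stable digest like SHA-256 would be needed for keys that must survive restarts:

```python
import hashlib

def make_redis_key(prefix: str, event: dict, group_by_fields: list) -> str:
    # Join the group-by values with NUL before hashing; the separator
    # never reaches redis, so no escaping of the values is needed.
    material = "\x00".join(str(event.get(f, "")) for f in group_by_fields)
    # A stable digest keeps the key at a fixed length regardless of how
    # many fields take part in the group-by.
    digest = hashlib.sha256(material.encode("utf-8")).hexdigest()
    return f"{prefix}{digest}"
```

Since the hash is one-way, the contact address (and the other group-by values) would then be read from one of the stored events, as described above.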
As I mentioned above I've got two questions:
- are you generally open to such additions to the smtp_batch bot?
- if so, I'm happy about any input you have regarding my sketch/draft of an implementation (I think I'd probably go for the 2nd approach described above)
> whether this has any chance of being merged.
Sure!
@e3rd is the original author of the bot and likely a user of it as well so I assume he can give you some feedback on your questions.
Thanks, I'll definitely take a look at it!
The 2nd approach really seems great! You'd make the group-by field default to source.abuse_contact to keep it backwards compatible, right?
I am really ashamed I slowed down your PR by three weeks! Looking forward to the PR, please tag me there so that I can review.
All good.
Yes, the solution should be backwards compatible (also, we need to ensure source.abuse_contact is always in the list of fields the events are grouped by, since this field is always needed to send the mail and thus has to be the same for all events).
Then I'll go and try approach 2.
Probably this will take a while because of time constraints, and I need to set up my development environment first (so I can test all this).
Fingers crossed!
If you don't mind, I'd suggest adding a configuration option too. What do you think?
```python
include_into_body: Optional[int]
""" Insert the first N characters of the attachment into the body. """
```

If bigger than 0, that part of the attachment contents would be put into the body in addition to being attached.
Sounds interesting, especially with the idea of introducing jinja2 to this bot (I know there already is the templated mail bot, but it does not allow for batching the events). Thinking about your suggestion, the open issue would be how to display the start of the attachment in the body. But this might also be solved with jinja.
Also I'm undecided whether it should only include the start of the attachment if it is smaller than N (so it fully fits into the body), or unconditionally include the first N chars of the attachment (acting as some sort of preview).
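For what it's worth, the "preview" variant could be as small as this (hypothetical helper, semantics as discussed: 0 disables the feature; the truncation marker is just my placeholder):

```python
def attachment_preview(attachment_text: str, include_into_body: int) -> str:
    # Unconditionally take the first N characters as a preview.
    if include_into_body <= 0:
        return ""
    preview = attachment_text[:include_into_body]
    if len(attachment_text) > include_into_body:
        # Signal to the reader that the body only shows a part.
        preview += "\n[... truncated, see attachment for full data]"
    return preview
```

The "only if it fully fits" variant would instead return an empty string whenever the attachment is longer than N.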
Talking more generally about this bot, I am curious: do you know why the attachment is always zipped? Is it common that the attachment is so large that compression really brings a benefit (or is even necessary due to E-Mail servers rejecting large attachments)?
> how to display the start
just the plain text
> full if small VS preview
I have no opinion, both fit :) The preview seems slightly better as it is consistent, but I don't know.
Also, it might be better to be the number of events, not number of chars.
> Do you know why the attachment always is zipped?
Unfortunately, according to our experience, mail servers are mad. Sometimes 10 MB caused the e-mail to vanish, if I remember correctly. To be consistent (so that the partners could automate), we've enforced the zip for all.
But if you don't like it, just add a flag for that.
Adding onto the ideas regarding the smtp_batch bot:
- According to my test, `intelmq.bots.outputs.smtp_batch.output --cli --send <bot-id>` does not exit with an exit code != 0 if sending the E-Mail failed (e.g. wrong password) -> in my opinion this should be changed, allowing the user to trigger an alert (to the server admin) when something in the operation of intelmq has gone wrong
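A sketch of what I mean for the CLI path (the `send_mails` callable is a stand-in for the bot's actual send logic, not an existing function):

```python
import sys

def cli_send(send_mails) -> None:
    # Propagate send failures to the shell: a failure (e.g. SMTP auth
    # error) must surface as a non-zero exit status so that cron or
    # monitoring can alert the admin.
    try:
        send_mails()
    except Exception as exc:
        print(f"sending failed: {exc}", file=sys.stderr)
        sys.exit(1)
    sys.exit(0)
```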
Great enhancement, of course!
I am a bit confused by https://github.com/certtools/intelmq/blob/ec37d0c5a651330faf568ce3f049a70ecfcef9d6/intelmq/bots/outputs/smtp_batch/output.py#L283-L286
https://github.com/certtools/intelmq/blob/ec37d0c5a651330faf568ce3f049a70ecfcef9d6/intelmq/bots/outputs/smtp_batch/output.py#L301-L304
@e3rd do you still remember what the idea here was? Why build the mail if count is None, particularly as the mail is not yielded anyway? (Can it even be None? https://docs.python.org/3/library/functions.html#len is not too specific about it.) So the control flow looks a bit odd to me.
Also, do we really want to send a mail in case no events were captured?
(And if you already want to have a look, I'm drafting at https://github.com/Lukas-Heindl/intelmq/tree/smtpBatchGroup)
Good point, a design flaw, get rid of it. The draft looks nice! :)
Ok this is getting more like a (minor) refactoring of the bot.
I noticed the parameters allowed_fieldnames and fieldnames_translation are not documented yet. Writing the documentation for these parameters, I noticed I'd need to include something like "if you are inconsistent here, the bot will crash".
Reason:
https://github.com/certtools/intelmq/blob/38b9304e7e6d49ef2f2086af292aeda0e130c19d/intelmq/bots/outputs/smtp_batch/output.py#L273
will crash in case ordered_keys (the intersection of allowed_fieldnames and the keys defined in the event) contains a key which is not defined in fieldnames_translation.
Personally I'd be okay with this (shifting the responsibility to the user/admin to get the configuration right). Still, I wanted to raise this for discussion: should the bot be more robust in this regard? We could use dict.get to set a default (maybe the name of the key in allowed_fieldnames) when doing the translation.
Or we could remove the allowed_fieldnames option and use fieldnames_translation.keys() as a replacement. Though just modifying the allowed_fieldnames option to remove a column from the csv, without needing to touch the translation, might be an ease-of-use feature we want to keep.
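The tolerant dict.get variant would look like this (illustrative, not the bot's current code; the raw field name serves as the fallback header):

```python
def translate_fieldnames(ordered_keys, fieldnames_translation):
    # Fall back to the raw field name when the admin forgot a translation
    # entry, instead of raising KeyError as the current code would.
    return [fieldnames_translation.get(key, key) for key in ordered_keys]
```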
Except for the include_into_body / preview, I think everything discussed here is implemented in #2610
Addition to the collection of ideas:
- optionally use paging to retrieve the data from redis (I experienced some OOM with too little RAM+swap -- paging the data should mitigate this a bit)
- same reason as for 1.: currently we first collect all data for all mails just for the nice display of which mails are in the pipeline. Maybe there's some way of "streaming" the data from redis directly into mails (obtain all the data for one mail -> send the mail and loop)
(This would be a nice improvement, however I imagine it would take a lot of effort.)
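For the paging idea, a sketch using SCAN instead of KEYS (`client` is anything exposing redis-py's `scan(cursor, match, count)` signature; the generator shape is mine):

```python
def iter_keys(client, prefix: str, page_size: int = 500):
    # SCAN walks the keyspace incrementally (count is only a hint), so
    # the bot never has to materialize every key at once; the events for
    # one mail can then be fetched and sent before moving to the next.
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=prefix + "*",
                                   count=page_size)
        yield from keys
        # SCAN signals completion by returning cursor 0.
        if cursor == 0:
            break
```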