intelmq (intelmq-ai) improve parser

[ ] improve parser, make it conform to the coding standards
[ ] add type checks (pydantic?) for all fields returned by the LLM (possibly re-run?). Or do it on the function-calling side...
[ ] remove main
[ ] add test data
[ ] add unit tests (with repeatable seed and temperature=0)
[ ] document it
[ ] merge it

Oct 23 '25 19:10 aaronkaplan

As we have seen in the research and in the presentation, the data that the LLMs assign to an IDF field is not always correct. For example, a file path was added to malware.hash.md5. To mitigate this, we should add more checks for the hashes (in harmonization.conf). Maybe a regex is sufficient; otherwise we need to add a data type, which could also sanitize the hashes.
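
A minimal sketch of what such a regex check could look like, written as standalone Python rather than as an actual harmonization.conf entry. The `HASH_PATTERNS` table and the `is_plausible_hash` helper are hypothetical names for illustration, not existing IntelMQ code; the patterns only assume that the digests are plain hex strings of the usual lengths.

```python
import re

# Hypothetical patterns: hex digests of the usual fixed lengths.
HASH_PATTERNS = {
    "malware.hash.md5": re.compile(r"[0-9a-fA-F]{32}"),
    "malware.hash.sha1": re.compile(r"[0-9a-fA-F]{40}"),
    "malware.hash.sha256": re.compile(r"[0-9a-fA-F]{64}"),
}


def is_plausible_hash(field: str, value: str) -> bool:
    """Return True if value looks like a digest of the type implied by field.

    Fields without a known pattern are not checked here and pass through.
    """
    pattern = HASH_PATTERNS.get(field)
    if pattern is None:
        return True
    # fullmatch rejects values that merely contain a digest somewhere.
    return pattern.fullmatch(value.strip()) is not None
```

With this, the file path from the example above would be rejected for malware.hash.md5, while a 32-character hex string would pass.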

Another sanity check that we can easily add to the parser is to search for the extracted values (e.g. a hash) in the input. If a value cannot be found in the source, the parser should not accept the resulting event.
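
Roughly, this check could be sketched as follows. The function name `missing_from_source` and the idea of passing an explicit list of fields to check are assumptions for illustration; fields whose values are derived rather than copied (e.g. a classification) would not appear verbatim in the input and should be left out of that list.

```python
def missing_from_source(event: dict, source_text: str, fields: list) -> list:
    """Return the checked fields whose value does not occur in source_text.

    The comparison is case-insensitive, since hashes may be written in
    upper or lower case in the original report.
    """
    haystack = source_text.casefold()
    missing = []
    for field in fields:
        value = event.get(field)
        if value is None:
            continue  # field not set by the LLM, nothing to verify
        if str(value).casefold() not in haystack:
            missing.append(field)
    return missing


# Example: the SHA-1 below never appears in the report text.
report = "Sample with MD5 D41D8CD98F00B204E9800998ECF8427E seen at /tmp/evil.exe"
event = {
    "malware.hash.md5": "d41d8cd98f00b204e9800998ecf8427e",
    "malware.hash.sha1": "da39a3ee5e6b4b0d3255bfef95601890afd80709",
}
suspicious = missing_from_source(
    event, report, ["malware.hash.md5", "malware.hash.sha1"]
)
```

The parser could then drop the event (or trigger a re-run) whenever the returned list is non-empty.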

Oct 24 '25 07:10 sebix

This could / should maybe be done as a function call. @Brandl, what do you think? @sebix I added it to the checklist above.

Oct 24 '25 07:10 aaronkaplan

Isn't it sufficient to extend the existing output type specification that uses the pydantic base model?

Oct 24 '25 07:10 sebix

Is that "string"? String would be too broad... Maybe, yeah, that could work as well. Let's check with Brandl.

Oct 24 '25 07:10 aaronkaplan

The model does all kinds of checks using the IntelMQ types. See intelmq/lib/harmonization.py for the validation and sanitation functions and intelmq/etc/harmonization.conf for the mapping of fields to types, including some restrictions like string length and regex.

Oct 24 '25 07:10 sebix