(intelmq-ai) improve parser

Open aaronkaplan opened this issue 2 months ago • 5 comments

  • [ ] improve parser, make it conform to the coding standards
  • [ ] add type checks (pydantic?) for all fields returned by the LLM (possibly re-run on failure?). Or do it on the function calling side...
  • [ ] remove main
  • [ ] add test data
  • [ ] add unit tests (with repeatable seed and temperature=0)
  • [ ] document it
  • [ ] merge it

aaronkaplan avatar Oct 23 '25 19:10 aaronkaplan

As we have seen in the research and in the presentation, the data that the LLMs assign to an IDF field is not always correct. For example, a file path was assigned to malware.hash.md5. To mitigate this, we should add more checks for the hashes (in harmonization.conf). Maybe a regex is sufficient; otherwise we need to add a datatype, which could also sanitize the hashes.
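To illustrate the regex idea, here is a minimal standalone sketch. It is not IntelMQ code: the function name `sanitize_md5` is hypothetical, and it assumes an MD5 value is exactly 32 hexadecimal characters and that lower-casing is the desired normalisation.

```python
import re

# Assumption: an MD5 digest is exactly 32 hexadecimal characters.
MD5_RE = re.compile(r"[0-9a-fA-F]{32}")


def sanitize_md5(value):
    """Return the lower-cased hash, or None if value is not a plausible MD5.

    Hypothetical helper for illustration only, not part of IntelMQ.
    """
    if not isinstance(value, str):
        return None
    candidate = value.strip()
    # fullmatch rejects anything with extra characters, e.g. a file path.
    if MD5_RE.fullmatch(candidate):
        return candidate.lower()
    return None
```

With this, a file path such as `/tmp/evil.exe` is rejected, while a well-formed hash is accepted and normalised. The same pattern could probably be expressed as a regex restriction in harmonization.conf instead of code.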

Another sanity check that we can easily add to the parser is to search the input for the extracted values (e.g. a hash). If a value cannot be found in the source, the parser should not accept the resulting event.
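A rough sketch of that check, assuming the event is available as a plain dict and the raw report as a string. The function name `values_missing_from_source` and the `fields` parameter are hypothetical, and the comparison is a simple case-insensitive substring match.

```python
def values_missing_from_source(event, source_text, fields=None):
    """Return the event keys whose values do not occur in source_text.

    Hypothetical helper for illustration only. `fields` limits the check
    to keys that are expected to appear literally in the input; derived
    fields (e.g. a classification) would otherwise be flagged.
    """
    haystack = source_text.lower()
    keys = event.keys() if fields is None else fields
    missing = []
    for key in keys:
        if key not in event:
            continue
        # Case-insensitive match, so upper-case hashes in the report still count.
        if str(event[key]).lower() not in haystack:
            missing.append(key)
    return missing
```

The parser could then drop an event, or re-run the LLM, whenever the returned list is not empty. Values that the LLM legitimately normalises (for example defanged URLs) would need extra handling that this sketch does not cover.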

sebix avatar Oct 24 '25 07:10 sebix

This could, or maybe should, be done as a function call. @Brandl, what do you think? @sebix, I added it to the checklist above.

aaronkaplan avatar Oct 24 '25 07:10 aaronkaplan

Isn't it sufficient to extend the existing output type specification that uses the pydantic base model?

sebix avatar Oct 24 '25 07:10 sebix

Is that "string"? String would be too broad... Maybe, yeah, that could work as well. Let's check with Brandl.

aaronkaplan avatar Oct 24 '25 07:10 aaronkaplan

The model does all kinds of checks using the IntelMQ types. See intelmq/lib/harmonization.py for the validation and sanitation functions and intelmq/etc/harmonization.conf for the mapping of fields to types, including some restrictions like string length and regex.

sebix avatar Oct 24 '25 07:10 sebix