(intelmq-ai) improve parser
- [ ] improve parser, make it conform to the coding standards
- [ ] add type checks (pydantic?) for all fields returned by the LLM (possibly re-run?), or do it on the function-calling side
- [ ] remove main
- [ ] add test data
- [ ] add unit tests (with repeatable seed and temperature=0)
- [ ] document it
- [ ] merge it
As we have seen in the research and in the presentation, the data that the LLMs assign to an IDF field is not always correct. For example, a file path was assigned to malware.hash.md5. To mitigate this, we should add more checks for the hashes (in harmonization.conf). Maybe a regex is sufficient; otherwise we need to add a data type, which could also sanitize the hashes.
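To illustrate the regex idea, here is a minimal sketch of an MD5 format check in plain Python. The names `MD5_PATTERN` and `is_plausible_md5` are hypothetical and not part of IntelMQ. How such a pattern would be wired into harmonization.conf is left open here.

```python
import re

# Hypothetical pattern: an MD5 digest is exactly 32 hexadecimal characters.
MD5_PATTERN = re.compile(r"[0-9a-fA-F]{32}")


def is_plausible_md5(value: object) -> bool:
    """Return True only if value is a string that looks like an MD5 digest."""
    # fullmatch rejects trailing newlines and surrounding text,
    # which a search() or a bare `$` anchor could let through.
    return isinstance(value, str) and MD5_PATTERN.fullmatch(value) is not None
```

A file path such as `/tmp/sample.exe` fails this check, so the case described above would be caught. The check only validates the format; it cannot tell whether the hash actually belongs to the reported sample.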
Another sanity check that we can easily add to the parser is to search for the extracted values (e.g. a hash) in the input. If a value cannot be found in the source, the parser should not accept the resulting event.
This could / should maybe be done as a function call. @Brandl, what do you think? @sebix I added it to the checklist above
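The source-lookup check could look roughly like the sketch below. The function name `fields_missing_from_source` and the `checked_fields` parameter are assumptions for illustration, not existing parser code. The comparison is case-insensitive so that a hash the LLM lower-cased still matches.

```python
from typing import Iterable, List, Mapping


def fields_missing_from_source(
    event: Mapping[str, object],
    raw_input: str,
    checked_fields: Iterable[str],
) -> List[str]:
    """Return the checked fields whose values do not occur verbatim in the input."""
    haystack = raw_input.lower()
    missing = []
    for field in checked_fields:
        if field not in event:
            continue  # nothing extracted for this field, nothing to verify
        if str(event[field]).lower() not in haystack:
            missing.append(field)
    return missing


raw = "Sample dropped at /tmp/x.bin, md5 D41D8CD98F00B204E9800998ECF8427E"
event = {
    "malware.hash.md5": "d41d8cd98f00b204e9800998ecf8427e",
    "source.ip": "192.0.2.1",  # not present in the input text
}
missing = fields_missing_from_source(event, raw, ["malware.hash.md5", "source.ip"])
# missing == ["source.ip"], so the parser would reject this event
```

Only fields that are expected to appear literally in the input (hashes, IP addresses, URLs) should be listed in `checked_fields`. Values the LLM legitimately derives or normalizes would otherwise cause false rejections.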
Isn't it sufficient to extend the existing output type specification that uses the pydantic base model?
Is that "string"? String would be too broad. ... Maybe, yeah, that could work as well. Let's check with Brandl.
The model does all kinds of checks using the IntelMQ types. See intelmq/lib/harmonization.py for the validation and sanitation functions and intelmq/etc/harmonization.conf for the mapping of fields to types, including some restrictions like string length and regex.