malheur
Inconsistent results with increment mode
Whenever I run Malheur in increment mode and feed it one report at a time, I don't get any results, forcing me to rerun it against all previous reports. This adds noticeable processing overhead that shouldn't be necessary, I'd imagine. Using increment mode with a batch of reports seems to produce the desired results, but I can't do this automatically because our system (Cuckoo) emits reports one by one. Would there be any chance of supporting single reports in increment mode?
There is a short and a TLDR-answer to your question ;). Theoretically, it is possible to adapt Malheur to support single reports in incremental mode; however, this setup currently clashes with the design of the tool.
Malheur is designed to alternate between classification (assigning malware to known clusters) and clustering (grouping the remaining malware to new clusters) in incremental mode. If only a single report is provided, the tool will assign it to a known cluster or try to group a single report into a new cluster. Obviously, the last case doesn't make much sense. However, Malheur keeps track of so-called rejected clusters, that is, clusters which are too small to provide reasonable results. The one-report cluster will be rejected in the first run and only come back into the analysis if a sufficient amount of similar reports is added.
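To make the alternation described above concrete, here is a minimal sketch of one incremental step. It is not Malheur's actual implementation: the Jaccard distance, the `MAX_DIST` and `MIN_CLUSTER_SIZE` values, and all names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

MAX_DIST = 0.5          # hypothetical threshold for "close enough"
MIN_CLUSTER_SIZE = 3    # hypothetical minimum size before a cluster is accepted


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Distance between two feature sets (0 = identical, 1 = disjoint)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


@dataclass
class IncrementalState:
    prototypes: dict = field(default_factory=dict)  # cluster name -> feature set
    rejected: list = field(default_factory=list)    # reports waiting for similar ones

    def add(self, report: frozenset) -> str:
        # Classification: assign to the nearest known cluster if it is close enough.
        if self.prototypes:
            name, proto = min(self.prototypes.items(),
                              key=lambda kv: jaccard_distance(report, kv[1]))
            if jaccard_distance(report, proto) <= MAX_DIST:
                return name
        # Clustering: try to group the report with similar rejected reports.
        group = [r for r in self.rejected
                 if jaccard_distance(report, r) <= MAX_DIST] + [report]
        if len(group) < MIN_CLUSTER_SIZE:
            self.rejected.append(report)  # too small: keep it for later runs
            return "rejected"
        # Enough similar reports have accumulated: promote them to a new cluster.
        self.rejected = [r for r in self.rejected if r not in group]
        name = f"cluster-{len(self.prototypes) + 1}"
        self.prototypes[name] = report
        return name
```

With this sketch, a single unfamiliar report always comes back as "rejected" on the first call, which mirrors the behaviour reported at the top of the thread; it only joins a cluster once enough similar reports have been added.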
To make a long story short, I think it is possible to add an "online" mode which adds individual reports to the tool, but the output will be different from feeding batches of reports to the tool. My feeling is that with infinite data both approaches will converge to the same clustering, but this will take very long ;)
Should we try to construct an online mode for Malheur? I could have a look at it, if you are okay with the implications.
There are several people who'd be interested in that functionality, I think. :)
Another question then -- would it be possible to have a sort of hybrid? E.g., for our use case we could run Malheur regularly (non-incremental mode) for, say, the first 5000/10000 analyses without much of a performance hit, then switch over to the 'online' mode so that it has a baseline to work from?
EDIT: To clarify, this 'hybrid' mode would be on our end. Cuckoo would run X samples normally and then, after hitting a threshold, switch to 'online' mode. I'm mainly curious whether this 'baseline' technique would actually add any value.
I have to think about an online mode. A hybrid solution would be an option, but then you could also collect samples in batches and send them out irregularly. I'll have a look at the code in the next two weeks.
EDIT: So, the idea would be to put a sample in the sandbox, run Malheur over its report immediately and get a decision like cluster X or unknown, right? What kind of format would be supplied to Malheur? XML? How would the results be interpreted? Malheur sends out plain text.
That's correct! We use it as a 'reporting' module in Cuckoo: Malheur generates a report based on Cuckoo's other reports (JSON), so that each analysis can be compared with samples previously analyzed by Cuckoo to find similarly behaving ones.
The input is pure JSON, which is converted into MIST before Malheur processes it. During this conversion we simplify the data set. I can provide example JSON reports or example Malheur output if need be. Here's the information we collect, though:
https://github.com/brad-accuvant/cuckoo-modified/blob/master/modules/reporting/malheur.py#L52
Hello Brad, excellent work on your augmented Cuckoo; I promise I will test it tomorrow. I also wrote a MIST converter for the standard Cuckoo reports a while ago, and my approach (without modifying Malheur) was to determine the cluster directly, with the problem that, as Rieck pointed out, you may keep ending up with isolated clusters. The unorthodox solution I used in those cases was to find the next closest cluster and mark the prediction as low confidence, while still accumulating each new sample in a queue of fixed length, then running the batch training mode and flushing the queue (I also tried a variant with a rolling window instead of a queue). That worked well for me, but it is not a clean solution.
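The queue-and-flush workaround described above could look roughly like the following. This is only a sketch under stated assumptions: `QUEUE_LENGTH`, `BatchingQueue`, `predict`, and the `run_batch` callback are hypothetical names, and the callback is where a batch run of Malheur would be triggered.

```python
from typing import Callable, Dict, List, Tuple

QUEUE_LENGTH = 4  # hypothetical; the thread does not give a value


class BatchingQueue:
    """Collects single reports and triggers one batch run when the queue is full."""

    def __init__(self, run_batch: Callable[[List[str]], None],
                 length: int = QUEUE_LENGTH):
        self.run_batch = run_batch  # e.g. a wrapper that runs Malheur on a batch
        self.length = length
        self.pending: List[str] = []

    def push(self, report_path: str) -> bool:
        """Queue one report; return True if this push triggered a batch run."""
        self.pending.append(report_path)
        if len(self.pending) < self.length:
            return False
        self.run_batch(list(self.pending))
        self.pending.clear()  # flush after the batch run
        return True


def predict(distances: Dict[str, float], max_dist: float) -> Tuple[str, str]:
    """Pick the closest cluster; flag it as low confidence if it is too far away."""
    name = min(distances, key=distances.get)
    confidence = "high" if distances[name] <= max_dist else "low"
    return name, confidence
```

Each new report gets an immediate answer from `predict`, while `push` defers the expensive batch run until enough reports have accumulated.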
Another improvement, unrelated to this issue, would be to adopt a NoSQL database instead of the text-based approach.
Brad, it would also be helpful, as you mentioned, to have your dataset (ideally ordered by execution date) so that we can simulate the whole process instead of running the samples each time.
Yes, it would be great if you could provide a small set of ordered samples.
The actual code for the Cuckoo-to-MIST conversion was merged into this tool: http://sourceforge.net/projects/cuckoo2mist/
I've just sent you both some sanitized MIST reports to use for testing purposes. The filename of each is the analysis ID from the Cuckoo instance, so you should be able to reproduce it based on that and the file timestamps.
Thanks! -Brad
Thanks a lot. I'll have a look at it.
Dear @brad-accuvant and @robomotic
first of all, thanks for providing the files and implementing a conversion to MIST. I really like the short naming of events, e.g. file drop and reg access.
However, I noticed something strange in the provided reports: all events seem to be sorted. This somewhat contradicts the main idea behind Malheur, which is to analyze short sequences (n-grams) of observed events.
- Is this a limitation of Cuckoo sandbox?
- @robomotic how does your MIST conversion proceed?
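To illustrate why event order matters for n-grams, here is a small sketch. The event names are made up for the example and are not taken from the actual reports.

```python
from collections import Counter


def ngrams(events, n=2):
    """Count the n-grams (consecutive runs of n events) in a sequence."""
    return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))


# Hypothetical event sequence in the order it was observed.
observed = ["file_read", "reg_read", "file_write", "reg_write"]
# The same events after sorting, which groups them by name.
grouped = sorted(observed)

shared = set(ngrams(observed)) & set(ngrams(grouped))
print(ngrams(observed))
print(ngrams(grouped))
print("2-grams in common:", len(shared))
```

The two sequences contain exactly the same events, yet their 2-gram histograms differ, so reports that were reordered before conversion no longer carry the sequence information the n-gram analysis relies on.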
It's actually generated from the summary results instead of the API logs. The summary results are a bit more stable, let us pull in more information easily, reduce the size of the reports while retaining critical info, and automatically eliminate the duplicates or near-duplicates that would appear if we generated reports directly from API calls. The file/registry accesses aren't actually sorted; what you're seeing in those reports is each category of summary results presented in turn. Within a category, files and registry keys are listed in order of access (or modification, or read, etc.), as are resolved hostnames and connected IP addresses (I think). The same goes for APIs resolved at runtime.
Hello, I had a look at your MIST reports, but I am a little concerned about a few issues:
- the behavioural events appear to be ordered by category rather than by time, i.e. all the file accesses are followed by all the file writes, then the registry events, and so on; this will create a false n-gram histogram
- you have also included static features that are not behavioural, such as the antivirus signatures and the PE static analysis (icon, sections, imphash); these will bias the clusterer
My suggestion is to preserve the order from the summary and remove the static analysis and pure signatures. You can use those other predictions/signatures in a stacked classifier or with AdaBoost, but not directly in Malheur.
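A minimal sketch of that suggestion, assuming events arrive as (category, value) pairs in their original order. The category names below are hypothetical and may not match what the converter actually emits.

```python
# Hypothetical category names; the real MIST converter may label these differently.
STATIC_CATEGORIES = {"av_signature", "pe_section", "pe_icon", "imphash"}


def behavioural_only(events):
    """Drop static features while keeping the remaining events in their original order."""
    return [(cat, val) for cat, val in events if cat not in STATIC_CATEGORIES]


# Placeholder values, not real report data.
events = [
    ("file_read", "a.dll"),
    ("imphash", "<hash>"),
    ("reg_write", "Run"),
    ("av_signature", "<name>"),
    ("file_write", "b.exe"),
]
filtered = behavioural_only(events)
```

The static features that were filtered out could then be fed to a separate classifier, as suggested above.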
Thanks! I'll look into implementing those changes.
-Brad
Ahah wow ok ! :-)
On 25 Aug 2015, at 18:09, Brad Spengler wrote:
Thanks! I'll look into implementing those changes.
-Brad