malheur Inconsistent results with increment mode

Whenever I run Malheur in increment mode and feed it one report at a time, I don't get any results, which forces me to rerun it against all previous reports. That adds noticeable processing overhead that I'd imagine shouldn't be necessary. Using increment mode with a batch of reports seems to produce the desired results, but I can't do this automatically because our system (Cuckoo) feeds reports out one by one. Would there be any chance of supporting single reports in increment mode?

Aug 17 '15 13:08 KillerInstinct

There is a short and a TLDR-answer to your question ;). Theoretically, it is possible to adapt Malheur to support single reports in incremental mode; however, this setup currently clashes with the design of the tool.

Malheur is designed to alternate between classification (assigning malware to known clusters) and clustering (grouping the remaining malware into new clusters) in incremental mode. If only a single report is provided, the tool will either assign it to a known cluster or try to group that single report into a new cluster. Obviously, the latter case doesn't make much sense. However, Malheur keeps track of so-called rejected clusters, that is, clusters which are too small to provide reasonable results. The one-report cluster will be rejected in the first run and only come back into the analysis once a sufficient number of similar reports has been added.
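The alternation described above can be illustrated with a minimal Python sketch. This is not Malheur's actual implementation: the Jaccard distance, the greedy grouping, and the names `incremental_step`, `max_dist`, and `min_size` are all assumptions chosen for illustration. It only shows why a lone report ends up rejected and reappears once a similar report arrives.

```python
def jaccard_distance(a, b):
    """Distance between two feature sets (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def incremental_step(new_reports, prototypes, rejected, max_dist=0.5, min_size=2):
    """One hypothetical incremental run: classify first, then cluster the rest.

    new_reports, rejected: dict mapping report name -> set of features
    prototypes: dict mapping cluster id -> feature set (updated in place)
    Returns (assigned, still_rejected).
    """
    pool = {**rejected, **new_reports}  # earlier rejects get another chance
    assigned, leftover = {}, {}

    # Classification: assign each report to the nearest known cluster, if close enough.
    for name, feats in pool.items():
        best = min(prototypes,
                   key=lambda c: jaccard_distance(feats, prototypes[c]),
                   default=None)
        if best is not None and jaccard_distance(feats, prototypes[best]) <= max_dist:
            assigned[name] = best
        else:
            leftover[name] = feats

    # Clustering: greedily group the remaining reports around a leader report.
    groups = []  # list of (leader_features, member_names)
    for name, feats in leftover.items():
        for leader, members in groups:
            if jaccard_distance(feats, leader) <= max_dist:
                members.append(name)
                break
        else:
            groups.append((feats, [name]))

    # Groups that are too small are rejected and kept for the next run.
    still_rejected = {}
    for leader, members in groups:
        if len(members) < min_size:
            for name in members:
                still_rejected[name] = leftover[name]
        else:
            cluster_id = f"cluster-{len(prototypes) + 1}"
            prototypes[cluster_id] = leader
            for name in members:
                assigned[name] = cluster_id
    return assigned, still_rejected
```

With these assumptions, feeding a single report `{"a", "b", "c"}` into an empty model returns no assignment and one rejected report. Feeding a second report `{"a", "b", "c", "d"}` on the next run brings the first one back and both land in a new cluster.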

To make a long story short, I think it is possible to add an "online" mode which adds individual reports to the tool, but the output will be different from feeding batches of reports to the tool. My feeling is that with infinite data both approaches will converge to the same clustering, but this will take very long ;)

Should we try to construct an online mode for Malheur? I could have a look at it, if you are okay with the implications.

Aug 17 '15 20:08 rieck

There are several people who'd be interested in that functionality, I think. :)

Another question then -- would it be possible to have a sort-of hybrid? E.g., for our use case we could run Malheur regularly (non-incremental mode) for, say, the first 5000/10000 analyses without much of a performance hit, and then switch over to the 'online' mode so that it has a baseline to work from?

EDIT: To clarify, this 'hybrid' mode would be on our end. So Cuckoo would run X samples normally and then, after hitting a threshold, switch to 'online' mode. I'm mainly curious whether this 'baseline' technique would actually add any value.

Aug 20 '15 11:08 KillerInstinct

I have to think about an online mode. A hybrid solution would be an option, but then you could also collect samples in batches and send them out irregularly. I'll have a look at the code in the next two weeks.

EDIT: So, the idea would be to put a sample in the sandbox, run Malheur over its report immediately and get a decision like cluster X or unknown, right? What kind of format would be supplied to Malheur? XML? How would the results be interpreted? Malheur sends out plain text.

Aug 22 '15 09:08 rieck

That's correct! We use it as a 'reporting' module in Cuckoo -- that is, Malheur generates a report based on other reports (JSON) so that a sample can be compared with previous samples analyzed by Cuckoo to find similarly behaving analyses.

The module receives pure JSON, which is converted into MIST for Malheur. During this conversion process we simplify the set of data. I can provide examples of JSON reports, or example output from Malheur, if need be. Here's the information we collect, though:

https://github.com/brad-accuvant/cuckoo-modified/blob/master/modules/reporting/malheur.py#L52

Aug 23 '15 01:08 KillerInstinct

Hello Brad, excellent work on your augmented Cuckoo; I promise I will test it tomorrow. I also previously wrote a MIST converter for the standard Cuckoo reports. My approach (without modifying Malheur) was to determine the cluster for each new report, with the problem that, as Rieck pointed out, you may always end up with isolated clusters. The unorthodox solution I used in those cases was to find the next-closest cluster and mark the prediction as low confidence, while still accumulating each new sample in a fixed-length queue, then running the batch training mode and flushing the queue. (I also tried another variant with a rolling window instead of a queue.) That worked well for me, but it is not a clean solution.

Another improvement, unrelated to this issue, would be to adopt a non-SQL database instead of the text-based approach.
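The queue-and-flush workaround and its rolling-window variant might look roughly like the Python sketch below. The class names `BatchQueue` and `RollingWindow` and the `run_batch` callable are hypothetical; `run_batch` stands in for whatever invokes the batch analysis and is not part of Malheur or Cuckoo.

```python
from collections import deque


class BatchQueue:
    """Collects single reports and releases them as fixed-size batches."""

    def __init__(self, capacity, run_batch):
        self.capacity = capacity
        self.run_batch = run_batch  # hypothetical callable that analyzes a list of reports
        self.pending = []

    def add(self, report):
        """Queue one report; run the batch and flush once the queue is full."""
        self.pending.append(report)
        if len(self.pending) < self.capacity:
            return None  # not enough reports yet, nothing to run
        batch, self.pending = self.pending, []
        return self.run_batch(batch)


class RollingWindow:
    """Variant: re-run the analysis over the most recent `size` reports."""

    def __init__(self, size, run_batch):
        self.window = deque(maxlen=size)
        self.run_batch = run_batch

    def add(self, report):
        self.window.append(report)  # the oldest report drops out automatically
        return self.run_batch(list(self.window))
```

The queue trades latency for fewer runs, since a report only gets a batch result once the queue fills. The rolling window gives a result on every report but repeats work on the overlapping reports.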

Aug 23 '15 22:08 robomotic

Brad, as you mentioned, it would also be helpful to have your dataset (ideally ordered by execution date) so that we can simulate the whole process instead of running the samples each time.

Aug 23 '15 22:08 robomotic

Yes, it would be great if you could provide a small set of ordered samples.

Aug 24 '15 08:08 rieck

The actual code for the Cuckoo-to-MIST conversion was merged into this tool: http://sourceforge.net/projects/cuckoo2mist/

Aug 24 '15 09:08 robomotic

I've just sent you both some sanitized MIST reports to use for testing purposes. The filename of each is the analysis ID from the Cuckoo instance, so you should be able to reproduce it based on that and the file timestamps.

Thanks! -Brad

Aug 24 '15 14:08 brad-sp

Thanks a lot. I'll have a look at it.

Aug 24 '15 14:08 rieck

Dear @brad-accuvant and @robomotic,

First of all, thanks for providing the files and implementing a conversion to MIST. I really like the short naming of events, e.g. file drop and reg access.

However, I noticed something strange with the provided reports. It seems that all events are sorted. This somewhat contradicts the main idea behind Malheur, which is to analyze small sequences (n-grams) of observed events.
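To make the concern concrete, here is a small Python sketch of how reordering events changes the n-grams extracted from a report. The event names are made up for illustration and are not real MIST instructions; `ngrams` is a hypothetical helper, not a Malheur function.

```python
from collections import Counter


def ngrams(events, n=2):
    """Count every run of n consecutive events in the sequence."""
    return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))


# Hypothetical event names, in the order they were observed at runtime.
observed = ["file_read", "reg_read", "file_write", "reg_write"]

# The same events after sorting, which groups them by kind.
grouped = sorted(observed)

# Bigrams present in both versions of the report.
shared = set(ngrams(observed)) & set(ngrams(grouped))
print(shared)  # set(): the two orderings have no bigram in common
```

Both reports contain exactly the same events, yet an n-gram based comparison would treat them as completely different behaviour.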

Is this a limitation of Cuckoo sandbox?
@robomotic how does your MIST conversion proceed?

Aug 25 '15 14:08 rieck

It's actually generated from the summary results instead of the API logs. The summary results are a bit more stable, they allow us to pull in more information easily, they reduce the size of the reports while retaining critical info, and they automatically eliminate duplicates or near-duplicates that would be observed if we generated reports directly from API calls. The file/registry accesses actually aren't sorted; what you're seeing in those reports is each category of summary results presented in turn. So files and registry keys are listed in order of access (or modification, or read, etc.), as are resolved hostnames and connected IP addresses (I think). The same goes for APIs resolved at runtime.

Aug 25 '15 15:08 brad-sp

Hello, I had a look at your MIST reports, but I am a little concerned about several issues:

1. You have potentially ordered the behavioural events temporally only within categories, i.e. you have all the file accesses followed by the file writes followed by the registry events, etc. This will create a false n-gram histogram.
2. You have also included static signatures that are not behavioural and will bias the clusterer, such as the antivirus signatures and the PE static analysis (icon/section and imphash).

My suggestion is that we need to preserve the order from the summary and remove static analysis and pure signatures. You can use those other predictions/signatures in a stacked classifier or with AdaBoost, but not directly in Malheur.
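As a rough Python sketch of that suggestion, assuming each event is a `(category, argument)` pair already in the order the summary provides: the category labels and the helper name `behavioural_only` below are hypothetical and do not match the actual fields of the Cuckoo module.

```python
# Hypothetical labels for static, non-behavioural features.
STATIC_CATEGORIES = {"av_signature", "pe_icon", "pe_section", "imphash"}


def behavioural_only(events):
    """Drop static features while keeping the remaining events in their original order."""
    return [(cat, arg) for cat, arg in events if cat not in STATIC_CATEGORIES]


events = [
    ("file_read", "a.dll"),
    ("imphash", "0123abcd"),
    ("reg_write", "HKCU\\Software\\Run"),
    ("av_signature", "Trojan.Generic"),
    ("file_write", "b.exe"),
]
print(behavioural_only(events))
```

The dropped static features would still be available as inputs to a separate classifier, as suggested above.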

Aug 25 '15 15:08 robomotic

Thanks! I'll look into implementing those changes.

-Brad

Aug 25 '15 17:08 brad-sp

Ahah wow ok ! :-)

Aug 25 '15 19:08 robomotic
