combine
Handling of "orphan" indicators
Today, indicators that for some reason do not match our "IPv4" or "FQDN" validation just stay there without a type. An example:
$ cat harvest.csv | grep -v FQDN | grep -v IPv4
"entity","type","direction","source","notes","date"
"2001:41d0:8:dcd4::1","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2002:5f18:8f82::5f18:8f82","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2002:c3d3:9a9f::c3d3:9a9f","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a00:1210:fffe:145::1","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a00:1210:fffe:72::1","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a01:238:20a:202:1000::25","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a01:540:2:bd5d:d849:1e69:7736:be41","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a03:7380:140:3:a90f:3bd1:d8d9:3485","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a03:7380:140:3:b86c:62e8:3e0e:a0fb","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a03:7380:2380:0:501b:91a5:76ff:8fa8","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a03:7380:2380:0:95db:5adb:685d:a0f0","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2001:41d0:1:c9b2::1","","inbound","http://www.blocklist.de/lists/bots.txt","","2014-09-04"
"2a01:430:17:1::ffff:376","","inbound","http://www.blocklist.de/lists/bots.txt","","2014-09-04"
"Export","","inbound","http://virbl.org/download/virbl.dnsbl.bit.nl.txt","","2014-09-04"
"ckaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","","outbound","http://www.nothink.org/blacklist/blacklist_malware_dns.txt","","2014-09-04"
We are not interested in IPv6 for now, and the other entries look like parsing errors.
I believe we should filter out the indicators that do not match a specific type.
For IPv6, we can definitely just tag and ignore it for now.
The "Export" indicator from http://virbl.org/download/virbl.dnsbl.bit.nl.txt is actually a bug.
Interestingly, http://www.nothink.org/blacklist/blacklist_malware_dns.txt actually does list ckaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, which we can obviously filter out, but it's notable that they let some bad data through.
I was thinking of just filtering out everything that the IPv4 and FQDN checks do not recognize.
For the bad data, sure, we just filter it out. IPv6 is something we should add as a future enhancement because that's eventually going to be relevant, particularly as a research question.
Sure, but then it becomes a handler here when we are ready :) : https://github.com/mlsecproject/combine/blob/master/thresher.py#L9-L19
Exactly. We add the proper regex now in thresher, but winnower can filter it out (more specifically, only pass types it knows about). Or maybe just have IPv6 output as an option in combine.cfg?
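A rough sketch of the "only pass types it knows about" idea, for illustration only. The function name `filter_known`, the `KNOWN_TYPES` tuple, and the two typed sample rows (with `example.test` sources) are assumptions made up for this example, not code from winnower. The untyped rows are taken from the harvest.csv output above.

```python
import csv
import io

# Assumed default: the thread only names "IPv4" and "FQDN" as recognized types.
KNOWN_TYPES = ("IPv4", "FQDN")


def filter_known(rows, allowed=KNOWN_TYPES):
    """Yield only the rows whose "type" column is one of the allowed types."""
    allowed = set(allowed)
    for row in rows:
        if row.get("type") in allowed:
            yield row


# Two made-up typed rows plus two untyped rows from the example in this issue.
SAMPLE = '''"entity","type","direction","source","notes","date"
"192.0.2.10","IPv4","inbound","http://example.test/a.txt","","2014-09-04"
"2001:41d0:8:dcd4::1","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"Export","","inbound","http://virbl.org/download/virbl.dnsbl.bit.nl.txt","","2014-09-04"
"bad.example.test","FQDN","outbound","http://example.test/b.txt","","2014-09-04"
'''


def demo():
    """Return the entities that survive the filter for the sample above."""
    reader = csv.DictReader(io.StringIO(SAMPLE))
    return [row["entity"] for row in filter_known(reader)]
```

With this approach, an IPv6 row that thresher later tags as "IPv6" would still be dropped until "IPv6" is added to the allowed types.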
Well, if you have a good regex for IPv6 validation, we could just add that right away.
I think the "right" answer is for combine.cfg to have a "list of indicator types I want outputted" in the winnower section, which defaults to ("IPv4", "FQDN"). Ideally you should be able to override that (or select only a few others) from the command line.
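Something along these lines could work for the config-plus-override idea. This is only a sketch: the section name `winnower`, the option name `output_types`, and the `--types` flag are hypothetical names chosen for the example, and combine.cfg may be laid out differently.

```python
import argparse
import configparser

# Default from the proposal above; everything else here is an assumed layout.
DEFAULT_TYPES = ("IPv4", "FQDN")


def parse_types(raw):
    """Split a comma-separated string into a tuple of non-empty type names."""
    return tuple(t.strip() for t in raw.split(",") if t.strip())


def resolve_output_types(cfg_text, argv):
    """Pick output types: command line first, then config, then the default."""
    config = configparser.ConfigParser()
    config.read_string(cfg_text)
    # fallback=None covers both a missing section and a missing option.
    raw = config.get("winnower", "output_types", fallback=None)
    types = parse_types(raw) if raw else DEFAULT_TYPES

    parser = argparse.ArgumentParser()
    parser.add_argument("--types", help="comma-separated indicator types to output")
    args = parser.parse_args(argv)
    if args.types:
        types = parse_types(args.types)
    return types
```

For example, a config containing `output_types = IPv4, FQDN, IPv6` under `[winnower]` would enable IPv6 output, and passing `--types IPv6` on the command line would select only IPv6 for that run.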
I think that's the right way to go. And I'll just use something from http://stackoverflow.com/questions/53497/regular-expression-that-matches-valid-ipv6-addresses ;)
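As an alternative to a hand-rolled regex, here is a minimal sketch that leans on Python's standard `ipaddress` module (Python 3.3+) instead. This is a swapped-in technique, not what thresher does today, and the function name `ip_indicator_type` is made up for the example. It only covers IP addresses and leaves FQDN detection to the existing checks.

```python
import ipaddress


def ip_indicator_type(entity):
    """Return "IPv4" or "IPv6" for a valid IP address string, else ""."""
    try:
        addr = ipaddress.ip_address(entity)
    except ValueError:
        # Not an IP address at all (e.g. "Export" or a domain name).
        return ""
    return "IPv6" if addr.version == 6 else "IPv4"
```

Entries like "Export" fall through with an empty type, so they would still be dropped by a filter that only passes known types.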
Curious - does anybody have a use case for consuming IPv6 indicators right now? I see a lot more of these in the feeds, though I haven't investigated them yet.
I'd just drop them for now. That was my original suggestion.
That is in fact what we do. Just thinking about when we should start doing something with them.