Cortexutils extractor ip detection
Fix proposal for #198.
- 1.0.0.127.localhost.localdomain. => does not match
- 192.168.0.10 => does match
- 192.168.0.1/24 => does match
- 10.0.0.0/8 => does match
- 10.0.0.0/ => does not match
Since the extractor should also find observables that do not sit on a line of their own, the line start and end anchors (^ and $) are not really applicable here. Without them, something like 1.0.0.127.localhost.localdomain will be extracted as 1.0.0.127.
I need to think about how (if?) it's possible to distinguish between an IP in CIDR notation which is in-line and the mentioned "domain case".
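One possible direction, as a minimal sketch only: replace the ^ and $ anchors with lookarounds, so that a candidate is rejected when it is glued to a longer dotted name. The pattern and the names IPV4_INLINE and find_ipv4_candidates are hypothetical and are not the regex currently used in cortexutils. The sketch does not check octet ranges either.

```python
import re

# Hypothetical pattern for illustration, not the cortexutils regex.
IPV4_INLINE = re.compile(
    r"(?<![\w.])"             # not preceded by a word character or a dot
    r"(?:\d{1,3}\.){3}\d{1,3}"  # four dotted groups of 1-3 digits (no range check)
    r"(?:/\d{1,2})?"          # optional CIDR suffix
    r"(?!/|\.?\w)"            # not followed by "/", a word character, or ".<word character>"
)

def find_ipv4_candidates(text):
    """Return all in-line IPv4 or IPv4/CIDR candidates found in text."""
    return IPV4_INLINE.findall(text)
```

With this pattern a trailing sentence period is tolerated, while a dot followed by more characters (the "domain case") causes the candidate to be skipped. A value such as 999.888.777.666 would still be returned, so a validation step is needed on top.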
Even with the fix, 999.888.777.666 would match as a valid IPv4 address when clearly, it is not.
Why do you have to use regular expressions to match IP addresses? Why not use something like ip_address from the standard library?
from ipaddress import ip_address

def is_ip(s):
    """Return True if s is a valid IPv4 or IPv6 address."""
    try:
        ip_address(s)
        return True
    except ValueError:
        return False
After all, TheHive has no separate types for IPv4 and IPv6, only ip.
Also, maybe it would be worth considering adding a network type to TheHive instead of having a range in CIDR notation considered as a valid IPv4 (and you guessed it, ip_network could be used to validate a network in CIDR notation).
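As a rough sketch of what such a network check could look like with ip_network from the standard library: the function name is_network is made up for illustration, and strict=False is an assumption chosen so that a value with host bits set (such as 192.168.0.1/24) is still accepted.

```python
from ipaddress import ip_network

def is_network(s):
    """Return True if s parses as an IPv4 or IPv6 network in CIDR notation."""
    if "/" not in s:
        # ip_network would accept a bare address as a single-host network,
        # so require an explicit prefix here.
        return False
    try:
        # strict=False tolerates host bits being set, e.g. 192.168.0.1/24.
        ip_network(s, strict=False)
        return True
    except ValueError:
        return False
```

Whether host bits should be tolerated is a design choice; dropping strict=False makes 192.168.0.1/24 invalid and only accepts 192.168.0.0/24.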
> Why do you have to use regular expressions to match IP addresses?
Because the extractor should provide an easy way to retrieve observables from reports even if they are in-line and not explicitly given. Analyzers do not have to use the extractor and can implement their own artifacts method.
> Why not use something like ip_address from the standard library?
That would indeed be possible - after finding a possible IP address using regex.
> Why do you have to use regular expressions to match IP addresses?
> Because the extractor should provide an easy way to retrieve observables from reports even if they are in-line and not explicitly given. Analyzers do not have to use the extractor and can implement their own artifacts method.
I'm sorry, I don't understand why this makes using regular expressions a hard requirement.
> Why not use something like ip_address from the standard library?
> That would indeed be possible - after finding a possible IP address using regex.
Again, why? ip_address can take a string as input so you could feed it with the value to check just as you feed regexp.match. Of course, this would imply a minor refactoring(*) of the Extractor class but in the end you would obtain more reliable results.
(*) The changes would be minimal and only internal to the class.
EDIT: to be clear, I volunteer to implement such changes in a PR if you deem them worthwhile.
> Again, why? ip_address can take a string as input so you could feed it with the value to check just as you feed regexp.match.
Yes, it takes a string, but cannot find addresses in a block of text. Again, the extractor should provide the functionality to "automagically" find observables/IoCs in strings which are not the observable itself, but a "wall of text".
Of course, recognizing single IP strings through regex is not the best way to achieve that. But you cannot guarantee that the input is a single IP string.
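The two positions can be combined: a loose regex finds candidates in the wall of text, and ip_address then decides which candidates are real addresses. The following is only a sketch under that assumption. CANDIDATE and extract_ips are hypothetical names, and the pattern is deliberately simplistic rather than the one in cortexutils.

```python
import re
from ipaddress import ip_address

# Deliberately loose candidate pattern (hypothetical, for illustration only).
CANDIDATE = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")

def extract_ips(text):
    """Find dotted-quad candidates in text and keep those that parse as IP addresses."""
    found = []
    for candidate in CANDIDATE.findall(text):
        try:
            ip_address(candidate)
        except ValueError:
            continue  # e.g. 999.888.777.666 is dropped here
        found.append(candidate)
    return found
```

This removes out-of-range values, but it does not solve the "domain case" on its own: 1.0.0.127 taken from 1.0.0.127.localhost.localdomain is a syntactically valid address and would still be returned.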
As this affects only analyzers run through MISP, it has low priority for me until Cortex 2 is released and the documentation is polished up. Until then, you can easily delete inappropriate IoCs in the overview after running the analyzer, which has to be done anyway, as not every returned value is appropriate.
I'm running into an issue where the extracted IPs (IPv4) include version strings and are not valid addresses. I fixed this in the code that calls Cortex (an external app), which basically filters the artifacts through a call to the ipaddress module. You could post-process the regex results with ipaddress.ip_address(val).is_global to filter out some of the noise.
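A minimal sketch of that kind of post-processing, assuming the artifacts arrive as a plain list of strings (the function name keep_global_ips is made up for illustration and is not part of cortexutils):

```python
from ipaddress import ip_address

def keep_global_ips(values):
    """Keep only the values that parse as IP addresses and are globally routable."""
    kept = []
    for value in values:
        try:
            addr = ip_address(value)
        except ValueError:
            continue  # not an IP address at all
        if addr.is_global:
            kept.append(value)
    return kept
```

This drops private, loopback and otherwise reserved addresses along with malformed values. A version string that happens to look like a public address still passes, so it reduces the noise rather than eliminating it, and it also discards internal addresses that may be legitimate observables.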