Updates needed for software framework.

Open JustinAzoff opened this issue 2 months ago • 0 comments

These are mostly notes for me to remember what to fix, but I have seen a few issues lately with the software framework, particularly related to HTTP.

Azure versions

We have ignored_user_agents for browsers, but nothing equivalent for servers. There is a Microsoft cloud proxy that sets the server version string to the region/instance ID, like so:

ECAcc (mil/6C98)
ECAcc (mil/6C22)
ECAcc (mil/6C40)
ECAcc (mil/6C60)
ECAcc (mil/6C28)
ECAcc (mil/6C45)
ECAcc (mil/6C22)
ECAcc (mil/6C45)

See https://learn.microsoft.com/en-us/azure/cdn/cdn-verizon-http-headers

That page gives this example Via request header:

Via: HTTP/1.1 ECD (dca/1A2B)

Because the instance ID keeps changing, almost every one of these requests is logged as new HTTP::SERVER software.
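
One possible shape for a server-side counterpart to ignored_user_agents is sketched below. The names ignored_servers and is_ignored_server are hypothetical and not part of the existing scripts, and the pattern is only a guess built from the sample values above (a lowercase region code followed by an uppercase hex instance ID), so it would need checking against real traffic.

```zeek
module HTTP;

export {
	## Hypothetical option, modelled on ignored_user_agents: server
	## version strings matching this pattern would not be reported
	## to the software framework.
	const ignored_servers = /EC(Acc|D) \([a-z]+\/[0-9A-F]+\)/ &redef;
}

# Hypothetical helper. "==" requires the pattern to match the whole
# string, so longer server strings that merely contain such a token
# are not ignored.
function is_ignored_server(value: string): bool
	{
	return ignored_servers == value;
	}
```

The handler that currently reports HTTP::SERVER could call is_ignored_server(value) before calling Software::found. An alternative would be to strip the instance ID and still report the software under a stable name, which keeps the host visible in the software log.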

Proxy load

In a change I made a while ago, I moved the version parsing to the proxies, which reduced the worker load quite a bit. However, the software framework's found function still sends every piece of software it finds up to the proxies. Something like this in found could help:

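        # Skip host/version pairs this worker has already reported recently.
        if (info?$unparsed_version) {
            if ([info$host, info$unparsed_version] in found_cache)
                return T;
            # First sighting: remember it, then fall through to the normal path.
            add found_cache[info$host, info$unparsed_version];
        }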

where found_cache is a set[addr, string] with &create_expire set to something reasonable. It would be great if that could stay in sync with the existing table:

global tracked: table[addr] of SoftwareSet &create_expire=1day;
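
For reference, a minimal sketch of how the cache itself might be declared, wrapped in a helper. The 1hr expiry and the helper name seen_recently are assumptions made for illustration. The cache here is local to each worker and is not synchronized with tracked in any way.

```zeek
module Software;

# Hypothetical worker-local cache of (host, unparsed version) pairs
# that have already been sent to the proxies. Entries expire a fixed
# time after insertion; 1hr is an arbitrary placeholder.
global found_cache: set[addr, string] &create_expire=1hr;

# Hypothetical helper: returns T if the pair is already cached,
# otherwise records it and returns F.
function seen_recently(host: addr, unparsed_version: string): bool
	{
	if ( [host, unparsed_version] in found_cache )
		return T;

	add found_cache[host, unparsed_version];
	return F;
	}
```

Since the two expiry timers run independently, a worker may still resend an entry that the proxies already track, or suppress one that has expired there. Keeping the cache expiry well below the 1day used for tracked would limit the second case.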

Multiple browsers

The software framework assumes that, for each software type, a host has one and only one version of that software. This makes sense for things like an SSH server, but now, with things like Electron apps and Chrome/Edge/Safari, it's not uncommon for a single host to make multiple concurrent HTTP requests with alternating user agents. Or a host could be running two different HTTP servers for two different API services. Every time the host flip-flops, it triggers new software log entries that don't actually contain new information.

On one network I looked at, 70% of the last 1,000,000 software log entries were duplicates.
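
One way to stop the flip-flopping would be to remember every version seen for a host and software name, instead of only the most recent one. The sketch below is an assumption about how that might look, not how the framework works today: seen_versions and is_new_version are hypothetical names, and keying on the unparsed version string is a simplification.

```zeek
module Software;

# Hypothetical: all version strings seen per (host, software name).
# The whole entry expires one day after it is first created.
global seen_versions: table[addr, string] of set[string] &create_expire=1day;

# Hypothetical helper: returns T only the first time a given version
# string is seen for this host and software name.
function is_new_version(host: addr, name: string, unparsed_version: string): bool
	{
	if ( [host, name] !in seen_versions )
		seen_versions[host, name] = set();

	if ( unparsed_version in seen_versions[host, name] )
		return F;

	add seen_versions[host, name][unparsed_version];
	return T;
	}
```

With this, a host alternating between two user agents would produce two log entries per expiry window rather than one per switch. The cost is extra memory per host, so the set size would probably need a cap.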

JustinAzoff, May 02 '24 12:05