scanner_user_agents Schema Proposal

This is a proposal for a schema change which is substantial enough to be worth considering in the early phase:

Proposal:

[
  {
    "match": "masscan(-ng)?\/",
    "name": "Masscan",
    "url": "https://github.com/robertdavidgraham/masscan",
    "examples": [
      "masscan/1.3",
      "masscan-ng/1.3 (https://github.com/bi-zone/masscan-ng)",
      "masscan/1.3 (https://github.com/robertdavidgraham/masscan)"
    ],
    "known_ips": [],
    "reviewed_at": "2022-06-29",
    "confidence": "high"
  }
]

match: An RE2-compatible regular expression to match User Agents against. The RE2 engine is fast, ReDoS-safe, and usable from many languages.
name: Name of the tool.
url: Tool website or GitHub repository where more information can be found.
examples: A list of actual User Agents from the tool. This preserves the value of the current schema, which contains complete, real attack tool UAs, and the list can be used in automated testing to ensure the regular expression matches the expected UAs.
confidence: An indication of how confident someone can be that a match from this signature is an actual attack tool request, rather than a genuine browser request whose UA the tool imitates. This would be useful for filtering purposes.
- High: Very high certainty that the UA is from an attack tool (e.g. if the UA contains masscan, it's very likely Masscan)
- Medium: Likely that the UA is from an attack tool (e.g. if the UA is a very old browser UA and known to be spoofed by a tool)
- Low: Potentially an attack tool, but also a high likelihood that it is a regular browser request
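
To illustrate how the match and confidence fields could work together, here is a minimal Python sketch. It is only an illustration under a few assumptions: it uses Python's built-in re module rather than an actual RE2 binding (this particular pattern behaves the same in both), it matches case-sensitively, and the identify function, the SIGNATURES list, and the ranking of the confidence levels are hypothetical names that are not part of the proposal.

```python
import re

# Hypothetical in-memory copy of the proposed entry (field names as proposed).
SIGNATURES = [
    {
        "match": r"masscan(-ng)?\/",
        "name": "Masscan",
        "url": "https://github.com/robertdavidgraham/masscan",
        "confidence": "high",
    }
]

# Assumed ordering of the three confidence levels, lowest to highest.
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def identify(user_agent, min_confidence="low"):
    """Return entries matching the UA whose confidence is at least min_confidence."""
    threshold = CONFIDENCE_RANK[min_confidence]
    return [
        sig
        for sig in SIGNATURES
        if CONFIDENCE_RANK[sig["confidence"]] >= threshold
        and re.search(sig["match"], user_agent)
    ]


# One pattern covers both forks and any version number.
for ua in ("masscan/1.3", "masscan-ng/1.3 (https://github.com/bi-zone/masscan-ng)"):
    print(ua, "->", [sig["name"] for sig in identify(ua, min_confidence="high")])
```

An analyst who only wants near-certain hits would pass min_confidence="high", while a broader hunt could keep the default and review the low-confidence matches by hand.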

Benefit

The main benefit of the proposed schema is more flexible and future-proof matching of User Agents, which avoids the current need to create multiple entries for the same tool just to accommodate the different version numbers and URLs present in the UA. Exact User Agents are still captured in the examples list, which is something that is unique to this project (as far as I have seen). The url and confidence values would make a match more actionable to an analyst, as they would know where to get more information as well as how confident they can be in the finding.

If this looks like a good idea, I will gladly help convert the current entries to the new format!

Jun 28 '22 13:06 michenriksen

I like the look of that. I think it needs a creation date or something like that. I called it last seen before, but that probably isn't right. I just wanted something that could be used to indicate whether any entries need double checking because they haven't been updated for a while, especially for things like the big scanners. I can imagine someone like Nessus arbitrarily changing their UA on a point version change just because someone wanted to.

A tool to check the regex against the examples would be cool and a useful way to validate that regex worked. I don't know much about the GitHub PR checks, but I'm fairly sure it could be built into that.

Jun 28 '22 13:06 digininja

Ah, right, I forgot about the last seen value! Perhaps we could call it something like reviewed_at to better communicate when the entry was last checked for correctness?

edit: updated the proposed schema to include a reviewed_at value.
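
As a rough sketch of how reviewed_at could be used to flag entries that need double checking, something like the following might work. The needs_review function and the 180-day threshold are assumptions made for illustration; nothing in this thread settles on a specific review interval.

```python
from datetime import date, timedelta


def needs_review(entry, today, max_age_days=180):
    """Return True if the entry was last reviewed more than max_age_days ago.

    max_age_days is an arbitrary, hypothetical threshold.
    """
    reviewed = date.fromisoformat(entry["reviewed_at"])
    return today - reviewed > timedelta(days=max_age_days)


entry = {"name": "Masscan", "reviewed_at": "2022-06-29"}
print(needs_review(entry, today=date(2022, 7, 1)))   # reviewed two days earlier
print(needs_review(entry, today=date(2023, 6, 29)))  # reviewed a year earlier
```

Storing the date in ISO 8601 form, as in the proposed schema, keeps parsing trivial and makes the values sort correctly as plain strings.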

Jun 29 '22 12:06 michenriksen

A tool to check the regex against the examples would be cool and a useful way to validate that regex worked. I don't know much about the GitHub PR checks, but I'm fairly sure it could be built into that.

Yes, automatic "unit testing" on PRs should definitely be relatively straightforward to add. I can look into setting that up, unless you want to give it a go? :)
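
A first rough sketch of such a check, assuming the entries live in a JSON file shaped like the proposal: it compiles each match pattern and reports every example it fails to match. The check_entries function and the inline SAMPLE data are hypothetical, and Python's re module stands in for a real RE2 engine here, so a pattern that passes this check is not guaranteed to be RE2-compatible.

```python
import json
import re


def check_entries(entries):
    """Return (name, example) pairs where the entry's pattern misses its own example."""
    failures = []
    for entry in entries:
        pattern = re.compile(entry["match"])
        for example in entry.get("examples", []):
            if not pattern.search(example):
                failures.append((entry["name"], example))
    return failures


# Inline stand-in for the real data file. In JSON, "\/" is a valid escape that
# decodes to "/"; other regex escapes such as \d would have to be written "\\d".
SAMPLE = r'''
[
  {
    "match": "masscan(-ng)?\/",
    "name": "Masscan",
    "examples": [
      "masscan/1.3",
      "masscan-ng/1.3 (https://github.com/bi-zone/masscan-ng)",
      "masscan/1.3 (https://github.com/robertdavidgraham/masscan)"
    ]
  }
]
'''

entries = json.loads(SAMPLE)
for name, example in check_entries(entries):
    print(f"{name}: pattern does not match example {example!r}")
```

A PR check could run a script like this and fail the build whenever the list of failures is non-empty.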

Jun 29 '22 12:06 michenriksen

I'm quite happy for you to set things up. Want me to give you access to make it easier?

Jun 29 '22 13:06 digininja