strelka [BUG] HTML/JavaScript recursion

Describe the bug

We've identified a bug in the HTML/JavaScript identification and extraction code. libmagic can incorrectly identify a file as "text/html" while YARA correctly identifies the same file as "javascript_file". When this happens, the ScanHtml scanner is applied to the JavaScript file and enters a recursive file extraction loop until the maximum depth is hit.

Steps to reproduce

1. Find an HTML file whose embedded JavaScript gets tasted as "text/html" by libmagic
2. Run the file through Strelka
3. Check for Python logs that describe "exceeded maximum depth" or scan results where the same HTML file is being repeatedly extracted

Expected behavior

JavaScript should not be tasted as HTML.

Screenshots

N/A

Server and project version

OS: Ubuntu Bionic
Commit Hash: N/A (first release)

Additional context

N/A

Sep 25 '18 16:09 jshlbrd

I identified cases where this recursion was happening by searching for file.depth:15 (the default limit). The frequency is extremely low (0.00003%). The attached file, a Vim macro, triggers this bug.

less.vim.txt
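
The depth check described above can be sketched in Python. This is a minimal sketch, assuming scan results are available as newline-delimited JSON and that each event nests the depth under a `file` object (inferred from the `file.depth` field name; the exact event layout is an assumption). The function name `events_at_max_depth` is hypothetical.

```python
import json
from typing import Iterable, Iterator

MAX_DEPTH = 15  # default limit mentioned in this comment


def events_at_max_depth(
    lines: Iterable[str], max_depth: int = MAX_DEPTH
) -> Iterator[dict]:
    """Yield parsed events whose file.depth has reached max_depth.

    Assumes one JSON object per line with a nested "file" object;
    this layout is an assumption, not a documented schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(event, dict):
            continue  # skip JSON values that are not objects
        file_info = event.get("file")
        if not isinstance(file_info, dict):
            continue
        depth = file_info.get("depth")
        if isinstance(depth, int) and depth >= max_depth:
            yield event
```

Passing an open log file to `events_at_max_depth` would yield only the events that hit the limit, which are the candidates for this recursion.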

Jan 12 '23 00:01 ryanohoro

After analyzing a large volume of events, it's apparent that the MIME type matching for text/html is overzealous.

html_file: 1.05
text/html: 1.6
both: 1

I see two solutions:

1. Remove the text/html MIME type from the default ScanHtml configuration.

   From the data on this problem, it seems that most of what text/html catches but html_file does not is either not HTML or broken HTML (from split or partial responses). Exceptions include HTML files that start with whitespace or comments, which can be addressed by improving the html_file YARA rule.

2. Prevent ScanHtml from being a child (source) of itself.

   This will prevent the recursion problem, and may be applicable in some other situations if implemented as a configuration option. Some scanners should normally recurse. However, this option won't prevent mostly unhelpful analysis of files that will not yield interesting results.

e.g.

  'ScanHtml':
    - positive:
        flavors:
          - 'hta_file'
          - 'text/html'
          - 'html_file'
      exclude_sources:
        - 'ScanHtml'
      priority: 5
      options:
        parser: "html5lib"
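
The effect of the proposed option can be sketched in Python. This is not Strelka's actual assignment code: `ScannerRule`, `assign_scanners`, and `simulate_depth` are hypothetical names, and `exclude_sources` is the option proposed in this comment rather than a confirmed setting. The simulation assumes the worst case from this bug, where every extracted child carries the same flavors as its parent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ScannerRule:
    """One scanner mapping, loosely mirroring the YAML above (hypothetical)."""
    positive_flavors: Set[str]
    exclude_sources: Set[str] = field(default_factory=set)  # proposed option


def assign_scanners(
    flavors: Set[str], source: Optional[str], rules: Dict[str, ScannerRule]
) -> List[str]:
    """Return scanners whose flavors match and whose source is not excluded."""
    assigned = []
    for name, rule in rules.items():
        if not (flavors & rule.positive_flavors):
            continue  # no positive flavor matched
        if source is not None and source in rule.exclude_sources:
            continue  # file was extracted by an excluded scanner
        assigned.append(name)
    return assigned


def simulate_depth(
    rules: Dict[str, ScannerRule], flavors: Set[str], max_depth: int = 15
) -> int:
    """Count extraction levels when each child keeps the parent's flavors."""
    depth = 0
    source: Optional[str] = None  # the root file has no source scanner
    while depth < max_depth:
        if "ScanHtml" not in assign_scanners(flavors, source, rules):
            break
        source = "ScanHtml"  # the next file was extracted by ScanHtml
        depth += 1
    return depth
```

Under these assumptions, a rule without `exclude_sources` runs until the depth limit, while a rule that excludes `ScanHtml` as a source stops after the first extraction.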

The attached file triggers the JavaScript variety of this bug.

search.js.txt
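
For the first option, the cases html_file reportedly misses (HTML that starts with whitespace or comments) could be handled by a check along these lines. This is a Python illustration of the matching logic only, not the html_file YARA rule itself; `looks_like_html` and the `max_probe` limit are hypothetical.

```python
def looks_like_html(data: bytes, max_probe: int = 4096) -> bool:
    """Return True if data starts with a doctype or <html> tag,
    after skipping leading whitespace and HTML comments.

    Only the first max_probe bytes are inspected (an arbitrary limit).
    """
    head = data[:max_probe].lstrip()
    # Skip any number of leading <!-- ... --> comments.
    while head.startswith(b"<!--"):
        end = head.find(b"-->", 4)  # search past the comment opener
        if end == -1:
            return False  # unterminated comment within the probe window
        head = head[end + 3:].lstrip()
    lowered = head[:16].lower()
    return lowered.startswith(b"<!doctype html") or lowered.startswith(b"<html")
```

A JavaScript file that merely contains an HTML string, such as `var s = '<html>';`, does not pass this check because the markup is not at the start of the file.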

Jan 12 '23 02:01 ryanohoro