[BUG] HTML/JavaScript recursion
Describe the bug We've identified a bug in the HTML/JavaScript identification and extraction code. It's possible that libmagic will incorrectly identify a file as "text/html" while YARA correctly identifies the same file as "javascript_file". When this happens, the ScanHtml scanner is applied to the JavaScript file and enters a recursive file extraction loop that only stops when the maximum depth is hit.
Steps to reproduce Steps to reproduce the behavior:
- Find an HTML file that contains embedded JavaScript that gets tasted as "text/html" by libmagic
- Run the file through Strelka
- Check for Python logs that describe "exceeded maximum depth" or scan results where the same HTML file is being repeatedly extracted
Expected behavior JavaScript should not be tasted as HTML.
Screenshots N/A
Server and project version
- OS: Ubuntu Bionic
- Commit Hash: N/A (first release)
Additional context N/A
I identified cases where this recursion was happening by looking at file.depth:15 (default limit). The frequency is extremely low (0.00003%). The attached file, a VIM macro, triggers this bug.
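A minimal sketch of how such events could be found, assuming the scan results are available as JSON lines in which each event carries a `file` object with a `depth` field. The event layout is an assumption and may differ in your deployment; `find_max_depth_events` and `MAX_DEPTH` are hypothetical names used only for illustration.

```python
import json

MAX_DEPTH = 15  # default depth limit mentioned above


def find_max_depth_events(lines, max_depth=MAX_DEPTH):
    """Return parsed events whose file.depth reached the limit.

    Assumes one JSON object per line with a nested file.depth field.
    Blank lines and lines that are not valid JSON are skipped.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(event, dict):
            continue
        file_info = event.get("file")
        depth = file_info.get("depth") if isinstance(file_info, dict) else None
        if isinstance(depth, int) and depth >= max_depth:
            hits.append(event)
    return hits


# Hypothetical sample events, not real Strelka output.
sample = [
    '{"file": {"depth": 15, "name": "a.html"}}',
    '{"file": {"depth": 2, "name": "b.html"}}',
    "not json",
    "",
]
for event in find_max_depth_events(sample):
    print(event["file"]["name"])
```

In practice the same filter could be run over a log file by passing an open file handle as `lines`.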
Analysis of a large volume of events shows that the mime type matching for text/html is overzealous.
html_file: 1.05
text/html: 1.6
both: 1
I see two solutions:
- Remove the "text/html" mime type from the default ScanHtml configuration. While analyzing the data on this problem, it seems most of what "text/html" catches but "html_file" does not is either not HTML or is broken HTML (from split or partial responses). Some exceptions are things like HTML files that start with white space or comments, which can be addressed by improving the "html_file" YARA rule.
- Prevent ScanHtml from being a child (source) of itself. This will prevent the recursion problem, and may be applicable in some other situations if implemented as a configuration option. Some scanners should normally recurse. However, it won't prevent mostly unhelpful analysis of files that will not yield interesting results.
e.g.

    'ScanHtml':
      - positive:
          flavors:
            - 'hta_file'
            - 'text/html'
            - 'html_file'
          exclude_sources:
            - ScanHtml
        priority: 5
        options:
          parser: "html5lib"
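The check implied by the second solution could look roughly like the sketch below. This is not Strelka's actual scanner assignment code: `scanner_applies` is a hypothetical helper, the rule is flattened into a plain dict for brevity, and `exclude_sources` is the proposed (not existing) configuration key from the example above.

```python
def scanner_applies(rule, flavors, source):
    """Decide whether a scanner rule matches a file.

    rule    -- dict with a "flavors" list and an optional "exclude_sources" list
    flavors -- flavors assigned to the file by libmagic/YARA tasting
    source  -- name of the scanner that extracted the file, or None for a root file
    """
    # Proposed behavior: never assign the scanner to files that were
    # extracted by one of the excluded source scanners.
    if source is not None and source in rule.get("exclude_sources", []):
        return False
    return any(flavor in rule.get("flavors", []) for flavor in flavors)


# Hypothetical, flattened version of the ScanHtml rule.
scan_html_rule = {
    "flavors": ["hta_file", "text/html", "html_file"],
    "exclude_sources": ["ScanHtml"],
}

tasted = ["text/html", "javascript_file"]
# Root file: ScanHtml is still assigned.
print(scanner_applies(scan_html_rule, tasted, None))
# Child extracted by ScanHtml: assignment is skipped, which breaks the loop.
print(scanner_applies(scan_html_rule, tasted, "ScanHtml"))
```

With this approach the misidentified file is still scanned once by ScanHtml, but its children are not fed back into the same scanner.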
The attached file triggers the javascript variety of this bug.
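For the first solution, the kind of matching an improved "html_file" rule would need (tolerating leading white space and comments) can be illustrated in Python. This is only a sketch of the logic, not the YARA rule itself; `HTML_START` and `looks_like_html` are hypothetical names, and the 4096-byte window is an arbitrary assumption.

```python
import re

# Optional leading whitespace and HTML comments, followed by a doctype
# declaration or an opening <html> tag.
HTML_START = re.compile(
    rb"\A(?:\s|<!--.*?-->)*<(?:!doctype\s+html|html[\s>])",
    re.IGNORECASE | re.DOTALL,
)


def looks_like_html(data: bytes) -> bool:
    """Return True if the start of the data looks like an HTML document."""
    # Only inspect the head of the file (window size is an assumption).
    return HTML_START.match(data[:4096]) is not None


print(looks_like_html(b"  \n<!-- banner -->\n<html><body></body></html>"))
print(looks_like_html(b"var page = '<html></html>';"))
```

A check like this accepts HTML that begins with white space or comments while rejecting JavaScript that merely contains HTML markup in a string.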