Fix incomplete multi-character sanitization of HTML tags in a web page (XSS)

Open odaysec opened this issue 8 months ago • 0 comments

https://github.com/uber/baseweb/blob/f7b840f829a949b4fdb3fca707c56575b0ffe31b/src/icon/build-icons.js#L34-L35

To fix the problem, ensure that all instances of the targeted pattern are removed from the input string, even if they appear consecutively or are nested. The most reliable way to do this with a regular expression is to apply the replacement repeatedly until no more replacements can be performed. Concretely, modify the reactify function so that it repeats the HTML comment replacement until the input string no longer changes. This removes all HTML comments, regardless of how they are nested.

Sanitizing untrusted input is a common technique for preventing injection attacks and other security vulnerabilities. Regular expressions are often used to perform this sanitization. However, when the regular expression matches multiple consecutive characters, replacing it just once can result in the unsafe text reappearing in the sanitized input. Attackers can exploit this issue by crafting inputs that, when sanitized with an ineffective regular expression, still contain malicious code or content. This can lead to code execution, data exposure, or other vulnerabilities.

POC

Consider the following JavaScript code that aims to remove all HTML comment start and end tags:

str.replace(/<!--|--!?>/g, "");

Given an input in which one comment delimiter is nested inside the fragments of another, a single pass removes the inner delimiters and the leftover fragments join into new "<!--" and "-->" tags, so the output still contains an HTML comment. One possible fix for this issue is to apply the regular expression replacement repeatedly until no more replacements can be performed. This ensures that the unsafe text does not reappear in the sanitized input, effectively removing all instances of the targeted pattern:

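// Repeat the replacement until a pass leaves the string unchanged,
// so delimiters re-formed by an earlier pass are removed as well.
function removeHtmlComments(input) {
  let previous;
  do {
    previous = input;
    input = input.replace(/<!--|--!?>/g, "");
  } while (input !== previous);
  return input;
}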
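The difference between a single pass and the repeated replacement can be checked with a small script. This is only a sketch: the input string `nested` is a hypothetical example constructed for illustration (it is not taken from the baseweb code), and the function is the same loop as above.

```javascript
// Hypothetical input: a "<!--" sits inside a split "<!" + "--",
// and a "-->" sits inside a split "--" + ">".
const nested = "<!<!---- comment ---->>";

// Single pass: the inner delimiters are removed and the outer
// fragments join into a fresh comment.
const singlePass = nested.replace(/<!--|--!?>/g, "");

// Repeated replacement, as in the fix described above.
function removeHtmlComments(input) {
  let previous;
  do {
    previous = input;
    input = input.replace(/<!--|--!?>/g, "");
  } while (input !== previous);
  return input;
}

const looped = removeHtmlComments(nested);

console.log(JSON.stringify(singlePass)); // "<!-- comment -->"
console.log(JSON.stringify(looped));     // " comment "
```

The loop terminates because every pass that changes the string also shortens it.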

Another example is the following regular expression, which is intended to remove script tags:

str.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/g, "");

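If the input nests a complete script element inside the split halves of an outer "<script>" tag, a single pass removes only the inner element and the halves join, so the output still contains a script tag such as "<script>alert(123)</script>". A fix for this issue is to rewrite the regular expression to match single characters ("<" and ">") instead of the entire unsafe text. A single-character match cannot be re-formed by removing other text, so this simplifies the sanitization process and ensures that all potentially unsafe characters are removed: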

// Remove every angle bracket; no pass can re-create a tag from the remaining text.
function removeAllHtmlTags(input) {
  return input.replace(/<|>/g, "");
}
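To see the bypass and the single-character fix side by side, the following sketch runs both on one input. The string `nestedScript` is a hypothetical payload chosen for illustration, and `scriptPattern` is simply the regular expression quoted above stored in a variable.

```javascript
// The multi-character pattern from the text above.
const scriptPattern = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/g;

// Hypothetical payload: a full script element placed between "<scrip" and "t>".
const nestedScript = "<scrip<script>is removed</script>t>alert(123)</script>";

// Single pass: only the inner element matches, and the outer halves join.
const afterScriptPass = nestedScript.replace(scriptPattern, "");

// Single-character fix: strip every "<" and ">".
function removeAllHtmlTags(input) {
  return input.replace(/<|>/g, "");
}

const afterBracketStrip = removeAllHtmlTags(nestedScript);

console.log(afterScriptPass);   // <script>alert(123)</script>
console.log(afterBracketStrip); // contains no "<" or ">" characters
```

The stripped result is no longer readable markup, which is the trade-off of this approach: it is suitable when the output should be plain text rather than HTML.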

Another potential fix is to use the popular sanitize-html npm library. It keeps most of the safe HTML tags while removing all unsafe tags and attributes.

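const sanitizeHtml = require("sanitize-html");
// With its default options, sanitize-html keeps an allow-list of safe tags
// and attributes and drops everything else, including script elements.
function removeAllHtmlTags(input) {
  return sanitizeHtml(input);
}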

References

A1 Injection.
Removing all script tags from HTML with JS regular expression.
CWE-20.
CWE-80.
CWE-116.

Apr 15 '25 23:04 odaysec