sanitize-html: `<hello` returns `` when escaping content

To Reproduce

Step by step instructions to reproduce the behavior:

Install sanitize-html version 2.17.0
Write a program with this script:

const sanitizeHtml = require('sanitize-html');

// Sanitize an unterminated tag with disallowed tags escaped rather than dropped.
const res = sanitizeHtml('<hello', {
  allowedAttributes: {
    ...sanitizeHtml.defaults.allowedAttributes,
    span: ['data-userid'],
    '*': ['class']
  },
  disallowedTagsMode: 'recursiveEscape',
  preserveEscapedAttributes: true
});
console.log(`Result: "${res}"`);
// Result: ""

See the issue

Expected behavior

The console log should print <hello, since it isn't a valid HTML tag.

Describe the bug

Any text starting with < is completely removed from the final output in escape mode. In addition, text like <hello you> is returned as <hello you=""> instead of just being escaped.

Details

Version of Node.js: 20.17.0

Jun 02 '25 08:06 Bricklou

I think your test case is incomplete. What HTML are we sanitizing?

Jun 02 '25 09:06 BoDonkey

ah sorry, let me update my message

but my tests are:

'<hello' => '' (expected: <hello)
'<hello you' => '' (expected: <hello you)
'<hello you>' => '<hello you="">' (expected: <hello you>)

Jun 02 '25 09:06 Bricklou

Taking a quick look at the code, I think the problem stems from the newer onclosetag handler and how htmlparser2 handles malformed tags. I think it treats the <hello as an unmatched closing tag, so it just gets discarded. A solution might be to parse through the input and escape any malformed tags before passing it to the parser. Not sure though, just a guess. I'll mark this as a good first issue and then circle back at some point.
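
A rough sketch of that pre-escaping idea, just to make it concrete. This is not how sanitize-html works today, and `escapeUnclosedTags` is a hypothetical helper name, not part of the library. It assumes that "malformed" means a < with no > before the next < or the end of the input, and it does not understand quoted attribute values, comments, or script content, so something like <a title="a<b"> would be escaped by mistake.

```javascript
// Hypothetical helper, not part of sanitize-html: escapes any "<" that is
// never closed by a ">" before the next "<" or the end of the input.
function escapeUnclosedTags(input) {
  // The lookahead scans over characters that are neither "<" nor ">" and
  // succeeds only if it then hits another "<" or the end of the string.
  return input.replace(/<(?=[^<>]*(?:<|$))/g, '&lt;');
}

console.log(escapeUnclosedTags('<hello'));          // &lt;hello
console.log(escapeUnclosedTags('<hello you'));      // &lt;hello you
console.log(escapeUnclosedTags('<hello you>'));     // <hello you>
console.log(escapeUnclosedTags('<b>bold</b> <hi')); // <b>bold</b> &lt;hi
```

The output of this step would then be handed to sanitizeHtml as usual. Note that it only covers the first two test cases from the report: <hello you> is syntactically a complete tag, so it would still reach the parser unchanged.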

Jun 02 '25 12:06 BoDonkey

I think that, in the absence of any closing >, the parser probably just treats the input as too malformed to do much with, so we might or might not be able to access this information from htmlparser2.

Jun 02 '25 13:06 boutell