htmlparser2

Incorrect and inconsistent behavior of self-closing tags

Open Aivean opened this issue 9 months ago • 7 comments

"htmlparser2": "^10.0.0",

Minimal repro:

const { Parser, DomHandler } = require('htmlparser2');
const render = require('dom-serializer').default;

// Test case: self-closing tag followed by content
const input = '<span />content';

// Parse with htmlparser2
const handler = new DomHandler();
const parser = new Parser(handler);
parser.end(input);

// Get the parsed DOM and render it back to HTML
const dom = handler.root;
const output = render(dom);

console.log('Input HTML:   ', input);
console.log('Output HTML:  ', output);

Prints:

Input HTML:    <span />content
Output HTML:   <span>content</span>

I'd expect the output HTML to look either like the input or, at least, like <span></span>content (since span is not a void element, per isVoidElement).


Here's a more complex example with `pre` that prompted me to create this issue. The `start` and `end` indices are messed up:
const { Parser, DomHandler } = require('htmlparser2');
const render = require('dom-serializer').default;

// Test case: self-closing tag followed by content
const input = '<pre><span />code with</pre> test';

// Parse with htmlparser2
const handler = new DomHandler({
  withStartIndices: true,
  withEndIndices: true,
});
const parser = new Parser(handler);
parser.end(input);

// Get the parsed DOM and render it back to HTML
const dom = handler.root;
const output = render(dom);

console.log('Input HTML:   ', input);
console.log('Output HTML:  ', output);

// recursively display start and end indices for each node
function displayIndices(node, level = 0) {
  const indent = ' '.repeat(level * 2);
  const originalContent = node.startIndex !== undefined ? input.slice(node.startIndex, node.endIndex + 1) : '';
  console.log(`${indent}Node: ${node.type}, Start: ${node.startIndex}, End: ${node.endIndex} - Content: "${originalContent}"`);

  if (node.children) {
    node.children.forEach(child => displayIndices(child, level + 1));
  }
}
// Display the start and end indices for each node
console.log('Node indices:');
dom.children.forEach(child => displayIndices(child));

Result:

Input HTML:    <pre><span />code with</pre> test
Output HTML:   <pre><span>code with</span></pre> test
Node indices:
Node: tag, Start: 0, End: 27 - Content: "<pre><span />code with</pre>"
  Node: tag, Start: 5, End: 27 - Content: "<span />code with</pre>"
    Node: text, Start: 13, End: 21 - Content: "code with"
Node: text, Start: 28, End: 32 - Content: " test"

Note the span node, whose range runs past its own "/>" all the way to the end of </pre>: Node: tag, Start: 5, End: 27 - Content: "<span />code with</pre>"
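To make the mismatch concrete, here is a small sketch of the offset arithmetic. It assumes, as the slices above suggest, that endIndex is inclusive. sourceOf is a hypothetical helper, and the node objects are hand-written stand-ins that carry the reported offsets, not actual library output.

```javascript
// Hypothetical helper: the source text a node covers, treating endIndex as inclusive.
function sourceOf(input, node) {
  return input.slice(node.startIndex, node.endIndex + 1);
}

const input = '<pre><span />code with</pre> test';

// Stand-in carrying the offsets reported for the span node above.
const reportedSpan = { startIndex: 5, endIndex: 27 };
// For comparison only: the range the span would cover if it ended at its own "/>".
const selfClosedSpan = { startIndex: 5, endIndex: 12 };

console.log(sourceOf(input, reportedSpan));   // <span />code with</pre>
console.log(sourceOf(input, selfClosedSpan)); // <span />
```

The reported range swallows both the text node and the closing tag of the parent pre element.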

Aivean avatar Mar 08 '25 21:03 Aivean

I have had the same issue in the past and found no workaround, so it would be awesome if this behavior could be fixed! Since we use this package at large scale through another one (sanitize-html), we had to run a different parser first to handle self-closing tags properly before passing the result to sanitize-html. That is a lot of work for a rather evident bug.

srenoufd avatar Mar 25 '25 11:03 srenoufd

You can try constructing the parser as new Parser({...eventsHandler}, {xmlMode: true}) to fix the self-closing error.

molinla avatar May 28 '25 08:05 molinla

@molinla

you can try set new Parser({...eventsHandler}, {xmlMode: true}) to fix self closing error.

Thanks, that is a good workaround (in fact, it can be reduced further to recognizeSelfClosing: true, which xmlMode sets implicitly).

However, the default "HTML mode" is still broken, so I'll leave the issue open.

Aivean avatar Jun 01 '25 21:06 Aivean

Input HTML:    <span />content
Output HTML:   <span>content</span>

I get exactly the same output if I pass this HTML to any browser:

document.body.innerHTML = "<span />content";
document.body.innerHTML; // returns "<span>content</span>"

I don't think that's an issue.

DimaIT avatar Jun 24 '25 15:06 DimaIT

Self-closing tags on non-"void elements" were introduced by frameworks such as VueJS, which recommends self-closing tags for components; in a Vue-powered environment, that syntax also works for all HTML tags.

I think that's what has led to this misunderstanding
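For reference, a minimal sketch of the distinction: in HTML syntax, a trailing "/" never closes a non-void HTML element, so <span /> is read as an ordinary start tag. The set below is the void-element list from the current HTML Standard as far as I know; older lists and some parser implementations also include a few obsolete names. closesItself is a hypothetical helper name, and the sketch ignores foreign (SVG/MathML) content.

```javascript
// Void elements per the HTML Standard (obsolete names omitted).
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Hypothetical helper: does a start tag complete its element without a closing tag?
function closesItself(tagName) {
  return VOID_ELEMENTS.has(tagName.toLowerCase());
}

console.log(closesItself('br'));   // true
console.log(closesItself('span')); // false: "<span />" stays open in HTML mode
```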

srenoufd avatar Jun 25 '25 07:06 srenoufd

Don't think that's an issue

take a look at the second example I provided:

Input HTML:    <pre><span />code with</pre> test
Output HTML:   <pre><span>code with</span></pre> test
Node indices:
Node: tag, Start: 0, End: 27 - Content: "<pre><span />code with</pre>"
  Node: tag, Start: 5, End: 27 - Content: "<span />code with</pre>"
    Node: text, Start: 13, End: 21 - Content: "code with"
Node: text, Start: 28, End: 32 - Content: " test"

I'd expect a parser library to parse and return exactly the content it's given, instead of hallucinating extra closing tags.

Aivean avatar Jun 26 '25 19:06 Aivean

take a look at the second example I provided

Once again, the result matches the browser's behavior.

I'd expect from the parser library to parse and return exactly the content that it's given, instead of hallucinating extra closing tags

I understand your frustration, but from an HTML standpoint, the parser is behaving correctly.

For your use case, you’ll need either an XML parser, a parser for some flavored HTML, or no parser at all.

The HTML tokenizer would do exactly what you need: it reads the input without making any structural changes. But it's lower level, and you'll need to deal with any unexpected HTML structures yourself.

With your example:

import { Tokenizer } from 'htmlparser2';

const callbacks = {
    onopentagname() {
        console.log('onopentagname', ...arguments);
    },
    onselfclosingtag() {
        console.log('onselfclosingtag', ...arguments);
    },
    ontext() {
        console.log('ontext', ...arguments);
    },
    onend() {
        console.log('onend', ...arguments);
    },
};

const tokenizer = new Tokenizer({}, callbacks);
tokenizer.write(`<span />code with`);
tokenizer.end();


// Output:
// > onopentagname 1 5
// > onselfclosingtag 7
// > ontext 8 17
// > onend
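Reading those offsets back against the input shows that nothing was restructured. This sketch assumes, based only on the output above, that the two-argument callbacks receive a start offset and an exclusive end offset, and that onselfclosingtag receives the offset of the closing ">".

```javascript
const input = '<span />code with';

// onopentagname 1 5: the tag name
console.log(input.slice(1, 5));  // span

// onselfclosingtag 7: the ">" that ends "<span />"
console.log(input[7]);           // >

// ontext 8 17: the text that follows the tag
console.log(input.slice(8, 17)); // code with
```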

DimaIT avatar Jun 30 '25 09:06 DimaIT