Incorrect and inconsistent behavior of self-closing tags
"htmlparser2": "^10.0.0",
Minimal repro:
const { Parser, DomHandler } = require('htmlparser2');
const render = require('dom-serializer').default;
// Test case: self-closing tag followed by content
const input = '<span />content';
// Parse with htmlparser2
const handler = new DomHandler();
const parser = new Parser(handler);
parser.end(input);
// Get the parsed DOM and render it back to HTML
const dom = handler.root;
const output = render(dom);
console.log('Input HTML: ', input);
console.log('Output HTML: ', output);
Prints:
Input HTML: <span />content
Output HTML: <span>content</span>
I'd expect the output HTML to either match the input or, at least, look like <span></span>content (since span is not a void element, i.e. not isVoidElement).
Here's a more complex example with `pre` that prompted me to create this issue. The `start` and `end` indices are messed up:
const { Parser, DomHandler } = require('htmlparser2');
const render = require('dom-serializer').default;
// Test case: self-closing tag followed by content
const input = '<pre><span />code with</pre> test';
// Parse with htmlparser2
const handler = new DomHandler({
  withStartIndices: true,
  withEndIndices: true,
});
const parser = new Parser(handler);
parser.end(input);
// Get the parsed DOM and render it back to HTML
const dom = handler.root;
const output = render(dom);
console.log('Input HTML: ', input);
console.log('Output HTML: ', output);
// recursively display start and end indices for each node
function displayIndices(node, level = 0) {
  const indent = ' '.repeat(level * 2);
  const originalContent = node.startIndex !== undefined ? input.slice(node.startIndex, node.endIndex + 1) : '';
  console.log(`${indent}Node: ${node.type}, Start: ${node.startIndex}, End: ${node.endIndex} - Content: "${originalContent}"`);
  if (node.children) {
    node.children.forEach(child => displayIndices(child, level + 1));
  }
}
// Display the start and end indices for each node
console.log('Node indices:');
dom.children.forEach(child => displayIndices(child));
Result:
Input HTML: <pre><span />code with</pre> test
Output HTML: <pre><span>code with</span></pre> test
Node indices:
Node: tag, Start: 0, End: 27 - Content: "<pre><span />code with</pre>"
Node: tag, Start: 5, End: 27 - Content: "<span />code with</pre>"
Node: text, Start: 13, End: 21 - Content: "code with"
Node: text, Start: 28, End: 32 - Content: " test"
Note the span node: Node: tag, Start: 5, End: 27 - Content: "<span />code with</pre>". Its end index (27) points at the end of </pre>, not at the end of <span /> (index 12).
I have had the same issue in the past and found no workaround; it would be awesome if this behavior could be fixed! Since we use this package at a large scale via another one (sanitize-html), we had to use another parser to handle self-closing tags properly before passing the result to sanitize-html. A lot of work for a rather evident bug.
You can try setting new Parser({...eventsHandler}, {xmlMode: true}) to fix the self-closing error.
@molinla
You can try setting new Parser({...eventsHandler}, {xmlMode: true}) to fix the self-closing error.
Thanks, that is a good suggestion as a workaround (actually, it can be further reduced to recognizeSelfClosing: true, which is set by xmlMode).
However, the default "HTML mode" is still broken, so I'll leave the issue open.
Input HTML: <span />content
Output HTML: <span>content</span>
I get exactly the same output if I pass the HTML to any browser:
document.body.innerHTML = "<span />content";
document.body.innerHTML; // returns "<span>content</span>"
Don't think that's an issue
Self-closing tags on non-void elements were introduced by frameworks such as Vue.js, which recommends self-closing tags for components; in a Vue-powered environment this also works for all HTML tags.
I think that's what has led to this misunderstanding.
Don't think that's an issue
take a look at the second example I provided:
Input HTML: <pre><span />code with</pre> test
Output HTML: <pre><span>code with</span></pre> test
Node indices:
Node: tag, Start: 0, End: 27 - Content: "<pre><span />code with</pre>"
Node: tag, Start: 5, End: 27 - Content: "<span />code with</pre>"
Node: text, Start: 13, End: 21 - Content: "code with"
Node: text, Start: 28, End: 32 - Content: " test"
I'd expect a parser library to parse and return exactly the content that it's given, instead of hallucinating extra closing tags.
take a look at the second example I provided
Once again, the result matches the browser's behavior.
I'd expect a parser library to parse and return exactly the content that it's given, instead of hallucinating extra closing tags.
I understand your frustration, but from an HTML standpoint, the parser is behaving correctly.
For your use case, you'll need either an XML parser, a parser for an HTML-like flavor, or no parser at all.
The HTML tokenizer would do exactly what you need: read the input without making any structural changes. But it's lower-level, and you'll need to deal with any unexpected HTML structures yourself.
With your example:
import { Tokenizer } from 'htmlparser2';
const callbacks = {
  onopentagname() {
    console.log('onopentagname', ...arguments);
  },
  onselfclosingtag() {
    console.log('onselfclosingtag', ...arguments);
  },
  ontext() {
    console.log('ontext', ...arguments);
  },
  onend() {
    console.log('onend', ...arguments);
  },
};
const tokenizer = new Tokenizer({}, callbacks);
tokenizer.write(`<span />code with`);
tokenizer.end();
// Output:
// > onopentagname 1 5
// > onselfclosingtag 7
// > ontext 8 17
// > onend