Parsing a self-closing tag produces bad HTML

Open juanSanchezAlcala opened this issue 3 years ago • 4 comments

The self-closing tag is being removed.

Example:

cheerio.load(`<div class="n-content-video n-content-video--youtube">
			<iframe src="https://www.youtube.com/?rel=0"/>
		</div>`).html()

This produces:

'<html><head></head><body><div class="n-content-video n-content-video--youtube"> <iframe src="https://www.youtube.com/?rel=0"> </div></iframe></div></body></html>'

As you can see, the closing slash of the iframe has disappeared, producing bad HTML syntax.

juanSanchezAlcala avatar Aug 03 '22 08:08 juanSanchezAlcala

The way to resolve this is to use xmlMode or, as here, the recognizeSelfClosing option:

cheerio.load(`<div class="n-content-video n-content-video--youtube">
			<iframe src="https://www.youtube.com/?rel=0"/>
		</div>`,{recognizeSelfClosing : true}).html()

But even with this configuration, the DOM tree is not represented correctly.

juanSanchezAlcala avatar Aug 03 '22 08:08 juanSanchezAlcala

With the recognizeSelfClosing option, it also produces bad HTML syntax:


'<html><head></head><body><div class="n-content-video n-content-video--youtube">
			<iframe src="https://www.youtube.com/?rel=0">
		</div></iframe></div></body></html>'

juanSanchezAlcala avatar Aug 03 '22 08:08 juanSanchezAlcala

The original behaviour is in line with how browsers work: Try it in yours.
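
To spell out the rule behind this: in HTML syntax a trailing slash never closes an HTML element. Void elements such as br simply have no end tag, and every other element, iframe included, stays open until its explicit end tag. Since iframe content is parsed as raw text, the `</div>` that follows `<iframe .../>` ends up as text inside the iframe, which matches the output above. A minimal sketch of that rule, using a hypothetical helper that is not part of cheerio and only covers HTML elements (SVG and MathML elements do honor the slash):

```javascript
// Void elements per the HTML Standard: they never take an end tag.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Hypothetical helper (an assumption for illustration, not a cheerio API):
// does `<name ... />` leave the element open when parsed as HTML?
function staysOpenAfterSelfClosingSlash(tagName) {
  return !VOID_ELEMENTS.has(tagName.toLowerCase());
}

console.log(staysOpenAfterSelfClosingSlash('iframe')); // true
console.log(staysOpenAfterSelfClosingSlash('br'));     // false
```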

fb55 avatar Aug 03 '22 09:08 fb55

The problem comes from converting XML to HTML:

 cheerio.load(cheerio.load(`<div class="n-content-video n-content-video--youtube">
			<iframe src="https://www.youtube.com?rel=0"></iframe>
		</div><div><div class="n-recommended"></div></div>`).xml()).html()

It produces:


'<html><head></head><body><div class="n-content-video n-content-video--youtube">
			<iframe src="https://www.youtube.com?rel=0">
		</div><div><div class="n-recommended"/></div></body></html></iframe></div></body></html>'

Shouldn't the output text be the same as the input?

juanSanchezAlcala avatar Aug 03 '22 11:08 juanSanchezAlcala

When I put the code from above into Chrome, I got:

<html><head></head><body><div class="n-content-video n-content-video--youtube">
<iframe src="https://www.youtube.com?rel=0"></iframe>
</div><div><div class="n-recommended"></div></div>
</body></html>

So the browser actually closes the iframe before its parent div element.

Maybe you should decode the content first, so you can avoid this "repairing" behavior.

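// selfclosedHTML is the input markup string containing self-closed tags.
// Decode the self-closed tags: parse as an XML fragment, serialize as HTML.
const decodedHTML = cheerio.load(selfclosedHTML, { xmlMode: true }, false).html({ xmlMode: false });

// Now load the decoded markup as regular HTML.
console.info(cheerio.load(decodedHTML).html());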

Result:

<html><head></head><body><div class="n-content-video n-content-video--youtube">
<iframe src="https://www.youtube.com?rel=0"></iframe>
</div><div><div class="n-recommended"></div></div></body></html>
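
What the decoding step does to the markup can be approximated without a parser. This is only a rough sketch under stated assumptions: the markup is simple (no `>` inside attribute values, nothing that looks like a self-closed tag inside scripts or comments), and expandSelfClosingTags is a hypothetical helper, not a cheerio API. The xmlMode round trip above remains the more robust route.

```javascript
// Void elements per the HTML Standard: left untouched, they have no end tag.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Hypothetical helper: rewrite `<tag attrs/>` as `<tag attrs></tag>`
// for every non-void element, so an HTML parser sees an explicit end tag.
function expandSelfClosingTags(markup) {
  return markup.replace(
    /<([A-Za-z][\w:-]*)((?:\s+[^<>]*?)?)\s*\/>/g,
    (match, name, attrs) =>
      VOID_ELEMENTS.has(name.toLowerCase())
        ? match
        : `<${name}${attrs.trimEnd()}></${name}>`
  );
}

console.log(
  expandSelfClosingTags('<div><iframe src="https://www.youtube.com/?rel=0"/></div>')
);
// <div><iframe src="https://www.youtube.com/?rel=0"></iframe></div>
```

The result can then be passed to cheerio.load as regular HTML, since every non-void element now carries its own end tag.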

5saviahv avatar Aug 14 '22 12:08 5saviahv