Sitemap-Generator-Crawler

Entity escaping is missing

Open ghost opened this issue 6 years ago • 5 comments

Currently only the ampersand (&) is entity-escaped (&amp;). The Sitemap specification also requires the single quote, double quote, greater-than, and less-than characters to be entity-escaped:

Ampersand	&	&amp;
Single Quote	'	&apos;
Double Quote	"	&quot;
Greater Than	>	&gt;
Less Than	<	&lt;

This should be done to all the strings that are written into sitemap.xml.
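As a rough illustration of the escaping described above, here is a minimal Python sketch. It is not the project's code; the function name `entity_escape` is made up for this example. It relies on the standard library's `xml.sax.saxutils.escape`, which handles `&`, `<` and `>` by default and accepts extra replacements for the two quote characters.

```python
from xml.sax.saxutils import escape

# escape() already replaces &, < and >. The two quote characters are
# supplied as extra entities; they are applied after & has been handled,
# so the ampersands they introduce are not escaped a second time.
_QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def entity_escape(text: str) -> str:
    """Entity-escape the five characters the sitemap protocol lists.

    Hypothetical helper for illustration only.
    """
    return escape(text, _QUOTE_ENTITIES)


print(entity_escape("/search?q=a&b='c'"))
```

Every string written into sitemap.xml would pass through a helper like this exactly once, at write time.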

ghost avatar Sep 09 '17 20:09 ghost

One would wonder why somebody would put those into a href...

Should be corrected regardless.

vezaynk avatar Sep 09 '17 20:09 vezaynk

The ampersand separates GET parameters, so it is definitely something that can appear in an href.

studiosi avatar Sep 13 '17 07:09 studiosi

Here are the exact rules: https://www.sitemaps.org/protocol.html#escaping

All values have to be entity-escaped, and URLs must additionally be URL-escaped and encoded. The rules are very clear.
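A hedged sketch of that two-step order in Python, assuming UTF-8 as the target encoding: percent-encode the URL first, then entity-escape the result. The helper name `sitemap_loc` and the choice of "safe" characters are assumptions made for this example, not something taken from the project.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# Characters left untouched by percent-encoding: the URL delimiters plus
# "%" itself, so sequences that are already percent-encoded are not
# encoded twice. This set is an assumption for the sketch.
_SAFE = ":/?#[]@!$&'()*+,;=%"


def sitemap_loc(url: str) -> str:
    """Return a URL that is URL-escaped (UTF-8) and then entity-escaped.

    Hypothetical helper for illustration only.
    """
    url_escaped = quote(url, safe=_SAFE)  # step 1: percent-encode
    # step 2: entity-escape &, <, > and both quote characters
    return escape(url_escaped, {"'": "&apos;", '"': "&quot;"})


print(sitemap_loc("http://www.example.com/ümlat.php&q=name"))
```

The order matters: entity-escaping first would turn `&` into `&amp;`, and the percent-encoding step would then have to know to leave the `;` and `&` of the entity alone.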

ghost avatar Sep 13 '17 08:09 ghost

I'm toying with the idea of parsing properly encoded hrefs, letting cURL handle the weirdness, and encoding everything right before inserting it.

vezaynk avatar Sep 13 '17 14:09 vezaynk

If you encounter encoded and/or escaped URLs, you should decode and unescape them before adding them to the crawl list.

All the text should be encoded when it's written to the sitemap, but not before. Otherwise you will lose the uniqueness of URLs if you start encoding them while crawling. The encoding is needed only in the sitemap and for the sitemap.
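To make the uniqueness point concrete, here is a small Python sketch of the crawl-side half, under simplifying assumptions. The name `normalize_href` is hypothetical, and the approach is deliberately naive: fully percent-decoding a URL can conflate reserved characters (for example `%2F` and `/`), and `html.unescape` can misread a raw query such as `&copy=1`, so a real crawler would need more care.

```python
from html import unescape
from urllib.parse import unquote


def normalize_href(href: str) -> str:
    """Undo entity escaping, then percent-encoding.

    Hypothetical helper: the decoded form is what goes into the crawl
    list, so different spellings of one URL collapse to a single entry.
    """
    return unquote(unescape(href))


crawl_list = set()
for href in [
    "/a?x=1&amp;y=2",   # entity-escaped in the page source
    "/a?x=1&y=2",       # same URL, not escaped
    "/%C3%BCmlat.php",  # percent-encoded
    "/ümlat.php",       # same URL, raw
]:
    crawl_list.add(normalize_href(href))

print(sorted(crawl_list))  # two entries remain, not four
```

Encoding and escaping would then happen once, when each entry of the crawl list is written into the sitemap.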

ghost avatar Sep 13 '17 14:09 ghost