Sitemap-Generator-Crawler
Entity escaping is missing
Currently only the ampersand (&) is entity-escaped (&amp;). The sitemap specification also requires the single quote, double quote, greater-than and less-than characters to be entity-escaped:
Ampersand & &amp;
Single Quote ' &apos;
Double Quote " &quot;
Greater Than > &gt;
Less Than < &lt;
This should be done to all the strings that are written into sitemap.xml.
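As a rough illustration of the escaping described above, here is a minimal Python sketch (not the project's own code). The function name escape_for_sitemap is hypothetical; it relies on the standard library's xml.sax.saxutils.escape, which handles &, < and > by default and accepts extra replacements for the two quote characters.

```python
from xml.sax.saxutils import escape

# Extra replacements on top of the defaults (&, <, >) that escape() applies.
_QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def escape_for_sitemap(text: str) -> str:
    """Entity-escape the five characters listed in the sitemap protocol.

    Hypothetical helper for illustration; the ampersand is replaced first,
    so the entities produced for the other characters are not double-escaped.
    """
    return escape(text, _QUOTE_ENTITIES)


print(escape_for_sitemap("http://example.com/?a=1&b='x'"))
# prints: http://example.com/?a=1&amp;b=&apos;x&apos;
```

Applying one helper like this to every string at write time keeps the escaping rule in a single place.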
One would wonder why somebody would put those into a href...
Should be corrected regardless.
The ampersand separates GET parameters, so it can definitely appear in an href.
Here are the exact rules: https://www.sitemaps.org/protocol.html#escaping
Every value has to be entity-escaped, and URLs must also be URL-escaped and encoded. The rules are very clear.
I'm toying with the idea of parsing properly encoded hrefs, letting cURL handle the weirdness, and encoding everything right before inserting it.
If you encounter encoded and/or escaped URLs, you should decode and unescape them before adding them to the crawl list.
All the text should be encoded when it's written to the sitemap, but not before. Otherwise you will lose the uniqueness of URLs if you start encoding them while crawling. The encoding is needed only in the sitemap and for the sitemap.
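The decode-while-crawling, encode-when-writing idea could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the crawler's actual code: normalize_href and encode_for_sitemap are hypothetical names, and the set of characters left unencoded in the URL is an assumption based on the usual reserved URL delimiters.

```python
from html import unescape as html_unescape
from urllib.parse import quote, unquote
from xml.sax.saxutils import escape as xml_escape


def normalize_href(href: str) -> str:
    """Turn an href found in a page into its raw form for the crawl list.

    Undoes HTML entity escaping first, then percent-encoding. Simplification:
    percent-decoding everything also decodes reserved characters such as %2F.
    """
    return unquote(html_unescape(href))


def encode_for_sitemap(url: str) -> str:
    """URL-encode, then entity-escape, a raw URL just before writing it."""
    # Assumed safe set: URL delimiters stay as-is, everything else
    # (spaces, non-ASCII, literal '%') gets percent-encoded.
    encoded = quote(url, safe=":/?#[]@!$&'()*+,;=~")
    return xml_escape(encoded, {"'": "&apos;", '"': "&quot;"})


# Two differently escaped spellings of the same link collapse to one entry.
found = [
    "http://example.com/search?q=a&amp;b=c d",
    "http://example.com/search?q=a&b=c%20d",
]
crawl_list = {normalize_href(h) for h in found}
print(len(crawl_list))  # prints: 1

for url in crawl_list:
    print(encode_for_sitemap(url))
# prints: http://example.com/search?q=a&amp;b=c%20d
```

Keeping the crawl list in raw form means duplicates are detected by plain string comparison, and the encoded form exists only in the output file.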