sitemap_generator
Output XML doesn't follow the spec for URIs containing non-ASCII characters
As the sitemap spec says, the XML itself must be entity-escaped, which the gem already does. But the URL should first be encoded according to the RFC 3986 standard for URIs or the RFC 3987 standard for IRIs, and only then XML entity-escaped. sitemap_generator doesn't seem to follow RFC 3986 at the moment.
add 'linkTestEntityEscape&<> and RFC3986ü中文'
# output: <loc>https://website.test/linkTestEntityEscape&<> and RFC3986ü中文</loc>
# should be: <loc>https://website.test/linkTestEntityEscape%26%3C%3E%20and%20RFC3986%C3%BC%E4%B8%AD%E6%96%87</loc>
add 'ü中文?aaa=bbb'
# output: <loc>https://website.test/ü中文?aaa=bbb</loc>
# should be: <loc>https://website.test/%C3%BC%E4%B8%AD%E6%96%87?aaa=bbb</loc>
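The two-step order described above (percent-encode first, XML-escape last) can be sketched in plain Ruby. This is only an illustration, not the gem's code: `percent_encode` and `sitemap_loc` are hypothetical helper names, and the sketch assumes the caller passes the path as separate segments plus a query string that is already encoded.

```ruby
require "cgi"

# Hypothetical helper: percent-encode every byte that is not an
# RFC 3986 "unreserved" character (ALPHA / DIGIT / "-" / "." / "_" / "~").
def percent_encode(component)
  component.b
           .gsub(/[^A-Za-z0-9\-._~]/n) { |byte| format("%%%02X", byte.ord) }
           .force_encoding(Encoding::UTF_8)
end

# Hypothetical helper: build the text for a <loc> element.
# Assumes path_segments are raw (unencoded) and query is already encoded.
def sitemap_loc(host, path_segments, query = nil)
  path = path_segments.map { |segment| percent_encode(segment) }.join("/")
  url = "#{host}/#{path}"
  url += "?#{query}" if query
  CGI.escapeHTML(url) # XML entity escaping comes last
end

puts sitemap_loc("https://website.test", ["linkTestEntityEscape&<> and RFC3986ü中文"])
puts sitemap_loc("https://website.test", ["ü中文"], "aaa=bbb")
```

After percent-encoding, the only character left for the XML step to escape in practice is the `&` between query parameters.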
Can someone help me check whether my conclusion is right? I'm just a junior programmer and not sure about it.
If everything is OK, I'll send a PR for this issue later.
Best Regards, Lisbeth
Does anyone have any ideas?
Hi @lisbethw1130, I think you're right. When I wrote this gem years ago, it wasn't internationalized to handle UTF-8, which wasn't as prevalent then as it is today. It would be great if you could add that functionality, with tests :)
Here are some obstacles I ran into and how I'm handling them:
- URL escaping can't be done inside sitemap_generator, so I added a note to the README instead. For example, an unescaped URI can't be reliably split into its path and query parts:
https://example.com/dd?dd=?aa=vv
could mean https://example.com/dd%3Fdd=?aa=vv
or https://example.com/dd?dd=%3Faa=vv
- Ruby doesn't escape the single quote the way the XML spec describes. I've opened an issue to find out what is really going on.
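To make the first obstacle concrete, here is a small check using Ruby's standard `uri` library and the two example URLs from the list. The variable names are just for illustration. It shows that the two escaped URLs have different path and query parts, yet both decode to the same unescaped string, so the unescaped form alone can't tell us which one was meant.

```ruby
require "uri"

# The two possible readings from the example above.
path_reading  = URI.parse("https://example.com/dd%3Fdd=?aa=vv")
query_reading = URI.parse("https://example.com/dd?dd=%3Faa=vv")

puts path_reading.path    # the "?" belongs to the path
puts path_reading.query
puts query_reading.path   # the "?" belongs to the query value
puts query_reading.query

# Decoding both gives back one and the same ambiguous string.
decoded_a = URI.decode_www_form_component(path_reading.to_s)
decoded_b = URI.decode_www_form_component(query_reading.to_s)
puts decoded_a == decoded_b
```

This is why the escaping has to happen on the caller's side, where the path and query are still known separately.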
Any ideas are welcome ;)
Awesome that the change was released in Ruby, @lisbethw1130! 🚀