
Output XML doesn't follow the spec for URIs with non-ASCII characters

Open lisbethw1130 opened this issue 4 years ago • 4 comments

As the sitemap spec says, the XML itself must be entity-escaped, which the gem already does. But the URL should first be encoded according to the RFC-3986 standard for URIs or the RFC-3987 standard for IRIs, and only then XML entity-escaped. sitemap_generator doesn't seem to apply the RFC-3986 encoding at the moment.

add 'linkTestEntityEscape&<> and RFC3986ü中文' 
# output: <loc>https://website.test/linkTestEntityEscape&amp;&lt;&gt; and RFC3986ü中文</loc>
# should be: <loc>https://website.test/linkTestEntityEscape%26%3C%3E%20and%20RFC3986%C3%BC%E4%B8%AD%E6%96%87</loc>

add 'ü中文?aaa=bbb'
# output: <loc>https://website.test/ü中文?aaa=bbb</loc>
# should be: <loc>https://website.test/%C3%BC%E4%B8%AD%E6%96%87?aaa=bbb</loc>
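The two-step order described above (percent-encode first, XML entity escape last) could be sketched roughly like this. It is only an illustration: `encode_path_segment` and `loc_for` are hypothetical helpers, not part of sitemap_generator, and the sketch assumes the caller passes the path and an already-encoded query string separately.

```ruby
require "cgi"

# Bytes of the RFC 3986 "unreserved" set: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED_BYTES = [*"A".."Z", *"a".."z", *"0".."9", "-", ".", "_", "~"].map(&:ord).freeze

# Hypothetical helper: percent-encode every byte of one path segment
# that is not in the unreserved set (multi-byte UTF-8 characters become
# one %XX triplet per byte).
def encode_path_segment(segment)
  segment.each_byte.map { |b|
    UNRESERVED_BYTES.include?(b) ? b.chr : format("%%%02X", b)
  }.join
end

# Hypothetical helper: build a <loc> element.
# Assumption: `query`, if given, has already been percent-encoded by the caller.
def loc_for(host, path, query = nil)
  # Step 1: percent-encode each path segment, keeping "/" as the separator.
  encoded_path = path.split("/", -1).map { |seg| encode_path_segment(seg) }.join("/")
  uri = "#{host}/#{encoded_path}"
  uri += "?#{query}" if query
  # Step 2: XML entity escape last (CGI.escapeHTML handles & < > " ').
  "<loc>#{CGI.escapeHTML(uri)}</loc>"
end

puts loc_for("https://website.test", "linkTestEntityEscape&<> and RFC3986ü中文")
puts loc_for("https://website.test", "ü中文", "aaa=bbb")
```

With these assumptions the two calls print the "should be" lines from the examples above. The entity escape still matters after percent-encoding, because a query separator such as `&` must end up as `&amp;` in the XML.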

Could someone check whether my conclusion is right? I'm a junior programmer and not sure about it.

If everything looks OK, I'll send a PR for this issue later.

Best Regards, Lisbeth

lisbethw1130 avatar Apr 20 '20 10:04 lisbethw1130

Does anyone have any ideas?

lisbethw1130 avatar Apr 29 '20 06:04 lisbethw1130

Hi @lisbethw1130 I think you're right. When I wrote this gem years ago it wasn't internationalized to handle UTF-8 and that wasn't as prevalent as it is today. It would be great if you could add that functionality, with tests :)

kjvarga avatar May 26 '20 06:05 kjvarga

Here are some obstacles I ran into and am working through:

  1. URL escaping can't be done inside sitemap_generator, so I wrote a note about it in the README. For example, an unescaped URI can't be reliably split into its path and query parts:

https://example.com/dd?dd=?aa=vv could mean either https://example.com/dd%3Fdd=?aa=vv or https://example.com/dd?dd=%3Faa=vv

  2. Ruby doesn't escape single quotes the way the XML spec describes. I've just opened an issue to track down the root cause.
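The ambiguity in the first point can be reproduced with Ruby's standard `uri` library. This is only a small demonstration: `split_path_and_query` is a hypothetical helper name, and the URLs are the ones from the example above.

```ruby
require "uri"

UNESCAPED = "https://example.com/dd?dd=?aa=vv"

# Two differently escaped URLs that both decode to UNESCAPED.
ESCAPED_CANDIDATES = [
  "https://example.com/dd%3Fdd=?aa=vv", # "?" belongs to the path
  "https://example.com/dd?dd=%3Faa=vv", # "?" belongs to the query value
].freeze

# Hypothetical helper: return [path, query] as the parser sees them.
def split_path_and_query(url)
  uri = URI.parse(url)
  [uri.path, uri.query]
end

ESCAPED_CANDIDATES.each do |url|
  decoded = URI.decode_www_form_component(url)
  # Both candidates decode to the same string, but split differently.
  puts "#{url} -> #{split_path_and_query(url).inspect} (decodes to #{decoded})"
end

# The parser simply picks the first "?" for the unescaped form,
# which may or may not be what the author of the URL meant.
puts split_path_and_query(UNESCAPED).inspect
```

Since both candidates collapse to the same unescaped string, a generator that only receives the unescaped form has no way to tell which one was intended. That is why the escaping has to happen on the caller's side.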

Any ideas are welcome ;)

lisbethw1130 avatar May 29 '20 18:05 lisbethw1130

Awesome that the change was released in Ruby, @lisbethw1130! 🚀

olleolleolle avatar Mar 29 '23 12:03 olleolleolle