Bug: Crawling non-ASCII characters (URL)

Open • safesploit opened this issue 2 years ago • 1 comment

When crawling the Japanese Wikipedia main page, ja.wikipedia.org/wiki/メインページ, the URL is indexed in its percent-encoded form: https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8
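To illustrate the report, here is a minimal sketch showing that the indexed string is just the percent-encoded form of the readable URL. The variable names are made up for this example, and it assumes the script file is saved as UTF-8.

```php
<?php
// The URL exactly as it appears in the report above.
$indexed = 'https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8';

// rawurldecode() turns each %XX sequence back into the byte it stands for,
// which restores the UTF-8 page title.
$readable = rawurldecode($indexed);
echo $readable, PHP_EOL;

// Going the other way: encoding only the page title reproduces the
// escaped path segment seen in the index.
$title = 'メインページ';
$encodedTitle = rawurlencode($title);
echo $encodedTitle, PHP_EOL;
```

Both forms point at the same page, so the issue is about which form gets stored and shown, not about the crawl hitting the wrong address.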

safesploit, Nov 15 '22 23:11

Hey, I've written this up and it works, but am I missing anything?

I tested it and it works fine. I also tested a URL containing a ` character (the only thing not covered by htmlspecialchars) and it didn't break.

I've also noticed that HTML tags are removed from URL titles: if the title says "<b>Hi", the result is "Hi". That is kind of an issue depending on the circumstances; I'd rather the title be processed with htmlspecialchars than have the tags removed. Anyway:

At line 88 of crawl-manual, insert $url = htmlspecialchars(urldecode($url), ENT_QUOTES, "UTF-8");
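A minimal sketch of that one-liner as a standalone function, plus the title handling suggested above, in case it helps with testing outside the crawler. The helper names displayUrl and displayTitle are hypothetical and not from the doogle source, and strip_tags() is only a stand-in for whatever the crawler currently does to titles.

```php
<?php
// Hypothetical helpers for illustration; not part of the doogle code base.

function displayUrl(string $url): string
{
    // Decode first: %XX sequences become raw bytes (and '+' becomes a space).
    $decoded = urldecode($url);
    // Escape afterwards, so anything revealed by decoding (e.g. %3C -> <)
    // is still neutralised. ENT_QUOTES covers &, <, >, " and '.
    return htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');
}

function displayTitle(string $title): string
{
    // Keep the tag text visible instead of dropping it.
    return htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
}

echo displayUrl('https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8'), PHP_EOL;
echo displayUrl('http://example.com/?q=%3Cb%3E'), PHP_EOL;

// Escaping keeps "<b>" as text; strip_tags() would leave only "Hi".
echo displayTitle('<b>Hi'), PHP_EOL;
echo strip_tags('<b>Hi'), PHP_EOL;
```

The order matters here: escaping before decoding would let an encoded < or " slip through unescaped. One thing to watch is that urldecode() also turns + into a space, so rawurldecode() may be the safer choice if a literal + in a URL needs to survive.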

dehlirious, Mar 31 '23 06:03