
Better URL parsing according to whatwg URL standard

Open WGH- opened this issue 3 years ago • 2 comments

As of now, Colly parses URLs with Go stdlib's net/url parser. This parser is fairly strict and doesn't implement some of the quirks that browsers do. Since Colly is a web crawling framework, it should follow browser behavior so that it can handle all the weird stuff that happens on the WWW.

Fortunately, there's a web standard that codifies the quirks: https://url.spec.whatwg.org/#url-parsing

I'll give a few examples that the net/url parser handles incorrectly:

  • Literal tabs and newlines should be stripped from the input, even if they appear in the middle of the input. I attempted to fix this in #589.
  • Non-ASCII characters should be automatically percent-encoded. For example, <a href="/?тест">foo</a> and <a href="/?%D1%82%D0%B5%D1%81%D1%82">bar</a> both lead to the same location, and the HTTP request on the wire would be GET /?%D1%82%D0%B5%D1%81%D1%82 HTTP/1.1 in both cases (this assumes UTF-8 encoding). Note that simply percent-encoding the whole input is wrong, as it would lead to a double-encoded string when the input is already encoded.

I have found this Go library: https://github.com/nlnwa/whatwg-url. This library doesn't appear to be popular, but it has a rather large test suite borrowed from https://github.com/web-platform-tests/wpt/tree/master/url, which it seems to pass.

Rather than implementing these URL parsing quirks one by one in Colly and duplicating effort, I think I'll check out that library to see whether it's a good fit for us, and report back with results.
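For illustration, the two quirks above boil down to a pre-processing pass roughly like this (a minimal sketch with hypothetical helper names, not the whatwg-url API; a real WHATWG parser does this internally and covers many more cases):

```go
package main

import (
	"fmt"
	"strings"
)

// stripTabsNewlines implements the WHATWG input-preparation step
// "remove all ASCII tab or newline from input", anywhere in the string.
func stripTabsNewlines(s string) string {
	return strings.Map(func(r rune) rune {
		if r == '\t' || r == '\n' || r == '\r' {
			return -1 // drop the character
		}
		return r
	}, s)
}

// encodeNonASCII percent-encodes only bytes >= 0x80 (the UTF-8 bytes of
// non-ASCII characters). Existing %XX sequences are pure ASCII and pass
// through untouched, so already-encoded input is not double-encoded.
func encodeNonASCII(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c >= 0x80 {
			fmt.Fprintf(&b, "%%%02X", c)
		} else {
			b.WriteByte(c)
		}
	}
	return b.String()
}

func main() {
	raw := "https://example.com/?тес\tт"
	fmt.Println(encodeNonASCII(stripTabsNewlines(raw)))
	// → https://example.com/?%D1%82%D0%B5%D1%81%D1%82

	// Already-encoded input passes through unchanged:
	enc := "https://example.com/?%D1%82%D0%B5%D1%81%D1%82"
	fmt.Println(encodeNonASCII(enc))
	// → https://example.com/?%D1%82%D0%B5%D1%81%D1%82
}
```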

What do you think?

WGH- avatar Mar 21 '21 21:03 WGH-

@WGH- your points are valid, I totally agree. There is no need to reinvent the wheel. I have to look at whatwg-url in more detail, but it looks promising at first glance.

asciimoo avatar Mar 23 '21 16:03 asciimoo

I am using version 2.1.0 and faced a similar problem: if the link I am trying to get data from contains special characters, the request is executed with those characters percent-encoded. Not all sites handle this correctly, and some return 404.

For example, for the page address https://example.com/some's-page-path, the request will be sent to https://example.com/some%27s-page-path.

And since the server does not handle such cases, it will return 404.

I agree that this is a mistake on the side of the particular site, but there is no way to influence it, and I still need to parse the data 😇

Is there any solution to this problem now?

fussraider avatar Dec 08 '23 13:12 fussraider