
Add visited URLs in `Crawly.Response.t`

Open · tanguilp opened this issue 3 years ago · 4 comments

When a request is being redirected and redirection is handled by the fetcher directly, it is necessary to add both the initial and the landing URL to the list of visited URLs.

This is possible using both the `%Crawly.Request{}` and `%Crawly.Response{}` structs when there is only one redirect.

However, when there are more, for example when link A leads to the redirect chain A -> B -> C -> D, we lose information. I suggest adding a `:visited_urls` field to the `%Crawly.Response{}` struct, as sketched below.
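For illustration only, a minimal sketch of what the proposed field could look like; `:visited_urls` is the suggested addition, and the surrounding field names are placeholders rather than the library's actual struct definition:

```elixir
defmodule Crawly.Response do
  # Hypothetical extension: :visited_urls accumulates every URL seen
  # along the redirect chain, oldest first (e.g. [A, B, C, D]).
  # The other fields are illustrative placeholders.
  defstruct [
    :status_code,
    :body,
    :headers,
    :request,
    :request_url,
    visited_urls: []
  ]
end
```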

At first glance HTTPoison wouldn't support it, but I'm also thinking of a CDP driver. There can even be JavaScript redirects (seen in the wild on sites worth scraping).

tanguilp · Dec 04 '20

@tanguilp could you clarify what you mean by a CDP driver?

Ziinc · Dec 04 '20

Oh, sorry, Chrome DevTools Protocol.

Sites increasingly use techniques to detect headless browsers, and driving a real browser is sometimes the only alternative.

To use it, you send a command to Chrome such as "open this URL in this tab" and wait for events such as "main page received", "redirect", "DOM content loaded", etc. When you're happy with the response, you either parse the received payload of the main page, or you load the DOM and convert it to HTML. Like a headless browser, it allows waiting for JavaScript to load some items.
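As a rough illustration of that flow, here is a hedged sketch that records the redirect chain. The `CDPClient` module and the event-message shape are assumptions (placeholders for whatever CDP client is used); `"Page.navigate"`, `"Network.requestWillBeSent"` and `"Page.loadEventFired"` are standard Chrome DevTools Protocol commands/events:

```elixir
defmodule RedirectAwareFetch do
  # Sketch only: assumes the CDP client forwards protocol events to this
  # process as {:cdp_event, method, params} messages.
  def fetch(session, url) do
    CDPClient.send_command(session, "Page.navigate", %{url: url})
    collect_urls(session, [url])
  end

  defp collect_urls(session, visited) do
    receive do
      # A redirect surfaces as a requestWillBeSent event carrying the
      # redirectResponse of the previous hop and the new request URL.
      {:cdp_event, "Network.requestWillBeSent",
       %{"redirectResponse" => %{}, "request" => %{"url" => next_url}}} ->
        collect_urls(session, [next_url | visited])

      # The main page finished loading: return the chain, oldest first.
      {:cdp_event, "Page.loadEventFired", _params} ->
        Enum.reverse(visited)
    after
      10_000 -> {:error, :timeout, Enum.reverse(visited)}
    end
  end
end
```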

tanguilp · Dec 04 '20

I think we should narrow this issue down to the ability to track URL redirections through some form of URL history.

Currently hackney does not store intermediate locations (https://github.com/benoitc/hackney#automatically-follow-a-redirection), so one possible way to solve this is to handle redirects on the Crawly side instead, as sketched below.
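A hedged sketch of that client-side approach, assuming HTTPoison as the fetcher (which does not follow redirects by default, so each 3xx hop and its `Location` header are visible to the caller):

```elixir
defmodule ManualRedirects do
  # Sketch: follow redirects manually so every intermediate URL is kept.
  @max_hops 10

  def get(url, visited \\ [])

  def get(_url, visited) when length(visited) >= @max_hops,
    do: {:error, :too_many_redirects, Enum.reverse(visited)}

  def get(url, visited) do
    case HTTPoison.get(url) do
      {:ok, %HTTPoison.Response{status_code: code, headers: headers}}
      when code in 300..399 ->
        case location(headers) do
          nil -> {:error, :missing_location, Enum.reverse([url | visited])}
          # Location may be relative, so resolve it against the current URL.
          loc -> url |> URI.merge(loc) |> to_string() |> get([url | visited])
        end

      {:ok, %HTTPoison.Response{} = resp} ->
        # Final response plus the full visited-URL history, oldest first.
        {:ok, resp, Enum.reverse([url | visited])}

      {:error, reason} ->
        {:error, reason, Enum.reverse([url | visited])}
    end
  end

  # Case-insensitive Location header lookup.
  defp location(headers) do
    Enum.find_value(headers, fn {k, v} ->
      if String.downcase(k) == "location", do: v
    end)
  end
end
```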

Ziinc · Dec 10 '20

That could indeed be a solution for hackney. Another option would be to write a wrapper around it that does the work of saving the URL history. But IMO this is a defect of that library, because it loses useful information.

Also note that for browser-based fetchers, the browser follows redirections according to its own rules and there's no way to stop it (for instance to prevent the browser from following a 302). The only thing one can do (at least with the CDP protocol) is to receive such events and store them while the page is redirected to its final destination (or to another URL that triggers yet another redirect).

tanguilp · Dec 10 '20