
Document how to implement custom spidering strategies

Open · englehardt opened this issue 6 years ago · 6 comments

OpenWPM currently has a browse command which selects a few internal links from the current page and follows them. We should rethink this command. Ideally, the crawler would be able to execute an arbitrary link-selection-and-following strategy.

englehardt avatar Oct 10 '19 04:10 englehardt
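A pluggable strategy like the one described above could separate "which links exist" from "which links to follow". The sketch below is illustrative, not OpenWPM's actual API: internal_links and random_strategy are hypothetical helper names, and the hrefs list would in practice come from the webdriver.

```python
# Hypothetical sketch of a pluggable link-selection strategy.
# internal_links() and random_strategy() are illustrative names,
# not part of OpenWPM; hrefs would come from the live page.
import random
from urllib.parse import urljoin, urlparse

def internal_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep only same-host links."""
    host = urlparse(page_url).netloc
    links = set()
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == host:
            links.add(absolute)
    return sorted(links)

def random_strategy(links, num_links=3, seed=None):
    """One example strategy: pick up to num_links links at random."""
    rng = random.Random(seed)
    return rng.sample(links, min(num_links, len(links)))
```

With this split, the current browse behavior is just one strategy function among many; a user could swap in breadth-first, keyword-matching, or depth-limited selection without touching the crawling loop.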

We should aggregate links and then click them.

Since clicks might not actually lead to navigations, we'll probably need to anticipate a navigation and time out if we don't get one, at which point we might need to reset page state to be able to interact with the rest of the links.

Ideally someone's already figured out a good way to do this so we can reuse something.

nhnt11 avatar Nov 13 '19 09:11 nhnt11
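The click-then-wait pattern nhnt11 describes can be sketched with a simple polling helper. This is a minimal stand-in, not OpenWPM code: driver and element here are assumed to behave like Selenium's WebDriver and WebElement (a current_url attribute and a click() method), and the helper mirrors what Selenium's WebDriverWait does internally.

```python
# Sketch: click an element and wait for a navigation, timing out if
# the click leads nowhere. driver/element are assumed to look like
# Selenium objects (current_url attribute, click() method).
import time

def wait_for(predicate, timeout=5.0, poll=0.1):
    """Poll predicate() until it returns True or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll)
    return predicate()

def click_and_await_navigation(driver, element, timeout=5.0):
    """Click element; True iff the page URL changed within timeout."""
    url_before = driver.current_url
    element.click()
    return wait_for(lambda: driver.current_url != url_before, timeout)
```

When this returns False, the caller knows the click did not navigate and can re-navigate to the original URL (e.g. driver.get(url_before)) to reset page state before trying the remaining links.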

The problem I faced when trying to implement something similar was that each command's timeout has to be defined before the command is executed, and it cannot be updated once the command sequence has started. Scenario: I wanted to collect all links on a website and then visit them. Technically that is not a problem: (1) visit the page, (2) collect all links, and (3) visit each link. Functions for each of these tasks already exist. However, there is no reasonable way to define a timeout for the last step, as one does not know in advance how many links a page might contain. Since I did not want to set very long timeouts, I ended up saving all links to a file and crawling them in a second run.

turban1988 avatar Dec 20 '19 09:12 turban1988
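The two-pass workaround described above can be sketched as follows: pass 1 appends discovered links to a file, and pass 2 reads them back so each visit gets its own fixed per-command timeout. The file name, record format, and helper names here are illustrative, not anything OpenWPM provides.

```python
# Sketch of the two-pass workaround: persist discovered links in
# pass 1, replay them as individual visits in pass 2. The JSONL
# format and function names are illustrative choices.
import json
from pathlib import Path

def save_links(path, page_url, links):
    """Pass 1: append one JSON record per discovered link."""
    with open(path, "a") as f:
        for link in links:
            f.write(json.dumps({"source": page_url, "url": link}) + "\n")

def load_links(path):
    """Pass 2: read back a de-duplicated, order-preserving link list."""
    seen, ordered = set(), []
    for line in Path(path).read_text().splitlines():
        url = json.loads(line)["url"]
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered
```

In the second run, each URL from load_links() becomes its own visit with a normal fixed timeout, sidestepping the problem that the total number of links is unknown when the command sequence is built.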

@skim1102

vringar avatar Apr 15 '20 19:04 vringar

In the new command model, it will be easier for users to implement this themselves. But it would help to have different built-in strategies.

englehardt avatar Nov 09 '20 14:11 englehardt

I don't think we can provide much in the way of a generic spidering strategy, since it heavily depends on the use case. I would instead suggest publishing my gclid crawling code as a tutorial (either in the docs or on my website) and linking to it. That way we show people how to do what they want while slimming down our code base.

vringar avatar Nov 11 '20 12:11 vringar

Looking back at @turban1988's comment and other issues on this repository, we should document why shared state between the TaskManager/main process and the BrowserManagers is impossible, and then provide an example using Redis or something similar that shows how to crawl recursively in an effective way.

vringar avatar Dec 21 '21 17:12 vringar
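Since the TaskManager and the BrowserManagers run in separate processes, a shared crawl frontier has to live in an external store; a Redis list plus set (LPUSH/RPOP for the queue, SADD/SISMEMBER for de-duplication) is one natural fit. The in-memory class below is only a stand-in showing the interface such a frontier would expose; the CrawlFrontier name and Redis key mapping are assumptions for illustration.

```python
# Sketch of a shared crawl frontier for recursive crawling. This
# in-memory version stands in for an external store; the comments
# note the Redis operations each piece would map to.
from collections import deque

class CrawlFrontier:
    def __init__(self):
        self.seen = set()      # Redis: SADD / SISMEMBER on a "seen" set
        self.queue = deque()   # Redis: LPUSH / RPOP on a "frontier" list

    def add(self, url):
        """Enqueue url once; return True if it was new."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.appendleft(url)
        return True

    def next_url(self):
        """Pop the next URL to visit, or None when the frontier is empty."""
        return self.queue.pop() if self.queue else None
```

Each visit would then pull a URL with next_url(), extract links, and add() them back, so discovered links from one browser feed visits by any other, without the processes sharing memory.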