pjscrape

Early suite exit

Open • nrabinowitz opened this issue 13 years ago • 1 comment

Use case:

Let's call http://www.example.com/ "root". "root" contains links to root.1, root.2, root.3 ... root.250 (see hermitageart.com for an actual example, with 260 links). Each of these 250 pages contains links to other pages. If my feature of interest is found only in root.3 and root.102, then ideally root.4, root.5, ... root.250 would not be accessed at all, i.e. page.open would not be called on them.

I think this would need to be addressed by setting a flag (maybe on the _pjs.state object?) to end the suite early. The flag could be checked in the page completion callback, which would then empty out the array of still-to-scrape pages. Open question: this only affects the current level of recursion. Is that enough, or do we need an early exit from the entire suite?
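A minimal sketch of the flag idea, assuming a plain queue of pending URLs. Everything here is hypothetical: `runSuite`, `scrapePage`, and the `exitEarly` flag are illustrative names, not pjscrape's actual internals, and the real runner is asynchronous where this one is a simple loop.

```javascript
// Hypothetical sketch: none of these names come from pjscrape itself.
function runSuite(urls, scrapePage, state) {
    var pending = urls.slice();   // the still-to-scrape pages
    var visited = [];
    while (pending.length > 0) {
        var url = pending.shift();
        scrapePage(url, state);   // stands in for page.open plus the scrapers
        visited.push(url);
        // "Page completion" check: if the flag is set, drop the remaining pages.
        if (state.exitEarly) {
            pending.length = 0;
        }
    }
    return visited;
}

// Usage: the scraper sets the flag once it has found what it needs.
var urls = ['root.1', 'root.2', 'root.3', 'root.4', 'root.5'];
var state = { exitEarly: false };
var visited = runSuite(urls, function (url, st) {
    if (url === 'root.3') {
        st.exitEarly = true;
    }
}, state);
console.log(visited.join(', '));
```

As the question above notes, a flag like this only stops the queue it is checked against; child suites spawned by recursion would need their own check.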

nrabinowitz • Dec 15 '11 00:12

Better option here:

  • add a complete callback option, in the PhantomJS scope, with page as an argument
  • add page.manager as a pointer to the SuiteManager
  • add SuiteManager.endSuite() and SuiteManager.endAllSuites() to control flow.

It's relatively simple to end the current suite (set its urls array to []) and to end all suites (that, plus setting suites = []). Killing the ancestor of the current suite in a recursive situation might be more difficult - it's worth thinking about whether I'd want/need an actual tree structure to manage the suites if I wanted more fine-grained control.
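The two easy cases could look roughly like the sketch below. This is an assumption-laden illustration, not pjscrape's actual `SuiteManager`: the constructor shape, the `run` loop, the `visited` log, and the stub page object are all invented for the example, and only the names `endSuite`, `endAllSuites`, `complete`, and `page.manager` come from the proposal above.

```javascript
// Hypothetical sketch of the proposed control flow; not pjscrape's real SuiteManager.
function SuiteManager(suites) {
    // Copy each suite so the caller's arrays are not mutated.
    this.suites = suites.map(function (s) {
        return { urls: s.urls.slice() };
    });
    this.current = null;
    this.visited = [];
}

// End the current suite by emptying its list of remaining URLs.
SuiteManager.prototype.endSuite = function () {
    if (this.current) {
        this.current.urls = [];
    }
};

// End the current suite and discard every queued suite as well.
SuiteManager.prototype.endAllSuites = function () {
    this.endSuite();
    this.suites = [];
};

// Synchronous stand-in for the real, asynchronous page loop.
SuiteManager.prototype.run = function (complete) {
    while (this.suites.length > 0) {
        this.current = this.suites.shift();
        while (this.current.urls.length > 0) {
            var url = this.current.urls.shift();
            this.visited.push(url);
            // The stub page carries a pointer back to the manager.
            complete({ url: url, manager: this });
        }
    }
    this.current = null;
    return this.visited;
};

function makeSuites() {
    return [{ urls: ['a.1', 'a.2', 'a.3'] }, { urls: ['b.1', 'b.2'] }];
}

// Ending only the current suite skips a.3 but still runs the second suite.
var visitedEndSuite = new SuiteManager(makeSuites()).run(function (page) {
    if (page.url === 'a.2') { page.manager.endSuite(); }
});

// Ending all suites stops everything after a.2.
var visitedEndAll = new SuiteManager(makeSuites()).run(function (page) {
    if (page.url === 'a.2') { page.manager.endAllSuites(); }
});

console.log(visitedEndSuite.join(', '));
console.log(visitedEndAll.join(', '));
```

A flat queue like this has no notion of which suite spawned which, which is why ending an ancestor suite would likely need the tree structure mentioned above.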

nrabinowitz • Jan 04 '12 22:01