pjscrape
Early suite exit
Use case:
Let's call http://www.example.com/ "root". "root" contains links to root.1, root.2, root.3 ... root.250 (see hermitageart.com for an actual example with 260 links!). Each of these 250 links contains links to other pages. If my feature of interest was found only in root.3 and root.102, then ideally I would have liked root.4, root.5, ... root.250 not to be accessed, i.e. page.open should not be called on them.
I think this would need to be addressed by setting a flag (maybe on the _pjs.state object?) to end the suite early, which could be checked in the page completion callback, emptying out the array of still-to-scrape pages. Question: this only affects the current level of recursion. Is that good? Do we need an early exit from the entire suite?
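A minimal sketch of the flag idea, assuming a simplified synchronous runner. `runSuite`, `scrapePage`, and `state.endSuite` are hypothetical names used for illustration only and are not pjscrape's actual internals; the real page loop is asynchronous.

```javascript
// Hypothetical stand-in for a suite runner; not pjscrape's actual internals.
function runSuite(urls, scrapePage) {
  var state = { endSuite: false };   // plays the role of a flag on _pjs.state
  var queue = urls.slice();          // still-to-scrape pages
  var visited = [];

  while (queue.length > 0) {
    var url = queue.shift();
    visited.push(url);               // this is where page.open would be called
    scrapePage(url, state);          // the scraper may set state.endSuite = true

    // Page completion check: drop every page that has not been opened yet.
    if (state.endSuite) {
      queue.length = 0;
    }
  }
  return visited;
}

// Usage: for illustration, end the suite as soon as root.3 has been scraped.
var urls = ['root.1', 'root.2', 'root.3', 'root.4', 'root.5'];
var visited = runSuite(urls, function (url, state) {
  if (url === 'root.3') {
    state.endSuite = true;
  }
});
console.log(visited);
```

Because the flag is checked only after a page completes, the page that sets it is still fully scraped; only the remaining pages at that level are skipped.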
Better option here:
- add a `complete` callback option, in the PhantomJS scope, with `page` as an argument
- add `page.manager` as a pointer to the SuiteManager
- add `SuiteManager.endSuite()` and `SuiteManager.endAllSuites()` to control flow.
It's relatively simple to end the current suite (set its urls array to []) and to end all suites (that, plus setting suites = []). Killing the ancestor of the current suite in a recursive situation might be more difficult - it's worth thinking about whether I'd want/need an actual tree structure to manage the suites if I wanted more fine-grained control.
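A rough sketch of how these pieces might fit together, under stated assumptions. The names `complete`, `page.manager`, `endSuite`, and `endAllSuites` come from the proposal above; the `run` loop, the `current` and `opened` fields, and the suite object shape are hypothetical and simplified (synchronous, no recursion), not shipped pjscrape code.

```javascript
// Hypothetical sketch; names follow the proposal above, not shipped pjscrape code.
function SuiteManager(suites) {
  this.suites = suites.slice();   // suites still waiting to run
  this.current = null;            // suite currently being scraped
  this.opened = [];               // URLs that were actually "opened"
}

// End the current suite: empty its list of still-to-scrape URLs.
SuiteManager.prototype.endSuite = function () {
  if (this.current) {
    this.current.urls = [];
  }
};

// End everything: end the current suite and drop all pending suites.
SuiteManager.prototype.endAllSuites = function () {
  this.endSuite();
  this.suites = [];
};

SuiteManager.prototype.run = function () {
  while (this.suites.length > 0) {
    this.current = this.suites.shift();
    while (this.current.urls.length > 0) {
      var url = this.current.urls.shift();
      this.opened.push(url);                    // stands in for page.open(url)
      var page = { url: url, manager: this };   // page.manager points back here
      if (this.current.complete) {
        this.current.complete(page);            // proposed `complete` callback
      }
    }
  }
  this.current = null;
  return this.opened;
};

// Usage: the first suite ends itself early, the second ends all suites.
var manager = new SuiteManager([
  { urls: ['a.1', 'a.2', 'a.3'], complete: function (page) {
    if (page.url === 'a.2') { page.manager.endSuite(); }
  } },
  { urls: ['b.1', 'b.2'], complete: function (page) {
    if (page.url === 'b.1') { page.manager.endAllSuites(); }
  } },
  { urls: ['c.1'] }
]);
var opened = manager.run();
console.log(opened);
```

In this flat version a suite has no link to the suite that spawned it, which is why ending an ancestor in a recursive scrape would need extra bookkeeping such as a parent pointer or a tree of suites.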