pjscrape

preSuite(page) function

Open nrabinowitz opened this issue 12 years ago • 9 comments

Add an option for a preSuite function, in the PhantomJS environment, with page passed in as an argument, to support things like session-based authentication before a scrape

nrabinowitz avatar Dec 12 '11 05:12 nrabinowitz

Any plans for this? I think it's a big gap not to have control over the PhantomJS objects. I can help!

macedd avatar Jun 15 '12 14:06 macedd

I agree - I just haven't had much time for this project lately. If you'd like to contribute, please fork and submit a pull request - that would be great!

nrabinowitz avatar Jun 15 '12 16:06 nrabinowitz

I took a deep look at the code, and it is not clear to me how to easily implement this "preSuite", which would allow form authentication, for example. If you have a clue, please let me know.

macedd avatar Jun 17 '12 07:06 macedd

My thought was that this would be pretty simple - just a hook for an arbitrary function to run before the suites started, say here, passing in SuiteManager.getPage(). This would allow arbitrary pre-suite automation, e.g. logging in, using the WebPage object that will then get used in the scraper suites.

nrabinowitz avatar Jun 18 '12 16:06 nrabinowitz

Hi nrabinowitz, thank you for still caring for this great piece of software. I took your point into consideration and studied phantom/pjscrape a bit more, and maybe we aren't on the right track yet with "preSuite".

A login process with phantom is a step-by-step, waitFor-like action (http://groups.google.com/group/phantomjs/browse_thread/thread/db4cfc37caf0213c#). These steps should also run inside the opened page (WebPage.open()), because that is where the cookies and session exist (I cannot confirm this from the documentation, only from examples). So in the current implementation of pjscrape, with moreUrls and suites each being a separate page open (and losing context), we cannot bind an authenticated session to the scrape in a straightforward manner.

Fortunately, setting up the WebPage objects may be as simple as what is implemented in the pageSettings pull request (which can be improved), but things like step-by-step navigation or a login aren't that simple (IMHO).

So in my view we must refactor the code to allow use cases like the ones we are discussing, and also add these new tools to the library.

macedd avatar Jun 19 '12 06:06 macedd
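The step-by-step, waitFor-like login described in the comment above can be sketched as a small polling helper. This is only an illustration modelled on the pattern used in PhantomJS examples: `waitFor`, its parameters, and the `session` object are assumptions made for this sketch, not pjscrape or PhantomJS API, and the "login" here is simulated with a timer so the code runs outside PhantomJS.

```javascript
// Hypothetical polling helper: calls testFn every intervalMs until it
// returns true (then onReady) or timeoutMs has elapsed (then onTimeout).
function waitFor(testFn, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        if (testFn()) {
            clearInterval(timer);
            onReady();
        } else if (Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            onTimeout();
        }
    }, intervalMs || 100);
}

// Stand-in for a login that completes asynchronously. In PhantomJS the
// test function would instead inspect the page, e.g. via page.evaluate().
var session = { loggedIn: false };
var steps = [];
setTimeout(function () { session.loggedIn = true; }, 50);

waitFor(
    function () { return session.loggedIn; },        // condition to poll
    function () { steps.push('login confirmed'); },  // next step of the flow
    function () { steps.push('login timed out'); },  // give up
    2000,
    10
);
```

Each step of a multi-page login would chain another `waitFor` call from the previous step's `onReady` handler.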

Ok, I see the point that it needs to handle an asynchronous process. But it doesn't make any sense for it to happen within WebPage.open(), because that would restrict it to a single page, and what's needed is a multiple-page process. I haven't tested it, but I'm pretty sure cookies and session are attached to the WebPage object and will survive multiple open calls (otherwise what's the point?).

I'm not in favor of a big refactor at this point, and I don't think I see the requirement. I think what we need is preSuite(page, callback) - this allows you to do whatever async initialization you want, then invoke the callback when you're done. The callback would then kick off the suite runner and run as usual.

nrabinowitz avatar Jun 20 '12 16:06 nrabinowitz
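A minimal sketch of how the proposed preSuite(page, callback) hook could behave, assuming a runner that defers the suites until the callback fires. `runWithPreSuite`, `startSuites`, and `makeStubPage` are hypothetical names invented for this illustration, and the stub page only mimics the `open(url, callback)` shape of a PhantomJS WebPage; none of this is confirmed pjscrape code.

```javascript
// Hypothetical runner: starts the suites only after preSuite signals
// completion through its callback.
function runWithPreSuite(page, preSuite, startSuites) {
    if (typeof preSuite !== 'function') {
        startSuites(page); // no hook configured: start immediately
        return;
    }
    var started = false;
    preSuite(page, function done() {
        if (started) { return; } // ignore a second call to the callback
        started = true;
        startSuites(page);
    });
}

// Stub standing in for a PhantomJS WebPage; it only records what was
// opened and reports success synchronously.
function makeStubPage() {
    return {
        opened: [],
        open: function (url, cb) {
            this.opened.push(url);
            cb('success');
        }
    };
}

var log = [];
var page = makeStubPage();

runWithPreSuite(
    page,
    function preSuite(page, callback) {
        // e.g. open a login page, submit credentials, then signal completion
        page.open('http://example.com/login', function (status) {
            log.push('login ' + status);
            callback();
        });
    },
    function startSuites(page) {
        log.push('suites started after ' + page.opened.length + ' page load(s)');
    }
);
```

Because the same page object is handed to both the hook and the suite runner, any session state attached to it would carry over, provided cookies do survive multiple open calls as suggested above.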

I exposed getPage in the pjs namespace and disabled the pjs.init() in pjscrape.

This quick hack allowed me to retrieve the shared webpage object, do some custom authentication and start the pjs.init process myself once my custom needs were met. So far, it looks like it'll do the trick for my needs.

I like the idea of something more elegant like the preSuite(page, callback) - though it is beyond me at the moment :)

donl avatar Oct 21 '12 05:10 donl
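The modified script was not posted in the thread, so the following is only a guess at the shape of the workaround described above: a toy `pjs` object that exposes `getPage` and leaves `init` to be called manually. The object's internals, the `authenticate` helper, and the cookie values are all invented for this sketch and do not reflect the real pjscrape source.

```javascript
// Toy stand-in for the pjs namespace; real pjscrape internals differ.
var pjs = (function () {
    var page = { cookies: [] }; // stub for the shared WebPage object
    var events = [];
    return {
        events: events,
        // Exposed so callers can reach the shared page before scraping.
        getPage: function () { return page; },
        // No longer called automatically; the caller decides when to start.
        init: function () {
            events.push('init with ' + page.cookies.length + ' cookie(s)');
        }
    };
}());

// Hypothetical custom authentication on the shared page.
function authenticate(page, done) {
    page.cookies.push({ name: 'session', value: 'placeholder' });
    done();
}

// Authenticate first, then start the scrape manually.
authenticate(pjs.getPage(), function () {
    pjs.init();
});
```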

@donl do you still have the modified pjscrape.js script available, and could you publish it? Thanks.

devloic avatar Jul 08 '14 12:07 devloic

Has there been any forward progress on this? I'm also interested in scraping a page that requires login credentials and would appreciate any advice or guidance!

nathanielrindlaub avatar Jan 29 '17 22:01 nathanielrindlaub