
Documentation for mixed mode crawls

Open corford opened this issue 4 years ago • 7 comments

The SDK examples for each crawler type (Basic, Cheerio, Puppeteer) assume running an entire crawl with just that crawler. It is often desirable to use a browser for specific URLs/sequences on a site (e.g. login) and then swap to a lighter approach for scraping the remaining resources (e.g. REST endpoints that require a cookie obtained after a successful login).

It would be great if the Apify SDK docs had an official example covering this use case.
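
To make that concrete, the kind of flow I have in mind looks roughly like the sketch below. This isn't code I'm actually running; the URLs, selectors and the prepareRequestFunction hook are just placeholders to illustrate the browser-to-HTTP handoff:

```js
const Apify = require('apify');

Apify.main(async () => {
    // Heavy step: use a real browser only for the login sequence.
    const browser = await Apify.launchPuppeteer();
    const page = await browser.newPage();
    await page.goto('https://example.com/login'); // placeholder URL
    // ... fill in credentials, pass any client-side checks ...
    const cookies = await page.cookies();
    await browser.close();

    // Light step: crawl the remaining resources over plain HTTP, reusing the session cookie.
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com/account/orders' }); // placeholder

    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        // Attach the cookie obtained by the browser to every outgoing request.
        prepareRequestFunction: ({ request }) => {
            request.headers = {
                ...request.headers,
                Cookie: cookies.map((c) => `${c.name}=${c.value}`).join('; '),
            };
            return request;
        },
        handlePageFunction: async ({ request, $ }) => {
            await Apify.pushData({ url: request.url, title: $('title').text() });
        },
    });
    await crawler.run();
});
```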

corford avatar Dec 10 '19 11:12 corford

Yeah, that certainly makes sense. @metalwarrior665 would you by chance have any code that we could quickly turn into an example?

mnmkng avatar Dec 10 '19 22:12 mnmkng

I was also searching for some documentation about mixing crawlers for my use case:

Is it possible to use Apify to:

  1. Grab a static list of URLs to crawl => I guess a RequestList?
  2. Crawl them all concurrently with PuppeteerCrawler?
  3. For each URL, I sometimes have a login to handle => how do I persist the auth key or cookie?
  4. For each URL, I also have multiple "categories" to crawl (only as a logged-in user) => use a RequestQueue?
  5. For each URL/category, how do I pass the htmlSelectors to the crawler.pageFunction?

How do I mix all that together in a backend service class that returns the collected data? It sounds similar to @corford's use case.

Sharlaan avatar Dec 12 '19 08:12 Sharlaan

@Sharlaan I feel you don't necessarily need the approach of mixing Crawler types and passing data from one to another, although you could use CheerioCrawler for the non-login pages.

To answer your questions (probably better as a standalone issue):

  1. Yes, that is a job for a RequestList.
  2. Sure, or with CheerioCrawler.
  3. You can store them in a global variable and persist them to the key-value store.
  4. Yes, you can use both RequestList and RequestQueue in the same Crawler and it works as you would expect (see the sketch after this list).
  5. You don't need to pass them in any way, you just make a scraping function where they are hardcoded. Of course, you can read the selectors from the input and pass them around but we almost never do that (since changing code is fast anyway).
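
Roughly, points 1-4 could fit together like the untested sketch below (the URLs, selectors and the SESSION-COOKIES key are made up, and a real run would need proper error handling):

```js
const Apify = require('apify');

Apify.main(async () => {
    // 1. Static start URLs -> RequestList.
    const requestList = await Apify.openRequestList('start-urls', [
        { url: 'https://example.com/shop-a' }, // placeholder URLs
        { url: 'https://example.com/shop-b' },
    ]);
    // 4. Category pages discovered during the crawl -> RequestQueue.
    const requestQueue = await Apify.openRequestQueue();

    // 2. A single PuppeteerCrawler consumes both sources concurrently.
    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        requestQueue,
        handlePageFunction: async ({ page, request }) => {
            if (request.userData.label === 'CATEGORY') {
                // 5. Selectors are simply hardcoded in the scraping code.
                // (In a real run you would also restore the saved cookies before
                // navigation, e.g. in a custom gotoFunction.)
                const title = await page.title();
                await Apify.pushData({ url: request.url, title });
                return;
            }
            // 3. Log in once and persist the cookies to the key-value store
            //    so they survive a restart or migration.
            let cookies = await Apify.getValue('SESSION-COOKIES');
            if (!cookies) {
                // ... perform the login steps with `page` here ...
                cookies = await page.cookies();
                await Apify.setValue('SESSION-COOKIES', cookies);
            }
            // Enqueue the category links found on the start page.
            const urls = await page.$$eval('a.category', (links) => links.map((a) => a.href));
            for (const url of urls) {
                await requestQueue.addRequest({ url, userData: { label: 'CATEGORY' } });
            }
        },
    });

    await crawler.run();
});
```
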

If you have more users to log in, I recommend doing a separate run for each of them. Since you log in for the whole browser, it gets messy if you want to have multiple users concurrently in the same Crawler (can be done but it is complicated).

@mnmkng @corford I would rather just come up with simple examples than review some dirty old code :)

One idea though: When we run such use cases, where we pass data through multiple crawlers on the Apify platform, we use multiple actors and store the intermediate results. I feel that it is a cleaner approach. But without the Apify platform you don't have this luxury :)

I would assume you want to run them sequentially. Running Crawlers in parallel is possible but it will be a mess in the logs. There is no way to simply use just one RequestQueue since it will always want to go through all Requests.

So you can either create a global array (you have to persist the array for migrations if running on the Apify platform) where you store the URLs gathered from the first crawler, and then create a new RequestList for the next crawler. The second option is to use a separate RequestQueue for each Crawler, but the pain is that you have to create temporary named ones and delete them afterward (not a huge deal but annoying).
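
A skeleton of the first option might look like this (again untested, with placeholder URLs and selectors):

```js
const Apify = require('apify');

Apify.main(async () => {
    const collectedUrls = [];

    // Crawler 1: browser-based, only gathers the URLs we actually want to scrape.
    const discoveryList = await Apify.openRequestList('discovery', [
        { url: 'https://example.com/categories' }, // placeholder
    ]);
    const discoveryCrawler = new Apify.PuppeteerCrawler({
        requestList: discoveryList,
        handlePageFunction: async ({ page }) => {
            const urls = await page.$$eval('a.product', (links) => links.map((a) => a.href));
            collectedUrls.push(...urls);
        },
    });
    await discoveryCrawler.run();

    // Persist the intermediate result so a restarted run does not lose it.
    await Apify.setValue('COLLECTED-URLS', collectedUrls);

    // Crawler 2: plain HTTP, consumes a fresh RequestList built from the array.
    const scrapeList = await Apify.openRequestList(
        'scrape-urls',
        collectedUrls.map((url) => ({ url })),
    );
    const scrapeCrawler = new Apify.CheerioCrawler({
        requestList: scrapeList,
        handlePageFunction: async ({ $, request }) => {
            await Apify.pushData({ url: request.url, title: $('title').text() });
        },
    });
    await scrapeCrawler.run();
});
```

The key-value store record is only there so that a resurrected run can rebuild the list instead of repeating the browser crawl.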

metalwarrior665 avatar Dec 12 '19 17:12 metalwarrior665

@metalwarrior665 Thanks for the thoughtful reply. The example I was envisaging was more around how to correctly manage state and interaction between the two crawlers without involving a dependency on Actors/Apify platform.

The new Session and SessionPool concepts feel like they might help with this?

In terms of a concrete example: imagine a site that has REST endpoints for the data you want but to get to these you need an authorised session cookie. This cookie is only set after successful login and passing a client side javascript anti-bot check. In this case, I'd want to do the login and JS check with puppeteer (out of band from the main RequestQueue) and then I'd like to crawl the REST endpoints with the newly obtained cookie using Cheerio (leveraging the RequestQueue to direct the crawl e.g. start with a category endpoint, enumerate all sub-category urls, throw them on the queue etc. until I get down to individual product endpoints). If something goes wrong during the crawl (e.g. session expires and needs re-authing), I'd like to call back out to Puppeteer to get that done and then resume crawling with Cheerio.

There are a hundred different ways to go about the above but it would be very helpful to have a solid example from Apify demonstrating some best practices around the core parts/glue (e.g. mechanism to transfer cookies between crawlers; controlling/synchronising which crawler does what - and when; running two RequestQueues if necessary; elegant non-hacky way to "call out" to Puppeteer from the main Cheerio crawl; pitfalls to avoid; caveats etc.).

In Scrapy land you might do something like use a downloader middleware that calls out to a custom backend (that itself uses requests lib to scrape the REST endpoints and Selenium+chrome for the login dance). With Apify and its first class support for puppeteer, it feels like the whole thing could be accomplished without needing to leave the framework.

corford avatar Dec 12 '19 19:12 corford

@corford So for this case, I would personally use just a single BasicCrawler, which gets rid of all the problems with synchronizing 2 crawlers.

For a Cheerio-type request, you simply add 2 lines of code (maybe a few more for error handling):

  1. requestAsBrowser call
  2. require standalone Cheerio for parsing

For Puppeteer you have 2 options:

  1. Simple but less performant - open a separate browser with Apify.launchPuppeteer for each page. This makes sure cookies and login state stay distinct for each browser.
  2. More performant, since a page is cheaper than a browser - use a pool of browsers/pages; you can try PuppeteerPool for this. Don't forget that the struggle with Puppeteer is that the proxy and cookies are browser-based, not page-based. Option 1 solves this easily.

Also, don't forget to clean up any open browsers.
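
Put together, an untested sketch of that single-BasicCrawler setup could look like this (placeholder URLs, selectors and key names; real error handling omitted):

```js
const Apify = require('apify');
const cheerio = require('cheerio');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({
        url: 'https://example.com/login', // placeholder URL
        userData: { label: 'LOGIN' },
    });

    const crawler = new Apify.BasicCrawler({
        requestQueue,
        handleRequestFunction: async ({ request }) => {
            if (request.userData.label === 'LOGIN') {
                // Browser-based step: a throwaway browser just for this request (option 1).
                const browser = await Apify.launchPuppeteer();
                try {
                    const page = await browser.newPage();
                    await page.goto(request.url);
                    // ... fill the form, pass the JS check, etc. ...
                    const cookies = await page.cookies();
                    await Apify.setValue('SESSION-COOKIES', cookies);
                } finally {
                    await browser.close(); // always clean up the browser
                }
                // Enqueue the plain-HTTP requests that need the session.
                await requestQueue.addRequest({
                    url: 'https://example.com/api/categories', // placeholder
                    userData: { label: 'API' },
                });
                return;
            }

            // Cheerio-type step: requestAsBrowser + standalone cheerio.
            const cookies = (await Apify.getValue('SESSION-COOKIES')) || [];
            const { body } = await Apify.utils.requestAsBrowser({
                url: request.url,
                headers: { Cookie: cookies.map((c) => `${c.name}=${c.value}`).join('; ') },
            });
            // Parse HTML responses with the standalone cheerio package
            // (a JSON endpoint would just be JSON.parse(body) instead).
            const $ = cheerio.load(body);
            await Apify.pushData({ url: request.url, title: $('title').text() });
        },
    });

    await crawler.run();
});
```

The nice part is that there is only one RequestQueue and one crawler to reason about.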

metalwarrior665 avatar Dec 16 '19 10:12 metalwarrior665

@metalwarrior665 sadly requestAsBrowser wouldn't be viable for the use case I presented (JS fingerprinting payload that needs to be executed and the results POSTed back to the origin in order to acquire a session token for access to REST endpoints).

Invoking a browser from within a Basic (or Cheerio) crawler by using Apify.launchPuppeteer looks closer to a solution. I'll investigate that.

corford avatar Jan 02 '20 10:01 corford

I meant that you can mix those solutions as you need. You can launchPuppeteer, do the auth, save the cookies, close the browser, and then use them with requestAsBrowser.
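
Roughly like this (just a sketch with made-up URLs, selectors and credentials):

```js
const Apify = require('apify');

Apify.main(async () => {
    // 1. Browser-based auth, done once up front.
    const browser = await Apify.launchPuppeteer();
    const page = await browser.newPage();
    await page.goto('https://example.com/login'); // placeholder URL
    await page.type('#username', 'my-user');      // placeholder selectors + credentials
    await page.type('#password', 'my-pass');
    await Promise.all([page.waitForNavigation(), page.click('#submit')]);
    const cookies = await page.cookies();
    await browser.close();

    // 2. Reuse the cookies for cheap plain-HTTP requests.
    const cookieHeader = cookies.map((c) => `${c.name}=${c.value}`).join('; ');
    const { body } = await Apify.utils.requestAsBrowser({
        url: 'https://example.com/api/products', // placeholder endpoint
        headers: { Cookie: cookieHeader },
    });
    console.log(body.slice(0, 200));
});
```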

metalwarrior665 avatar Jan 06 '20 11:01 metalwarrior665