
Can we promisify js-crawler

Open shekarls opened this issue 8 years ago • 2 comments

Can you help me show how we can promisify the js-crawler code below? Is there a better way to return the response from each crawler state?

function runCrawler(url) {
    var crawler = new Crawler().configure({ignoreRelative: false, depth: 2});
    crawler.crawl({
        url: url,
        success: function (page) {
            console.log(page.url + ' --- ' + page.status);
        },
        failure: function (page) {
            console.log(page.url + ' --- ' + page.status);
        },
        finished: function (crawledUrls) {
            console.log('COMPLETED***********');
        }
    });
}

shekarls avatar Aug 09 '16 23:08 shekarls

Hi,

Actually, the result of invoking crawl is not a Promise but more naturally an Observable (http://reactivex.io/documentation/observable.html); it even has a very similar API. Maybe we can think about making it an actual Observable in later releases.

A Promise resolves to a single value, while the crawler produces a series of values whose length we do not know upfront, so we cannot simply return a Promise from the crawl method.

I agree that it would be nice to think about some alternatives to callback-based API.
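To make the Observable comparison concrete, here is a minimal, dependency-free sketch of wrapping a crawl-style callback API in an Observable-like object with subscribe/next/complete. `fakeCrawl` is purely illustrative and stands in for `crawler.crawl`; it is not js-crawler's real implementation.

```javascript
// Illustrative stand-in for crawler.crawl: fires success once,
// failure once, then finished with the list of crawled URLs.
function fakeCrawl(options) {
  options.success({ url: options.url + '/a', status: 200 });
  options.failure({ url: options.url + '/b', status: 404 });
  options.finished([options.url + '/a', options.url + '/b']);
}

// Adapter: expose the callback API as an Observable-like object.
function crawlObservable(url) {
  return {
    subscribe: function (observer) {
      fakeCrawl({
        url: url,
        success: function (page) { observer.next(page); },
        failure: function (page) { observer.next(page); },
        finished: function (urls) { observer.complete(urls); }
      });
    }
  };
}

var seen = [];
crawlObservable('https://example.com').subscribe({
  next: function (page) { seen.push(page.status); },
  complete: function (urls) { console.log(seen.length, urls.length); }
});
```

With a real Observable library (e.g. RxJS) the same adapter shape would let you use operators like filter and map over crawled pages.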

amoilanen avatar Aug 10 '16 21:08 amoilanen

I don't think promisifying the crawler API itself really makes sense, since it is a many-result, not a single-result, system; it is better suited to the event listener system currently in place. (Re: antivanov's comment.)

What I did to adapt it to my promise-based system, for those interested:

function startCrawl(url) {
  // Assumes `crawler` is a configured Crawler instance in scope,
  // e.g. var crawler = new Crawler().configure({depth: 2});
  return new Promise(function(resolve) {
    // Create new results object
    let results = {};

    crawler.crawl({
      url: url,
      success: function(page) {
        // Do any actions you wanted with page, log to console if you don't care about output order, etc.
        // Add whatever you want to keep to your results object
      },
      failure: function(page) {
        // Do any actions you wanted with page, log to console if you don't care about output order, etc.
        // Add whatever you want to keep to your results object
      },
      finished: function(crawledUrls) {
        // Do any actions you wanted with list of crawled urls, log to console if you want, etc.
        // Add whatever you want to keep to your results object
        resolve(results);
      }
    });
  });
}

Basically, the "results" object would be used to contain anything you want to pass back after the promise resolves, and the promise resolves after the crawling completes. You could add a reject() to handle errors as well, if you want the promise to fail in certain cases.
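As a hedged sketch of the reject() variant: the wrapper below rejects the promise as soon as any page fails, and resolves with the collected results otherwise. `fakeCrawl` is a hypothetical stand-in for `crawler.crawl` so the example is self-contained; swap in the real crawler in practice.

```javascript
// Illustrative stand-in for crawler.crawl: one successful page,
// no failures, then finished.
function fakeCrawl(options) {
  options.success({ url: options.url, status: 200 });
  options.finished([options.url]);
}

function startCrawlStrict(url) {
  return new Promise(function (resolve, reject) {
    var results = { ok: [], failed: [] };
    fakeCrawl({
      url: url,
      success: function (page) { results.ok.push(page.url); },
      failure: function (page) {
        results.failed.push(page.url);
        // Fail fast: the first reject() settles the promise;
        // a later resolve() in finished is then a no-op.
        reject(new Error('Failed to fetch ' + page.url));
      },
      finished: function () { resolve(results); }
    });
  });
}
```

Because a promise can only settle once, calling reject() in the failure callback and resolve() in finished is safe: whichever fires first wins.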

To use this, it is as simple as any other promise/thenable function:

startCrawl("https://example.com").then(function(results) {
  // Do something with results object here...
});

jankcat avatar Feb 15 '17 16:02 jankcat