js-crawler
Can we promisify js-crawler?
Can you help me see whether we can promisify the js-crawler usage below? Is there a better way to return the response from each crawler state?
var Crawler = require("js-crawler");

function runCrawler(url) {
  var crawler = new Crawler().configure({ignoreRelative: false, depth: 2});
  crawler.crawl({
    url: url,
    success: function (page) {
      console.log(page.url + ' --- ' + page.status);
    },
    failure: function (page) {
      console.log(page.url + ' --- ' + page.status);
    },
    // Called once, after all pages have been crawled
    finished: function (crawledUrls) {
      console.log('COMPLETED***********');
    }
  });
}
Hi,
Actually, the result of invoking crawl is not a Promise but a natural Observable (http://reactivex.io/documentation/observable.html). It even has a very similar API, so maybe we can think about making it an actual Observable in a later release.
A Promise resolves to a single value, whereas the crawler produces a series of values whose length we do not know up front, so we cannot just return a Promise from the crawl method.
I agree that it would be nice to think about some alternatives to callback-based API.
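To illustrate the point about a series of values, here is a minimal sketch of how the existing callbacks could be adapted to an observable-like interface. The helper name crawlAsObservable and the observer shape (next, complete) are assumptions made for illustration; they are not part of js-crawler and this is not an RxJS Observable. The crawler is passed in as a parameter, so anything with a compatible crawl(options) method will do.

```javascript
// Hypothetical helper, not part of js-crawler: exposes the callback-based
// crawl() as a tiny observable-like object with a subscribe() method.
function crawlAsObservable(crawler, url) {
  return {
    subscribe: function (observer) {
      crawler.crawl({
        url: url,
        // Every successfully crawled page is one value in the stream.
        success: function (page) {
          observer.next({ ok: true, page: page });
        },
        // Assumption: a failed page is also delivered as a value (ok: false)
        // rather than ending the stream, since crawling continues after it.
        failure: function (page) {
          observer.next({ ok: false, page: page });
        },
        // The stream completes exactly once, when the crawl is finished.
        finished: function (crawledUrls) {
          observer.complete(crawledUrls);
        }
      });
    }
  };
}
```

With a configured crawler, a caller would pass an object with next and complete functions to subscribe() and receive pages one at a time, which is something a single Promise cannot express.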
I don't think promisifying the crawler API itself really makes sense, since it is a many-result system, not a single-result one. It is better suited to the callback-based listener system currently in place. (Re: antivanov's comment.)
What I did to adapt it to my promise-based system, for those interested:
// Assumes `crawler` is an already configured js-crawler instance in scope
function startCrawl(url) {
  return new Promise(function(resolve) {
    // Create new results object
    let results = {};
    crawler.crawl({
      url: url,
      success: function(page) {
        // Do any actions you wanted with page, log to console if you don't care about output order, etc.
        // Add whatever you want to keep to your results object
      },
      failure: function(page) {
        // Do any actions you wanted with page, log to console if you don't care about output order, etc.
        // Add whatever you want to keep to your results object
      },
      finished: function(crawledUrls) {
        // Do any actions you wanted with list of crawled urls, log to console if you want, etc.
        // Add whatever you want to keep to your results object
        resolve(results);
      }
    });
  });
}
Basically, the "results" object would be used to contain anything you want to pass back after the promise resolves, and the promise resolves after the crawling completes. You could add a reject() to handle errors as well, if you want the promise to fail in certain cases.
To use this, it is as simple as any other promise/thenable function:
startCrawl("https://example.com").then(function(results) {
  // Do something with results object here...
});
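The remark above about adding a reject() could look roughly like the following sketch. The function name startCrawlStrict, the shape of the results object, and the policy of rejecting when any page failed are all assumptions chosen for illustration; pick whatever failure condition fits your case. The crawler is passed in as an argument here instead of being read from an outer scope.

```javascript
// Hypothetical variant of startCrawl: collects per-page results and rejects
// if at least one page failed. The rejection policy is an assumption.
function startCrawlStrict(crawler, url) {
  return new Promise(function (resolve, reject) {
    var results = { pages: [], failures: [], crawledUrls: [] };
    crawler.crawl({
      url: url,
      success: function (page) {
        // Keep only the fields used elsewhere in this thread.
        results.pages.push({ url: page.url, status: page.status });
      },
      failure: function (page) {
        results.failures.push({ url: page.url, status: page.status });
      },
      finished: function (crawledUrls) {
        results.crawledUrls = crawledUrls;
        if (results.failures.length > 0) {
          // Attach the partial results so the caller can still inspect them.
          var err = new Error(results.failures.length + " page(s) failed to crawl");
          err.results = results;
          reject(err);
        } else {
          resolve(results);
        }
      }
    });
  });
}
```

A caller would then handle both outcomes, for example with startCrawlStrict(crawler, "https://example.com").then(onDone, onError), and could read err.results in the error handler to see which pages failed.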