
Script doesn't exit when using RedisUrlList

Open simoncpu opened this issue 8 years ago • 4 comments

Problem: Script doesn't exit when using RedisUrlList.

Steps to reproduce: run the following code:

'use strict';

const supercrawler = require('supercrawler');
const crawler = new supercrawler.Crawler({
    urlList: new supercrawler.RedisUrlList({
        redis: {
            host: 'redis-server.example.org'
        }
    })
});

console.log('Script should exit after this.');

Expected behavior: Script should stop after running.

Actual behavior: Script runs indefinitely.

Workaround: Call process.exit() to terminate the script.

BTW, I'm using AWS ElastiCache for Redis, just in case this detail is needed. :)

simoncpu avatar Oct 09 '17 15:10 simoncpu

If I understand correctly, this is by design: the process waits until further URLs are available for crawling. There may be no crawlable URLs in the queue for three reasons:

(a) The queue is empty, in which case it waits until a URL is added. In a distributed set-up, this could be added by another script/tool.

(b) The queue only has URLs that errored and are waiting for a retry. Failed URLs are retried using exponential backoff.

(c) Even successful URLs will be recrawled after 30 days (configurable with expiryTimeMs).
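The retry schedule in (b) can be sketched as a simple exponential-backoff calculation. The base delay, multiplier, and cap below are illustrative assumptions, not supercrawler's actual internal values:

```javascript
// Illustrative exponential backoff: the delay doubles with each failed
// attempt, up to a cap. baseDelayMs and maxDelayMs are assumed values,
// not supercrawler's internals.
function retryDelayMs(attempt, baseDelayMs = 60 * 1000, maxDelayMs = 24 * 60 * 60 * 1000) {
  // attempt 1 -> base, attempt 2 -> 2x base, attempt 3 -> 4x base, ...
  const delay = baseDelayMs * Math.pow(2, attempt - 1);
  return Math.min(delay, maxDelayMs);
}

console.log(retryDelayMs(1)); // 60000 (1 minute)
console.log(retryDelayMs(3)); // 240000 (4 minutes)
```

With a schedule like this, a URL that keeps failing stays in the queue for a long time, which is why the queue can be "non-empty" yet have nothing ready to crawl.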

Since the crawl will never end, I would expect the process to continue.

You can listen to the urllistempty event to detect when the queue is empty and call crawler.stop(). This should stop the script once the currently-crawled URLs are finished.
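A minimal sketch of that pattern, reusing the crawler setup from the report above (the Redis host is the same placeholder; this needs a reachable Redis server to actually run):

```javascript
'use strict';

const supercrawler = require('supercrawler');

const crawler = new supercrawler.Crawler({
    urlList: new supercrawler.RedisUrlList({
        redis: {
            host: 'redis-server.example.org' // placeholder host, as in the report
        }
    })
});

// When the queue drains, stop scheduling new requests. URLs that are
// already being crawled finish first, then the crawler goes idle.
crawler.on('urllistempty', () => {
    console.log('URL list is empty; stopping crawler.');
    crawler.stop();
});

crawler.start();
```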

brendonboshell avatar Oct 09 '17 16:10 brendonboshell

On second thoughts, if you have not called start(), it should exit. This is probably because we do not disconnect from redis. I will take a look at this.

brendonboshell avatar Oct 09 '17 16:10 brendonboshell

Yepp, I've tried disabling keepAlive in ioredis and calling crawler.stop() without calling crawler.start(), and the script still doesn't exit. Our use case is for a separate script (preferably in AWS Lambda*) to listen for new URLs and push them to supercrawler via crawler.getUrlList().insertIfNotExists().
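A sketch of that producer-script use case, combined with the process.exit() workaround from this thread (the Redis host and URL are placeholders; a reachable Redis server is assumed):

```javascript
'use strict';

const supercrawler = require('supercrawler');

const crawler = new supercrawler.Crawler({
    urlList: new supercrawler.RedisUrlList({
        redis: {
            host: 'redis-server.example.org' // placeholder host
        }
    })
});

// Push a URL into the shared Redis-backed queue from a separate producer
// script; a crawler process running elsewhere picks it up. Note start()
// is never called here.
crawler.getUrlList()
    .insertIfNotExists(new supercrawler.Url('https://example.org/'))
    .then(() => {
        // Workaround discussed in this thread: the Redis connection keeps
        // the event loop alive, so exit explicitly once the insert resolves.
        process.exit(0);
    });
```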

* I initially assumed that this timed out in Lambda due to Bluebird, but it turned out to be caused by Redis.

simoncpu avatar Oct 09 '17 16:10 simoncpu

Ah... process.exit() has also been recommended by the guys at ioredis. I guess we'll just use this workaround. :)

simoncpu avatar Oct 09 '17 16:10 simoncpu