bull
Possible to kill a failed job thread?
I have a task that takes a long time to complete (up to 5 minutes) and deals with streaming a large amount of data to and from the network.
When an error is emitted from my sandboxed job, bull immediately moves the job to the failed set. However, some async tasks continue to run, which needlessly consumes computing resources since I don't care about the result of a failed job.
Is there a way to just kill the failed job thread? I'm not interested in gracefully stopping the failed job, I just want it to stop using resources immediately.
It is not possible yet, but the implementation of such a feature for sandboxed processors is pretty easy. I will mark this issue as an enhancement.
I am thinking about two alternatives: 1) killing the process as soon as the job fails, or 2) adding an extra method on the job object that is passed to the processor, job.kill(), which you would need to call explicitly to kill the process. The advantage of 1) is that it will work even in cases you have not considered in your code; however, I am still not sure whether it might be undesirable behaviour in some cases.
Hmm, actually you should already be able to use process.exit([code]) (https://nodejs.org/api/process.html#process_process_exit_code), so the question is whether point 1 is really a good thing to have or not; it could be an optional setting.
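For reference, a minimal sketch of what that process.exit() suggestion looks like inside a sandboxed processor file; the file name and the streamLotsOfData helper are placeholders, not part of bull:

```js
// Hypothetical sandboxed processor, e.g. ./processors/heavy-job.js.
// streamLotsOfData is a placeholder for the long-running streaming work.
module.exports = async function(job) {
  try {
    return await streamLotsOfData(job.data);
  } catch (err) {
    // Abandon any in-flight async work by terminating the child process.
    // A non-zero exit code signals abnormal termination.
    console.error('Job failed, terminating child process', err);
    process.exit(1);
  }
};
```

As the rest of the thread shows, exiting the child has side effects on bull's child-process bookkeeping, so this only illustrates the raw idea.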
I experimented a bit with process.exit and, after reading deeper into the code, I don't think it's a solution to my issue. When a retained child process exits, bull removes it from the set of retained/free child processes, but it doesn't handle re-creating a child process for future use. Subsequent jobs therefore never get a live child process to run on, and my intensive job only ever runs once. This happens whenever you call process.exit() within the process function, regardless of whether the job fails or succeeds.
IMO the best solution here is to treat child processes like a generic pool, which also acts as a lock/concurrency manager. The benefit of using the { create, destroy, validate } interface for managing a pool of child processes is that you can write a validate function that fails when it detects the child process is dead, and re-create a child process to add to the pool before calling any methods. This ensures there is always a valid resource available to use, and gives users the freedom to kill their jobs as they see fit without risk of a memory leak.
@manast What do you think? If that sounds like an acceptable solution I can attempt a PR to use node-pool to pool child processes rather than the current homemade { retain, release } API.
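A rough sketch of that idea using the generic-pool (node-pool) package; the option values, processor path, and sendJobToChild helper are illustrative assumptions, not bull's actual implementation:

```js
const childProcess = require('child_process');
const genericPool = require('generic-pool');

// Factory implementing the { create, destroy, validate } interface for
// forked child processes running a sandboxed processor.
const factory = {
  create: () => Promise.resolve(childProcess.fork('./processors/heavy-job.js')),
  destroy: child => {
    child.kill();
    return Promise.resolve();
  },
  // Report the child as invalid if it has already exited or lost its IPC
  // channel, so the pool replaces it before handing it out again.
  validate: child => Promise.resolve(child.connected && child.exitCode === null)
};

const pool = genericPool.createPool(factory, {
  max: 4,            // acts as the concurrency limit
  testOnBorrow: true // run validate() on every acquire()
});

// Usage sketch: acquire a live child, run the job, always release.
async function runJob(job) {
  const child = await pool.acquire();
  try {
    return await sendJobToChild(child, job); // placeholder for the IPC call
  } finally {
    pool.release(child);
  }
}
```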
@leontastic the child-pool in bull is just a cache for child processes; it does not care about how many child processes you are using. Therefore, if a child process exits, the next time bull tries to run a job it will create a new child process. Or at least that is the intent; if you can write a test case that demonstrates this is not the case, I can take a look into it.
@manast Here's the failing test case: https://github.com/OptimalBits/bull/pull/750
Depends on https://github.com/OptimalBits/bull/pull/749
Was just doing some research along these lines and writing a very similar test to #750 to verify.
```js
// In test_sandboxed_process.js
it('should continue running jobs after exited process', function(done) {
  let count = 0;
  queue.on('completed', function() {
    count++;
    if (count === 2) {
      done();
    }
  });
  queue.add({ foo: 'bar' });
  queue.add({ foo: 'baz' });
  queue.process(__dirname + '/fixtures/fixture_processor_exit.js');
});
// Test runs the first job, then fails and times out after 5 seconds.
```
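The fixture_processor_exit.js file isn't shown in the thread; presumably it is a processor that exits the child after completing, something along these lines (a guess, not the actual fixture):

```js
// fixtures/fixture_processor_exit.js (assumed content, for illustration only)
module.exports = function(job) {
  return new Promise(resolve => {
    resolve(42);
    // Kill the child process shortly after resolving, so the test can check
    // whether bull recovers and still runs the second job.
    setTimeout(() => process.exit(0), 50);
  });
};
```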
I think the reason it fails is that the promise in sandbox.js never resolves or rejects if a process calls process.exit([code]); it only resolves based on messaging.
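If that is the cause, one general way to make the promise settle would be to also listen for the child's 'exit' event. This is just a sketch of the pattern, not bull's actual sandbox.js code, and the message shape is assumed:

```js
// Sketch: settle the job promise when the child dies, not only on IPC messages.
function runJobInChild(child, job) {
  return new Promise((resolve, reject) => {
    const onExit = (code, signal) => {
      reject(new Error(`Child exited with code ${code}, signal ${signal}`));
    };
    child.once('exit', onExit);
    child.once('message', msg => {
      child.removeListener('exit', onExit);
      msg.error ? reject(new Error(msg.error)) : resolve(msg.result);
    });
    child.send({ cmd: 'start', job: { id: job.id, data: job.data } });
  });
}
```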
I see the #750 pull request is closed, but it would certainly fix some issues I am having.
I like the option with killOnComplete for better sandboxing as well, as I have been bitten a bit by leaky processes (e.g. scripts using a cookie jar in the request library).
Can someone suggest how to kill the child process after the job has timed out? I know we have process.exit(code), but it needs to be called explicitly, and I can't call that function from a stalled job. In my case it is nearly impossible to detect what causes the stalled jobs, and they take 5-10 minutes to complete.
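One possible workaround (not a bull feature) is to enforce the timeout inside the sandboxed processor itself, so the child process kills itself if the work hangs; TIMEOUT_MS and doWork below are placeholders:

```js
// Hypothetical self-terminating sandboxed processor.
const TIMEOUT_MS = 10 * 60 * 1000; // assumed upper bound for a healthy job

module.exports = async function(job) {
  const timer = setTimeout(() => {
    // The work has hung; kill this child process so it stops consuming
    // resources. Bull will see the job as failed or stalled.
    console.error(`Job ${job.id} exceeded ${TIMEOUT_MS} ms, exiting child`);
    process.exit(1);
  }, TIMEOUT_MS);

  try {
    return await doWork(job); // placeholder for the real job logic
  } finally {
    clearTimeout(timer);
  }
};
```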