
High CPU in loops

tony-schumacher opened this issue 7 years ago • 5 comments

I am using webshot to create a lot of screenshots from different URLs in a loop, calling it sequentially, one URL after another. The problem is that it only works for around 5-6 users at once before the node server freezes.

I noticed that webshot is very CPU intensive and will easily consume 100% of a small EC2 instance.

Is there a way to make it more CPU friendly, or maybe to reuse the same phantom process for every URL?

Thanks a lot!

tony-schumacher avatar Nov 07 '16 23:11 tony-schumacher

I'm having a similar problem. Have you found a solution?

oraneedwards avatar Dec 10 '16 22:12 oraneedwards

I switched to PhantomJS and built the tool on my own.

tony-schumacher avatar Dec 11 '16 10:12 tony-schumacher

@TonySchu are you able to share that solution?

KClough avatar Mar 16 '17 23:03 KClough

@KClough I used the npm phantom package. But you will run into the same problem, so you need to build the loop on your own.

Here is what I am doing:

  1. Create a phantom instance
  2. Create a browser tab
  3. Open your URL
  4. Wait
  5. Get the HTML content
  6. Get the screenshot
  7. Close the browser tab (not phantom), then go back to step 2 and repeat
     - if the loop count is > 10, close phantom and go back to step 1
     - on error, try to kill the process using the pid from step 1 and start again from step 1

Problems: if you use just one instance, you can't run it for 500 URLs or so... Phantom is a little buggy, so I close the instance from step 1 after 10 loops and start fresh. Some other people are doing it this way too, and it works really well.

Also make sure to store the pid of each phantom instance in an array, so you can destroy the process if there is a bug. Otherwise you will get a memory leak pretty fast.

A little ugly, but I hope it helps (don't mind 'webshot' in my code... I just did not refactor it):

```js
var phantom = require('phantom');
// 'app' is an existing Express application

// api for the frontend app
app.post('/api/webshot', function (req, res) {
    var crawlStatus = {index: 0, max: req.body.length};
    initPhantom(req.body, crawlStatus);
    res.send("image crawler is running");
});

// create browser instance
function initPhantom(todos, crawlStatus) {
    phantom.create(['--ignore-ssl-errors=no', '--load-images=true'], {logLevel: 'error'})
        .then(function (instance) {
            console.log("===================> instance: ", instance.process.pid);
            phantomChildren.push(instance.process.pid);
            webshot(0, todos, instance.process.pid, instance, crawlStatus);
        }).catch(function (e) {
            console.log('Error in initPhantom', e);
            errorCounts.push(e);
            totalErrors.push(e);
            killProcesses();
        });
}

// create a tab in the browser and take a screenshot
function webshot(id, shots, processId, phInstance, crawlStatus) {
    // avoid excessive RAM use and memory leaks in phantom: recycle the instance every 10 pages
    if (id >= 10) {
        phInstance.exit();
        restartIfError(id, shots, null, crawlStatus)
    } else {
        phInstance.createPage().then(function (page) {
            page.property('viewportSize', {width: 1024, height: 768});
            page.setting("resourceTimeout", 7000);
            return page.open(shots[id].url)
                .then(function (status) {
                    setTimeout(function () {
                        // get the page's HTML content
                        var content = page.property('content');
                        return content
                            .then(function (content) {
                                // take the screenshot
                                console.log("render %s / %s", id + 1, shots.length, "processId:", processId);
                                crawlStatus.index += 1;
                                var image = 'temp_img/' + shots[id]._id + '.png';
                                page.render(image, {format: 'png', quality: '30'})
                                    .then(function () {
                                        page.close();
                                        makeImageFromUrl(shots[id], image, content, crawlStatus);
                                        if (id < shots.length - 1) {
                                            id += 1;
                                            webshot(id, shots, processId, phInstance, crawlStatus);
                                        } else {
                                            console.log("===================> all done: %s files has been written", shots.length, "processId:", processId, "user:", shots[id].user);
                                            phInstance.exit();
                                        }
                                    }).catch(function (e) {
                                        console.log("error while rendering - processId:", processId, e);
                                        restartIfError(id, shots, processId, crawlStatus);
                                    });
                            });
                    }, 5000);
                }).catch(function (e) {
                    console.log("error in page.open", e);
                    restartIfError(id, shots, processId, crawlStatus);
                })
        });
    }
}

function restartIfError(id, shots, p_id, crawlStatus) {
    if (p_id) {
        try {
            console.log("try to kill: ", p_id);
            process.kill(p_id);
        } catch (err) {
            // process already exited
        }
    }
    console.log("Restart webshot");
    shots = shots.slice(id);
    initPhantom(shots, crawlStatus);
}
```
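
Note: the snippet above references a few globals and helpers that aren't shown (`phantomChildren`, `errorCounts`, `totalErrors`, `killProcesses`, `makeImageFromUrl`). Below is a minimal sketch of what the PID bookkeeping and cleanup could look like; the names come from the code above, but the bodies are assumptions, and `makeImageFromUrl` would be whatever application-specific step stores the finished screenshot:

```js
// PIDs of every phantom process spawned so far, plus error bookkeeping
// (assumed definitions; not part of the original post)
var phantomChildren = [];
var errorCounts = [];
var totalErrors = [];

// Kill every phantom process we know about, ignoring ones that already exited.
function killProcesses() {
    phantomChildren.forEach(function (pid) {
        try {
            process.kill(pid);
        } catch (err) {
            // process already gone
        }
    });
    phantomChildren = [];
}
```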

tony-schumacher avatar Mar 17 '17 07:03 tony-schumacher

Interesting, I've had this problem on another web-scraper-based project. I'm glad to hear it's not just me seeing memory leaks.

Thanks for this code. This is very helpful.

KClough avatar Mar 17 '17 15:03 KClough