Serious Performance Issue With Pipeline
I've been banging my head against this for the past few days after we got some pretty bad performance results following a recent load test. At first I thought it was a memory leak in our code, but on further investigation I think the increase in memory usage was because pipeline was maxing out the CPU so badly that the garbage collector was fighting for cycles. A stripped-down test case showed the CPU hitting 99-100% for some fairly trivial stream processing, with memory staying constant.
Here's the test case:
var _ = require('highland');

var bigArray = [];
for (var i = 0; i < 100000; i++) {
    bigArray[i] = i;
}

var s = _(bigArray);

function addOne(a) {
    return a + 1;
}

function bigAdd(a) {
    for (var i = 0; i < 500; i++) {
        a++;
    }
    return a;
}

function bigSubtract(a) {
    for (var i = 0; i < 100; i++) {
        a--;
    }
    return a;
}

console.time('timer');
s.through(_.pipeline(_.map(addOne), _.map(bigAdd), _.map(bigSubtract))).toArray(function () {
    console.timeEnd('timer');
});

This takes nearly 40 seconds to run on my machine, whereas this only takes 142ms:
console.time('timer');
s.through(_.map(addOne)).through(_.map(bigAdd)).through(_.map(bigSubtract)).toArray(function () {
    console.timeEnd('timer');
});
If I increase the size of bigArray to 1000000 then the first version takes so long that I have to kill it (compared to 1.3 seconds for the other version).
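For a rough sense of the raw computation cost, here's the same three transforms run over a plain array with Array.prototype.map (no streams involved at all). This is just an illustrative baseline, not part of the original benchmark; timings will vary by machine:

```javascript
// Plain-array baseline: the same three transforms chained with
// Array.prototype.map, to gauge pure computation cost without streams.
var bigArray = [];
for (var i = 0; i < 100000; i++) {
    bigArray[i] = i;
}

function addOne(a) { return a + 1; }
function bigAdd(a) { for (var i = 0; i < 500; i++) a++; return a; }
function bigSubtract(a) { for (var i = 0; i < 100; i++) a--; return a; }

console.time('baseline');
var result = bigArray.map(addOne).map(bigAdd).map(bigSubtract);
console.timeEnd('baseline');
// Each element ends up at i + 1 + 500 - 100 = i + 401.
```

Any stream overhead on top of this baseline should be roughly constant per element, so a 40-second run for 100k elements points at something pathological rather than inherent cost.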
The problem seems to be in the wrapper part of the pipeline function:
var wrapper = _(function (push, next) {
    end.pull(function (err, x) {
        if (err) {
            wrapper._send(err);
            next();
        }
        else if (x === nil) {
            wrapper._send(null, nil);
        }
        else {
            wrapper._send(null, x);
            next();
        }
    });
});

wrapper.write = function (x) {
    start.write(x);
};
The thing that's stumping me is that neither pull nor _send really seems to do very much, so I'm not sure how they can be leading to such a massive jump in CPU use. I want to take a stab at fixing this, but I'll need a couple of pointers to get me started.
Looks like pull actually sucks quite a bit in 2.x. I always suspected this (but never got around to verifying), and it was one of the reasons why I wanted to rewrite the engine.
I ran your code against 3.0.0 and got
pipeline: 342ms
direct: 182ms
I suspect it's because pull is implemented using consume, which creates a new stream object. This happens per element pushed, so it could be causing the GC to freak out. Maybe you could try to re-implement pipeline using consume somehow? You'll have to be careful to preserve laziness.
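To make the allocation pattern concrete, here's a deliberately simplified sketch (not Highland's actual code; `makeSource`, `consume`, and the `undefined`-as-nil sentinel are all stand-ins) of a pull built on consume, where every single pull allocates a fresh consumer object that is immediately discarded:

```javascript
// Simplified stand-in for a stream whose pull() is built on consume().
function makeSource(arr) {
    var i = 0;
    return {
        consume: function (fn) {
            // Each consume() call allocates a brand-new consumer object.
            return {
                resume: function () {
                    if (i < arr.length) fn(null, arr[i++]);
                    else fn(null, undefined); // stand-in for the nil sentinel
                }
            };
        },
        pull: function (cb) {
            // pull is implemented via consume: one throwaway object per element.
            var s = this.consume(function (err, x) { cb(err, x); });
            s.resume();
        }
    };
}

var src = makeSource([1, 2, 3]);
var out = [];
(function loop() {
    src.pull(function (err, x) {
        if (x === undefined) return;
        out.push(x);
        loop();
    });
})();
// out collects [1, 2, 3], but three short-lived consumer objects
// (plus their closures) were allocated along the way.
```

Scaled to a million elements, that's a million short-lived objects churning through the young generation, which would explain both the constant memory (they're collectible) and the pegged CPU (the collector never stops working).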
Ideally, we'd move to 3.0.0 sooner rather than later.
Yeah, I'd seen that new stream being created; I didn't expect it to have that big an impact, but thinking about it, making 1M objects that are basically discarded is obviously going to have a huge effect. I'll take a look into refactoring pipeline. Agreed about 3.0, but we've got to get 2.5 out first! :laughing: