node-cassandra-cql
really awful performance when using `executeAsPrepared` and running many single inserts
Got a single cassandra db running on another server. I'm inserting into it with a script that looks like:
```javascript
var placeholders = values.map(function () {
  return '?';
}).join(', ');
var colAndValues = '(' + this.fieldsWithValue.join(',') + ') values (' + placeholders + ')';
var statement = 'INSERT INTO ' + this.table + ' ' + colAndValues + ';';
var emitter = this;
this.dbConnection.executeAsPrepared(statement, values, cql.types.consistencies.one, function (err) {
  if (err) {
    // emit error event
  } else {
    // emit complete event
  }
});
```
OK, that all generates something that looks like:

statement:

```sql
INSERT INTO data (order_id,order_ts,transaction_id,transaction_discount,transaction_qty,product_category,product_profit,product_upc,product_name,product_price,product_distributor,store_id,store_name,store_state,store_region,id) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);
```

values (as JSON):

```json
["9QF7XXDK6T","2014-06-06T16:13:43.000Z","8WRXR5YMTJ",{"value":0,"hint":7},3,"Toys and Video Games",4.54,"TH2677D893","Lego Star Wars BWing",17.92,"AS6","W25V2CCVH","Sporket Boise","ID","Mountain",{"value":"a5107054-ee97-4434-a6f0-354215afc27c","hint":12}]
```
My time on 8 inserts is about 1/10th of a second. My time on 40-50 inserts is 12 seconds. My time for 3-4k inserts is so bad I have to kill the process. I've got very similar code that inserts to mongo and I can get 1000s of inserts per second.
My first guess is that `executeAsPrepared` is the wrong choice for this sort of thing. But wow, even then, this is really, really bad. Do I need to provide hints for all the fields so I can skip the "guess the encoding" step?
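The values above already show the driver's `{value, hint}` wrapper on two fields (`hint: 7` and `hint: 12`), so if type guessing is the bottleneck, every parameter could be wrapped the same way. A minimal sketch; the numeric codes below are the CQL binary-protocol type codes, hard-coded here so the snippet stands alone, and in real code the driver's own `cql.types.dataTypes` constants would be the safer choice:

```javascript
// CQL binary-protocol type codes (hard-coded for self-containment;
// prefer cql.types.dataTypes in real code).
var dataTypes = { double: 7, text: 10, uuid: 12 };

// Wrap a parameter with an explicit hint so the driver can skip
// its type-guessing step.
function withHint(value, hint) {
  return { value: value, hint: hint };
}

var hintedValues = [
  withHint(4.54, dataTypes.double), // e.g. product_profit
  withHint('Lego Star Wars BWing', dataTypes.text), // e.g. product_name
  withHint('a5107054-ee97-4434-a6f0-354215afc27c', dataTypes.uuid) // e.g. id
];
```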
I'm getting similar (but better) results from `.execute`. Inserting 400-600 records is taking about 5 minutes so far. I won't be able to use this lib if this is the kind of performance I'm getting.
I should clarify how this process works. I've got code that generates dummy sales data, so it'll generate *n* "orders", and each of those will contain 1-10 "transactions" (basically a swipe over a product scanner at the market). Building those objects takes under a second. Then I roll through the list of objects and push them to the db. Orders by definition come in on a stream of sorts, so batching them is out.
I'll investigate the server a bit, perhaps there's an issue there. But I suspect I'll swap my cassandra insert for a mongo insert and see far far better performance. Something is up with this library.
And the result of letting that 400+ record insert just run:

```
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory
```
I must be doing something wrong, right?
It looks like you are doing something wrong. As a reference point, the driver should take a few microseconds to create a request and another few microseconds to decode the data, and Cassandra should take from microseconds to a couple of milliseconds per request.
Also, the Node.js out-of-memory error is another pointer that something is not right. You may be running a synchronous loop over async operations:

```javascript
while (flag) {
  // this is an out-of-control sync loop over an async operation
  execute(param, callback);
}
```
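The usual fix for that pattern is to let each callback trigger the next request, so only one is in flight at a time. A minimal sketch, with a stubbed `execute` standing in for the real driver call:

```javascript
// Stand-in for an async driver call such as executeAsPrepared.
function execute(param, callback) {
  setImmediate(function () { callback(null, param); });
}

// Drive the next call from the previous callback instead of a sync loop.
function insertSeries(items, done) {
  var i = 0;
  var results = [];
  function next(err, result) {
    if (err) return done(err);
    if (result !== undefined) results.push(result);
    if (i === items.length) return done(null, results);
    execute(items[i++], next);
  }
  next(null);
}
```

With this shape the memory footprint stays flat no matter how many items are queued, because nothing is issued until the previous request has finished.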
I'm not using any sort of loop, it's all event driven. I've got one object that receives a message over ZMQ, builds an object from it and does some manipulation of the object. It then emits a 'complete' event and the manipulated object. There's another object (the cassandra inserter) that is listening for that 'complete' event and takes the object and inserts it. I'm trying to find how fast it will go, so I'm just letting it spin. I had a mongodb inserter in the spot that the cassandra insert is now and I was getting 1k-1.5k inserts per second.
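One thing worth checking in an event-driven setup like that: even without a loop, ZMQ can deliver messages far faster than inserts complete, so in-flight requests (and their buffered rows) pile up without bound, which would also explain the out-of-memory crash. A sketch of a simple in-flight counter with a queue; the names here are made up, and `insert` stands in for the real `executeAsPrepared` call:

```javascript
// Cap the number of concurrent inserts; queue anything beyond the limit.
function BoundedInserter(insert, limit) {
  this.insert = insert;   // async fn (row, callback)
  this.limit = limit;     // max concurrent inserts
  this.inFlight = 0;
  this.queue = [];
}

BoundedInserter.prototype.push = function (row, done) {
  this.queue.push({ row: row, done: done });
  this.drain();
};

// Start queued inserts until the concurrency limit is reached.
BoundedInserter.prototype.drain = function () {
  while (this.inFlight < this.limit && this.queue.length) {
    this.startOne(this.queue.shift());
  }
};

BoundedInserter.prototype.startOne = function (item) {
  var self = this;
  this.inFlight++;
  this.insert(item.row, function (err) {
    self.inFlight--;
    if (item.done) item.done(err);
    self.drain(); // a finished insert frees a slot for the next one
  });
};
```

The 'complete' listener would call `push` instead of calling the driver directly, so bursts from the ZMQ side get queued instead of all hitting the pool at once.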
Some questions:
- Are you reusing the same client instance (you should use 1)?
- Are you warming up the pool by using client.connect method?
- Have you enabled logging on the driver to see what is going on?
- Is the generated query always the same (if not, you should not use prepared statements)?
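On the first two questions, the single shared client can be made hard to get wrong with a tiny holder; the commented-out `Client` options and `connect` usage below follow the driver's README and are assumptions, not verified here:

```javascript
// Minimal single-instance holder: every caller shares one client,
// and therefore one connection pool.
function makeSingleton(create) {
  var instance = null;
  return function get() {
    if (!instance) instance = create();
    return instance;
  };
}

// Hypothetical usage with the real driver (option names assumed):
// var cql = require('node-cassandra-cql');
// var getClient = makeSingleton(function () {
//   return new cql.Client({ hosts: ['cassandra-host:9042'], keyspace: 'sales' });
// });
// getClient().connect(function (err) {
//   // pool is warmed up; start inserting here
// });
```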